Physics of Language Models 4.2: LlamaCanon Release

Our paper, Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers, uses a synthetic pretraining playground to demonstrate that the Canon layer is a powerful architectural add-on that improves language model performance on multiple fronts, possibly for every architecture (the original Transformer as well as linear models).

In this release, we provide code and pre-trained models to showcase how these findings extend to real-world pretraining. Specifically, we compare the vanilla Llama architecture with our modified LlamaCanon variant, both pretrained under the same controlled settings.

Figure 1: Quick illustration of performance vs. model size/training time.

✨Highlights of the Release

  1. Broad Model Availability: We release 16 base models (1B, 3B, and 8B) pretrained on the open-sourced Nemotron-CC dataset for 1T or 2T tokens.
  2. Controlled Experiment: In each setting, we pretrain two versions of LlamaCanon (using two learning rates) and compare them against two corresponding versions of the original Llama pretrained with identical hyperparameters. This ensures a rigorous architectural comparison.
  3. Performance Gain: LlamaCanon consistently surpasses Llama in all eight controlled comparisons, achieving, for instance, a 2% gain in the MMLU benchmark.
  4. Comparison to Open Models: Our experiments are benchmarked against open-sourced models trained on similar datasets, ensuring that we study a realistic pretraining setup rather than an artificial scenario.

⚙️Model Configurations

A quick summary of the 16 models we release along with their parameters can be seen below:

Figure 2: Names and parameters of the released models.

📊Performance Metrics

The table below illustrates how LlamaCanon performs in comparison to the vanilla Llama models, as well as to several open-sourced pretrained models that serve as reference points.

Figure 3: Cross-benchmark performance evaluation of the released models.

📈Training Curves

To further showcase the advantage of Canon layers throughout the entire pretraining process, we provide detailed training-time performance curves. Interactive versions and additional benchmark metrics are available in our GitHub repository.

Figure 4: MMLU accuracy vs. training tokens.

📌Model Details

  • Model Type: Llama Transformer + LlamaCanon Transformer
  • Language: English
  • License: Apache 2.0
  • Type: Base model without any instruction fine-tuning or post-training.
  • Context length: 4096 tokens (+ ~50% for LlamaCanon).
    • Note: The models were pretrained with context length 4096. However, unlike traditional RoPE transformers, LlamaCanon demonstrates strong length generalization, extending to ~50% more tokens (as detailed in our paper). While long-context fine-tuning could further enhance this capability, we have deliberately avoided it to maintain a clean and controlled comparison of base-model pretraining, highlighting the effectiveness of Canon layers.
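As an illustration of the context-length note above, here is a minimal sketch (ours, not part of the official release) of how one might probe length generalization beyond the 4096-token pretraining length. The loading calls mirror the demo further below; the repeated filler prompt and the assumption that the released config accepts inputs longer than 4096 tokens out of the box are ours, and the paper remains the reference for the actual long-context evaluation.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal sketch: probe generation on a prompt longer than the 4096-token
# pretraining context, relying on the length generalization (~50% more tokens)
# reported for LlamaCanon in the paper.
model_name = "facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).cuda()

# Build a prompt of roughly 5,000 tokens (the exact count depends on the tokenizer).
long_text = "Galileo Galilei climbed the Leaning Tower of Pisa. " * 400
input_ids = tokenizer(long_text, return_tensors="pt").input_ids.cuda()
print("prompt length in tokens:", input_ids.shape[1])

output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))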

🧩Installation and Dependencies

It is highly recommended to pip install causal-conv1d for CUDA efficiency, as our implementation of Canon layers relies on depth-wise conv1d. The code is tested with transformers==4.47.1 and 4.53.3 but should be compatible with many earlier versions. Ensure you enable trust_remote_code=True to download the architecture code automatically.
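As a quick sanity check before loading any model, the following minimal sketch (ours, not part of the release) verifies the environment: it prints the installed transformers version and checks whether the optional causal-conv1d package is importable (the pip package causal-conv1d installs the Python module causal_conv1d).

import transformers

print("transformers version:", transformers.__version__)  # tested with 4.47.1 and 4.53.3

# Check for the optional fused depth-wise conv1d CUDA kernel used by Canon layers.
try:
    import causal_conv1d  # noqa: F401
    print("causal-conv1d found: the fused CUDA kernel can be used.")
except ImportError:
    print("causal-conv1d not found: consider `pip install causal-conv1d` for CUDA efficiency.")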

▶️Demo

The following sample demonstrates how to use our pre-trained models:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Choose any of our 16 released models
# model_name = "facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003"
model_name = "facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005"
# model_name = "facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003"

# The tokenizer below is simply a wrapper around either the Llama2 tokenizer
#   (for <=3B models) or the Llama3 tokenizer (for 8B models); alternatively,
#   you can download your own Huggingface Llama2/3 tokenizer and use it instead
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).cuda()

input_text = "Galileo Galilei climbed the Leaning Tower of Pisa to conduct a controlled experiment."
inputs = tokenizer(input_text, return_tensors="pt")
output_ids = model.generate(inputs['input_ids'].cuda(), max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
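For the larger checkpoints (e.g., the 8B variants), loading in half precision roughly halves GPU memory compared to the default float32 load. A hedged variant of the loading call, assuming a bfloat16-capable GPU:

import torch
from transformers import AutoModelForCausalLM

# Load an 8B checkpoint in bfloat16 to reduce GPU memory (assumes bf16 support).
model = AutoModelForCausalLM.from_pretrained(
    "facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()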

⚠️Bias, Risks, and Limitations

The models are released for research purposes only (mainly for controlled experiments comparing Llama and LlamaCanon) and are not intended for applications requiring high factual accuracy, safety-critical use cases, or medical/health contexts. The models were pretrained on open datasets and are not safety- or alignment-tuned, meaning:

  • They may generate content that is factually incorrect, biased, harmful, or offensive.
  • Outputs may include objectionable content even if such outcomes weren't explicitly intended.
  • Users are responsible for ensuring appropriate evaluation and implementing additional filtering or safety mechanisms suitable for their specific use cases.

📖Citation

Please cite the following if you use our models or findings in your research:

@article{Allenzhu2025-canon,
  author = {{Allen-Zhu}, Zeyuan},
  title = {{Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers}},
  year = {2025},
  month = {May},
  journal = {SSRN Electronic Journal},
  note = {\url{https://ssrn.com/abstract=5240330}}
}

Note: A technical report for this release will appear under Physics of Language Models: Part 4.2. Until then, please cite the above paper. Thank you!

Additional Resources

  • GitHub Repository: full training recipes, model configurations, and interactive plots (on all benchmarks).

Model Card Author

  • Zeyuan Allen-Zhu