MobileLLM-Pro Model Card
We are introducing MobileLLM-Pro (P1 in the tables below), a 1B-parameter foundational language model in the MobileLLM series, designed to deliver high-quality, efficient on-device inference across a wide range of general language modeling tasks.
We open-source two variants of the model: a pre-trained base model, along with quantized checkpoints for CPU and accelerator inference, and an instruction-tuned version that shows competitive performance against models in this size range on tasks like tool calling, question answering, rewriting, and summarization.
Key Features
- Strong Pre-training Performance: The MobileLLM-Pro base model achieves impressive pre-training results, outperforming Gemma 3 1B and Llama 3.2 1B by 5.7% and 7.9% on average, respectively, on reasoning, knowledge, and long-context retrieval benchmarks. This performance is achieved by pre-training on fewer than 2T fully open-source tokens.
- 128k Context Window: The model supports contexts of up to 128k tokens, enabling long-context understanding for applications such as document summarization and information retrieval, with this capability implicitly learned from a large teacher model.
- Efficient Long-Context Inference: By interleaving local and global attention layers at a 3:1 ratio with a 512-token local attention window, MobileLLM-Pro reduces prefill latency by 1.8x* and lowers the KV cache size from 117MB to 40MB* compared to fully global attention, enabling faster and more memory-efficient inference. (*Assuming 8k context length)
- Near Lossless int4 Quantization: We provide int4 quantization-ready checkpoints for our pre-trained model with less than 1.3% quality degradation compared to floating point baselines:
- CPU: int4 weights (group size 32), int8 dynamic activations, int8 KV cache, with only 0.4% regression.
- Accelerators: int4 per-channel weights, with only 1.3% quality regression.
- Instruction Fine-Tuned Model: We provide a competitive instruction fine-tuned (IFT) model specializing in use-cases such as tool calling, question answering, rewriting and summarization.
MobileLLM-Pro sets a new standard for efficient, high-quality on-device language modeling. We invite the community to explore, evaluate, and build upon this model.
Model Information
Layers: 30
Attention Heads: 20
KV Heads: 4
Dimension: 1280
Hidden Dimension: 6144
Vocabulary Size: 202,048
Total Parameters: 1,084M (1.08B)
Input Modality: Text
Output Modality: Text
Languages: English
Training Method: Knowledge Distillation
Context Length: 128k tokens
Teacher Model: Llama 4-Scout
Loss Function: KL Divergence
Quantization: 16-bit, 4-bit
Other Features: Shared Embeddings, Local-Global Attention
Model Developer: Meta Reality Labs
Model Release Date: October 2025
License: MobileLLM-Pro is FAIR NC licensed
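As a quick sanity check, the 1,084M total can be roughly reproduced from the hyperparameters above. The sketch below assumes a SwiGLU-style feed-forward block with three projection matrices, tied (shared) input/output embeddings as noted under Other Features, and head_dim = dim / attention heads, while ignoring normalization parameters; these structural details are assumptions for illustration, not specifications taken from this card.

```python
# Rough parameter-count estimate from the hyperparameters listed above.
# Assumptions (not stated explicitly in this card): SwiGLU-style FFN with
# three projection matrices, tied (shared) input/output embeddings,
# head_dim = dim / n_heads, and normalization/bias parameters ignored.
n_layers, n_heads, n_kv_heads = 30, 20, 4
dim, hidden_dim, vocab_size = 1280, 6144, 202_048
head_dim = dim // n_heads  # 64

attn = dim * dim * 2 + dim * n_kv_heads * head_dim * 2   # Q, O + K, V projections
ffn = 3 * dim * hidden_dim                                # gate, up, down projections
embed = vocab_size * dim                                  # shared embedding table

total = n_layers * (attn + ffn) + embed
print(f"~{total / 1e6:.0f}M parameters")  # ~1084M, in line with the 1.08B figure above
```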
Results
Base Pretrained Model
Benchmark | P1 (FP) | P1 (Q-CPU) | P1 (Q-Acc) | Gemma 3 1B | Llama 3.2 1B |
---|---|---|---|---|---|
HellaSwag | 67.11% | 64.89% | 65.10% | 62.30% | 65.69% |
BoolQ | 76.24% | 77.49% | 76.36% | 63.20% | 62.51% |
PIQA | 76.55% | 76.66% | 75.52% | 73.80% | 75.14% |
SocialIQA | 50.87% | 51.18% | 50.05% | 48.90% | 45.60% |
TriviaQA | 39.85% | 37.26% | 36.42% | 39.80% | 23.81% |
NatQ | 15.76% | 15.43% | 13.19% | 9.48% | 5.48% |
ARC-c | 52.62% | 52.45% | 51.24% | 38.40% | 38.28% |
ARC-e | 76.28% | 76.58% | 75.73% | 73.00% | 63.47% |
WinoGrande | 62.83% | 62.43% | 61.96% | 58.20% | 61.09% |
OBQA | 43.60% | 44.20% | 40.40% | 37.20% | |
NIH | 100.00% | 96.44% | 98.67% | | |
FP = Full precision, bf16
Q-CPU = int4, group-wise quantized (for CPU)
Q-Acc = int4, channel-wise quantized (for accelerators: ANE & HTP)
Instruction Tuned Model
Benchmark | P1 (IFT) | Gemma 3 1B (IFT) | Llama 3.2 1B (IFT) |
---|---|---|---|
MMLU | 44.8% | 29.9% | 49.3% |
IFEval | 62.0% | 80.2% | 59.5% |
MBPP | 46.8% | 35.2% | 39.6% |
HumanEval | 59.8% | 41.5% | 37.8% |
ARC-C | 62.7% | | 59.4% |
HellaSwag | 58.4% | | 41.2% |
BFCL v2 | 29.4% | | 25.7% |
Open Rewrite | 51.0% | | 41.6% |
TLDR9+ | 16.8% | | 16.8% |
Training Data
We constructed our datamix by selecting publicly available datasets that cover a range of domains. Each dataset's contribution to training was balanced by assigning it a specific sampling weight, informed by data-specific simulation runs, the extended work of Automixer, and additional ablation studies; these weights remained fixed throughout base-model pretraining.
The pre-training datamix primarily consists of a large educational web dataset, which makes up the vast majority of the training data. Smaller but significant portions come from coding data, mathematics, Wikipedia, scientific papers, Q&A forums, and algebraic content. In total, the datamix includes approximately 1,500 million rows and 1,640 billion tokens.
For our instruction fine-tuning datamix, we focus on data diversity drawn from existing open-source fine-tuning corpora. Specifically, we combine datasets for general instruction tuning with chat, science, safety, coding, and math domains. For the final DPO phase, we rely on completely synthetic datasets.
Training Process
Pretraining
Our general pre-training process consists of three distinct phases, using logit-based knowledge distillation from the Llama 4-Scout teacher model and a novel model-merging paradigm:
- Phase 1 (KD): Language Learning – Learn general language skills from high-quality, well-balanced pre-training data
- Phase 2 (KD): Long-context awareness – Extend the model's context length to 128k tokens using implicit positional distillation from the teacher model
- Phase 3 (KD): Domain abilities – Acquire domain understanding by annealing multiple models in parallel and merging the specialist models, resulting in improvements across a diverse range of domains
On top of these three pre-training phases, we add a fourth phase of Quantization-Aware Training (QAT) for our 4-bit quantized model checkpoints.
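For readers unfamiliar with logit-based distillation, the snippet below sketches the KL-divergence objective referenced above: the student is trained to match the teacher's token-level output distribution. It is a generic illustration; the temperature, masking, and any mixing with a standard cross-entropy term are assumptions rather than details taken from this card.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """Token-level KL divergence between teacher and student distributions.

    Generic logit-based knowledge-distillation objective; the temperature and
    reduction choices here are illustrative assumptions.
    Shapes: (batch, seq_len, vocab_size).
    """
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student); "batchmean" sums over tokens and averages over the batch.
    kl = F.kl_div(s_log_probs, t_log_probs, log_target=True, reduction="batchmean")
    return kl * temperature**2

# Example with random logits standing in for teacher/student outputs.
student = torch.randn(2, 16, 202_048)
teacher = torch.randn(2, 16, 202_048)
loss = kd_loss(student, teacher)
```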
Instruction Fine-Tuning
We split the instruction fine-tuning stage into three distinct phases combining SFT and DPO methods:
- Phase 1 (SFT): Learn general instruction-following with a focus on data diversity
- Phase 2 (SFT): Re-weight the Phase 1 data by domain to address its shortcomings (e.g., upsampling code data to improve logical reasoning)
- Phase 3 (SFT + DPO): Train and align the model for safety and self-identification
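The DPO step in Phase 3 optimizes the standard direct-preference objective over chosen/rejected response pairs. The sketch below shows that loss in its usual form; the beta value and the use of summed token log-probabilities are illustrative assumptions, not disclosed training settings.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO objective on sequence-level log-probabilities.

    Each input is a tensor of shape (batch,) holding the summed token
    log-probability of the chosen / rejected response under the policy or the
    frozen reference model. beta is an illustrative default, not a disclosed setting.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with dummy log-probabilities.
b = torch.zeros(4)
loss = dpo_loss(b - 1.0, b - 2.0, b - 1.5, b - 1.5)
```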
Quantization
We apply Quantization-Aware Training (QAT) to our base and instruction fine-tuned models, yielding quantization-ready checkpoints that can either be converted directly to integer datatypes (with minimal quality loss) or used for QAT on additional data. We release two quantization-ready checkpoints:
- 4-bit groupwise weight quantization with block size 32, 8-bit dynamic activations, and 8-bit KV-cache quantization, optimized for CPU/GPU backends (XNNPACK).
- 4-bit channelwise weight quantization without activation quantization, plus 8-bit KV-cache quantization, designed for edge hardware accelerators such as the Apple Neural Engine (ANE) and Qualcomm's Hexagon Tensor Processor (HTP).
Our QAT approach incorporates long-context awareness (up to 128k tokens) and self-knowledge distillation using the full-precision model as the teacher. We compared the QAT-trained model to a standard round-to-nearest Post-Training Quantization (PTQ) baseline: in the groupwise pre-training setting, we observe a 34% (absolute) regression in average benchmark score with PTQ, but only a 1.5% (absolute) regression with QAT. For instruction fine-tuning, we observe less than 1% average regression using QAT.
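For reference, the round-to-nearest PTQ baseline mentioned above can be approximated with torchao's post-training API directly, with no quantization-aware training. The sketch below applies the same int4 groupwise weight / int8 dynamic activation configuration used for the CPU checkpoint in a single shot; the exact evaluation pipeline behind the reported regressions is not part of this card.

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, Int8DynamicActivationIntxWeightConfig
from torchao.quantization.granularity import PerGroup

# Round-to-nearest PTQ baseline: quantize the full-precision weights in one
# shot, without any quantization-aware training.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/MobileLLM-Pro", trust_remote_code=True, subfolder="base"
)
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    ),
)
# Evaluate this model against the QAT checkpoints to measure the PTQ-vs-QAT gap.
```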
How to use
Full precision:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

login(token="<HF_TOKEN>")

MODEL_ID = "facebook/MobileLLM-Pro"


def generate(user_input: str, model, tokenizer, chat: bool) -> str:
    if chat:
        user_input = [{"role": "user", "content": user_input}]
        inputs = tokenizer.apply_chat_template(
            user_input, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)
    else:
        inputs = tokenizer(user_input, return_tensors="pt")["input_ids"].to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def main():
    version = "instruct"  # "base" | "instruct"
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    prompt = "Why are open-source on-device language models great?"
    result = generate(prompt, model, tokenizer, chat=(version == "instruct"))
    print(result)


if __name__ == "__main__":
    main()
```
Quantize Checkpoints
4-bit Groupwise Quantization
```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    QATConfig,
    IntxFakeQuantizeConfig,
)

model_id = "facebook/MobileLLM-Pro"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# Prepare for QAT.
# 8-bit dynamic per-token quantization for activations
activation_config = IntxFakeQuantizeConfig(
    torch.int8, "per_token", is_symmetric=False,
)
# 4-bit symmetric weight quantization with group size 32
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# Also fake-quantize the (shared) embedding table, weights only.
embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
embedding_qat_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
quantize_(
    model,
    QATConfig(
        weight_config=embedding_qat_config,
        step="prepare",
    ),
    embedding_filter_fn,
)

# The model is now ready for Quantization-Aware Training (QAT)
# trainer.train()
model.save_pretrained(
    save_directory="<QAT_save_directory>",
    safe_serialization=False,
)

# Convert the model after training
from torchao.quantization import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
)
from torchao.quantization.granularity import PerGroup

qat_convert_config = QATConfig(
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    ),
    step="convert",
)
quantize_(model, qat_convert_config)

embedding_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerGroup(32),
)
quantize_(
    model,
    QATConfig(
        embedding_convert_config,
        step="convert",
    ),
    embedding_filter_fn,
)

# Save the model after conversion
model.save_pretrained(
    save_directory="<quantized_model_directory>",
    safe_serialization=False,
)
```
4-bit Channelwise Quantization
```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis
from torchao.quantization.qat import (
    initialize_fake_quantizers,
    IntxFakeQuantizeConfig,
    QATConfig,
)

model_id = "facebook/MobileLLM-Pro"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# 4-bit per-channel weight quantization with range_learning=True
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    granularity=PerAxis(0),
    is_symmetric=True,
    is_dynamic=False,
    range_learning=True,
)
qat_config = QATConfig(
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# Apply the same weight-only config to the (shared) embedding table.
embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
quantize_(model, qat_config, embedding_filter_fn)

# Initialize the fake quantizers for range learning
example_inputs = (torch.tensor([[1]], dtype=torch.long),)
initialize_fake_quantizers(model, example_inputs)

# The model is now ready for Quantization-Aware Training (QAT)
# trainer.train()
model.save_pretrained(
    save_directory="<QAT_save_directory>",
    safe_serialization=False,
)

# Convert the model after training
from torchao.quantization import IntxWeightOnlyConfig

wt_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerAxis(0),
)
qat_convert_config = QATConfig(
    wt_convert_config,
    step="convert",
)
quantize_(model, qat_convert_config)
quantize_(model, qat_convert_config, embedding_filter_fn)

# Save the model after conversion
model.save_pretrained(
    save_directory="<quantized_model_directory>",
    safe_serialization=False,
)
```
Latency benchmarking
Latency benchmarking was done on a Samsung Galaxy S25 CPU and a Samsung Galaxy S24 Hexagon Tensor Processor (HTP). Models were exported to ExecuTorch with the XNNPACK backend (for CPU) and the HTP backend (for the accelerator); a generic export sketch is shown after the table below. The 4-bit groupwise-quantized CPU model is 590MB. CPU and HTP prefill latencies for input prompt lengths of 2k, 4k, and 8k, along with decode speed when generating 1k tokens, are shown in the following table.
Model / Prompt length | 2k | 4k | 8k |
---|---|---|---|
CPU Prefill Latency (s) | 8.9 | 24.8 | 63.5 |
CPU Decode Speed (tok/s) | 33.6 | 24.8 | 19.7 |
HTP Prefill Latency (s) | 1.96 | 3.38 | 9.82 |
HTP Decode Speed (tok/s) | 31.60 | 28.95 | 22.77 |
KV Cache Size (MB) | 14 | 23 | 40 |
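As a rough guide to how such on-device measurements are obtained, the sketch below lowers a PyTorch module to an ExecuTorch .pte program with the XNNPACK (CPU) backend. It uses a small stand-in module rather than the full model: the real LLM export additionally handles the KV cache, custom attention kernels, and dynamic sequence lengths, so treat this as an outline of the flow rather than the exact benchmark recipe.

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Stand-in module: the real flow exports the (quantized) MobileLLM-Pro graph.
class TinyBlock(torch.nn.Module):
    def __init__(self, dim: int = 1280, hidden: int = 6144):
        super().__init__()
        self.up = torch.nn.Linear(dim, hidden)
        self.down = torch.nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(torch.nn.functional.silu(self.up(x)))

module = TinyBlock().eval()
example_inputs = (torch.randn(1, 64, 1280),)

# Export to an ATen graph, delegate supported ops to the XNNPACK (CPU) backend,
# and serialize the result as an ExecuTorch .pte program.
program = to_edge_transform_and_lower(
    torch.export.export(module, example_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("tiny_block_xnnpack.pte", "wb") as f:
    f.write(program.buffer)
```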
To validate the benefit of interleaved local-global attention (LGA), we benchmark models across different prompt lengths and measure the speed-up in prefill and decode relative to using global attention at every layer.
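The memory side of this trade-off can be estimated directly from the architecture. The sketch below assumes an int8 KV cache, a 512-token local window, and a split of 8 global and 22 local layers (one plausible reading of the roughly 3:1 interleave over 30 layers); under those assumptions it lands close to the KV cache sizes reported in the table above.

```python
# Back-of-envelope KV-cache size for interleaved local-global attention.
# Assumptions (not spelled out in this card): 8 global + 22 local layers,
# int8 KV cache (1 byte per value), head_dim = dim / n_heads = 64.
n_kv_heads, head_dim = 4, 64
n_global_layers, n_local_layers, local_window = 8, 22, 512
bytes_per_value = 1  # int8 KV cache

def kv_cache_mb(context_len: int) -> float:
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value  # keys + values
    global_part = n_global_layers * context_len * per_token
    local_part = n_local_layers * min(context_len, local_window) * per_token
    return (global_part + local_part) / 1e6

for ctx in (2048, 4096, 8192):
    print(f"{ctx:>5} tokens: ~{kv_cache_mb(ctx):.0f} MB")
# -> roughly 14, 23, 39 MB, in line with the table above; a fully global cache
#    at 8k tokens would need roughly 3x more under the same assumptions.
```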
Citation
```bibtex
@misc{mobilellm_pro,
  title = {MobileLLM-Pro Model Card},
  author = {Patrick Huber*, Ernie Chang*, Wei Wen*, Igor Fedorov*, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi**, Adithya Sagar**},
  year = {2025},
  month = {October},
  url = {https://huggingface.co/facebook/MobileLLM-Pro}
}
```
Contact
Patrick Huber, Meta Inc, Reality Labs ([email protected])
Ernie Chang, Meta Inc, Reality Labs ([email protected])
Wei Wen, Meta Inc, Reality Labs ([email protected])
Igor Fedorov, Meta Inc, Reality Labs ([email protected])
Raghuraman Krishnamoorthi, Meta Inc, Reality Labs ([email protected])
Adithya Sagar, Meta Inc, Reality Labs ([email protected])
Acknowledgements
We want to thank the team involved in this project, especially: Kimish Patel, Andrew Or, Min Guo, Shen Xu, Brian Moran, Maho Takahashi, Claire Lesage, Rylan Conway, Karan Chadha, Matthew Grange, Tomasz Wołcyrz, Shiv Desai, Amarlin Anand, Joele Sires, Robert Carrillo, Francisc Bungiu, Jayden Yu, AJ Brush, Yang Li, Samuel Selvan, Anand Sharma, Peng Shan, Anand Dass, Abhishek Sharma
License
MobileLLM-Pro is distributed under the FAIR NC license