
MobileLLM-P1 Model Card

We are introducing MobileLLM-Pro (also referred to as MobileLLM-P1), a 1B-parameter foundational language model in the MobileLLM series, designed to deliver high-quality, efficient on-device inference across a wide range of general language modeling tasks.
We open-source two variants of the model: a pre-trained base model with quantized checkpoints for CPU and accelerator inference, and an instruction-tuned version that shows competitive performance against models in this size range on tasks such as tool calling, question answering, rewriting and summarization.

🤗   Chat with MobileLLM-Pro

Key Features

  • Strong Pre-training Performance: MobileLLM-Pro base achieves impressive pre-training results, outperforming Gemma 3 1B and Llama 3.2 1B by an average of 5.7% and 7.9%, respectively, on reasoning, knowledge, and long-context retrieval benchmarks. This performance is achieved by pre-training on fewer than 2T fully open-source tokens.
  • 128k Context Window: The model supports up to 128k tokens, enabling long-context understanding for applications such as document summarization and information retrieval, implicitly learned from a large teacher model.
  • Efficient Long-Context Inference: By interleaving local and global attention layers at a 3:1 ratio with a 512-token local attention window, MobileLLM-Pro reduces prefill latency by 1.8x* and shrinks the KV cache from 117MB to 40MB* compared to fully global attention, enabling faster and more memory-efficient inference (a back-of-envelope sketch follows this list). (*Assuming 8k context length)
  • Near-Lossless int4 Quantization: We provide int4 quantization-ready checkpoints for our pre-trained model with less than 1.3% quality degradation compared to the floating-point baseline:
    • CPU: int4 weights (group size 32), int8 dynamic activations, int8 KV cache, with only 0.4% regression.
    • Accelerators: int4 per-channel weights, with only 1.3% quality regression.
  • Instruction Fine-Tuned Model: We provide a competitive instruction fine-tuned (IFT) model specializing in use-cases such as tool calling, question answering, rewriting and summarization.
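
To make the KV-cache numbers above concrete, here is a rough back-of-envelope estimate in Python. It assumes an int8 KV cache, a head dimension of 1280 / 20 = 64 with 4 KV heads, and 8 of the 30 layers using global attention at the 3:1 ratio; these are illustrative assumptions rather than the exact layout of the released model, but they land close to the reported 117MB and 40MB figures.

# Back-of-envelope KV-cache estimate for an 8k-token prompt.
# Assumptions (illustrative, not official): int8 KV cache, head_dim = 64,
# 8 of the 30 layers global at the 3:1 local:global ratio.
KV_HEADS, HEAD_DIM, LAYERS = 4, 64, 30
BYTES_PER_ELEM = 1             # int8 KV cache
CONTEXT = 8 * 1024             # 8k-token prompt
LOCAL_WINDOW = 512             # local attention window
GLOBAL_LAYERS = 8              # assumed number of global layers
LOCAL_LAYERS = LAYERS - GLOBAL_LAYERS

def kv_bytes(num_layers: int, cached_tokens: int) -> int:
    # Factor of 2 accounts for keys and values
    return 2 * num_layers * KV_HEADS * HEAD_DIM * cached_tokens * BYTES_PER_ELEM

fully_global = kv_bytes(LAYERS, CONTEXT)
local_global = kv_bytes(GLOBAL_LAYERS, CONTEXT) + kv_bytes(LOCAL_LAYERS, LOCAL_WINDOW)
print(f"fully global attention: {fully_global / 2**20:.0f} MB")   # ~120 MB
print(f"local-global attention: {local_global / 2**20:.0f} MB")   # ~38 MB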

MobileLLM-Pro sets a new standard for efficient, high-quality on-device language modeling. We invite the community to explore, evaluate, and build upon this model.

Model Information

Layers: 30
Attention Heads: 20
KV Heads: 4
Dimension: 1280
Hidden Dimension: 6144
Vocabulary Size: 202,048
Total Parameters: 1,084M (1.08B)

Input Modality: Text
Output Modality: Text
Languages: English

Training Method: Knowledge Distillation
Context Length: 128k tokens
Teacher Model: Llama 4 Scout
Loss Function: KL Divergence
Quantization: 16-bit, 4-bit
Other Features: Shared Embeddings, Local-Global Attention

Model Developer: Meta Reality Labs
Model Release Date: October 2025
License: MobileLLM-Pro is FAIR NC licensed
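
As a sanity check on the configuration above, the stated total of 1,084M parameters can be approximately reproduced from the architecture details. The short calculation below assumes a gated SwiGLU-style feed-forward block, no biases, a head dimension of 1280 / 20 = 64, and a single shared input/output embedding table; these are assumptions for illustration, not official implementation details.

# Approximate parameter count from the listed configuration.
# Assumptions: gated SwiGLU-style MLP, no biases, shared embeddings.
VOCAB, DIM, HIDDEN, LAYERS = 202_048, 1280, 6144, 30
HEADS, KV_HEADS = 20, 4
HEAD_DIM = DIM // HEADS                       # 64

embeddings = VOCAB * DIM                      # shared embedding table, counted once
attention = 2 * DIM * HEADS * HEAD_DIM        # query and output projections
attention += 2 * DIM * KV_HEADS * HEAD_DIM    # key and value projections (GQA)
mlp = 3 * DIM * HIDDEN                        # gate, up and down projections
per_layer = attention + mlp

total = embeddings + LAYERS * per_layer
print(f"~{total / 1e6:.0f}M parameters")      # ~1084M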

Results

Base Pretrained Model

| Benchmark | P1 (FP) | P1 (Q-CPU) | P1 (Q-Acc) | Gemma 3 1B | Llama 3.2 1B |
|---|---|---|---|---|---|
| HellaSwag | 67.11% | 64.89% | 65.10% | 62.30% | 65.69% |
| BoolQ | 76.24% | 77.49% | 76.36% | 63.20% | 62.51% |
| PIQA | 76.55% | 76.66% | 75.52% | 73.80% | 75.14% |
| SocialIQA | 50.87% | 51.18% | 50.05% | 48.90% | 45.60% |
| TriviaQA | 39.85% | 37.26% | 36.42% | 39.80% | 23.81% |
| NatQ | 15.76% | 15.43% | 13.19% | 9.48% | 5.48% |
| ARC-c | 52.62% | 52.45% | 51.24% | 38.40% | 38.28% |
| ARC-e | 76.28% | 76.58% | 75.73% | 73.00% | 63.47% |
| WinoGrande | 62.83% | 62.43% | 61.96% | 58.20% | 61.09% |
| OBQA | 43.60% | 44.20% | 40.40% | – | 37.20% |
| NIH | 100.00% | 96.44% | 98.67% | – | – |

FP = full precision (bf16)
Q-CPU = int4, group-wise quantized (for CPU)
Q-Acc = int4, channel-wise quantized (for accelerators: ANE & HTP)

Instruction Tuned Model

| Benchmark | P1 (IFT) | Gemma 3 1B (IFT) | Llama 3.2 1B (IFT) |
|---|---|---|---|
| MMLU | 44.8% | 29.9% | 49.3% |
| IFEval | 62.0% | 80.2% | 59.5% |
| MBPP | 46.8% | 35.2% | 39.6% |
| HumanEval | 59.8% | 41.5% | 37.8% |
| ARC-C | 62.7% | – | 59.4% |
| HellaSwag | 58.4% | – | 41.2% |
| BFCL v2 | 29.4% | – | 25.7% |
| Open Rewrite | 51.0% | – | 41.6% |
| TLDR9+ | 16.8% | – | 16.8% |

Training Data

We constructed our datamix by selecting publicly available datasets that cover a range of domains. Using data-specific simulation runs, each dataset's contribution to the training process was carefully balanced by assigning it a specific sampling weight. These weights remained consistent throughout the base model pretraining and were informed by the extended work of Automixer and additional ablation studies.
The pre-training datamix primarily consists of a large educational web dataset, which makes up the vast majority of the training data. Smaller but significant portions come from coding data, mathematics, Wikipedia, scientific papers, Q&A forums, and algebraic content. In total, the datamix comprises approximately 1.5 billion rows and 1.64 trillion tokens.
For our instruction fine-tuning data mix, we focus on data diversity across existing open-source fine-tuning corpora. Specifically, we combine datasets for general instruction tuning with chat, science, safety, coding and math domains. For the final DPO phase, we rely on fully synthetic datasets.
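
To illustrate how fixed sampling weights can drive such a mix, the sketch below interleaves a few open corpora with the Hugging Face datasets library. The dataset choices and probabilities are placeholders for illustration only; they are not the actual MobileLLM-Pro data mix or its Automixer-derived weights.

# Hypothetical sketch: interleave open pre-training corpora with fixed
# sampling weights. Dataset choices and probabilities are placeholders,
# not the actual MobileLLM-Pro data mix.
from datasets import load_dataset, interleave_datasets

web_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
math = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# Keep a single shared "text" column so the sources can be interleaved.
web_edu, math, wiki = (d.select_columns(["text"]) for d in (web_edu, math, wiki))

# Sampling weights stay fixed for the entire base-model pre-training run.
mixed = interleave_datasets(
    [web_edu, math, wiki],
    probabilities=[0.80, 0.15, 0.05],   # placeholder weights
    seed=42,
)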

Training Process

Pretraining

Our pre-training process consists of three distinct phases, using logit-based knowledge distillation from the Llama 4 Scout teacher model and a novel model merging paradigm:

Phase 1 (KD): Language Learning – Learn general language skills from high-quality, well balanced pre-training data
Phase 2 (KD): Long-context awareness – Extend the model context-length to 128k tokens using implicit positional distillation from the teacher model
Phase 3 (KD): Domain abilities – Acquire domain understanding through annealing of multiple models in parallel and merging the specialist models, resulting in improvements across a diverse range of domains


On top of the three pre-training phases, we add a fourth phase of Quantization-Aware Training (QAT) for our 4-bit quantized model checkpoint.
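
The card describes the objective only as logit-based knowledge distillation with a KL-divergence loss, so the following is a minimal sketch of that standard formulation: the student is trained to match the teacher's (Llama 4 Scout) next-token distribution. The temperature and the absence of an additional hard-label term are assumptions for illustration.

# Minimal sketch of logit-based knowledge distillation with a KL-divergence loss.
# Temperature and loss mixing details are assumptions, not the exact recipe.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    # KL(teacher || student) over the vocabulary, averaged per token
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Usage: the teacher runs in inference mode, the student trains on its logits.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = kd_loss(student(input_ids).logits, teacher_logits)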

Instruction Fine-Tuning

We split the instruction fine-tuning stage into three distinct phases combining SFT and DPO methods:

Phase 1 (SFT): Learn general instruction-following with a focus on data diversity
Phase 2 (SFT): Re-weight the Phase 1 data mix by domain to address the Phase 1 model's shortcomings (e.g. upsample code data to improve logical reasoning)
Phase 3 (SFT + DPO): Train and align the model for safety and self-identification
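
The card does not detail the Phase 3 alignment objective beyond naming DPO, so the sketch below shows the standard DPO loss over (prompt, chosen, rejected) preference pairs; the beta value and the frozen reference model follow the original DPO formulation rather than a MobileLLM-specific recipe.

# Minimal sketch of the standard DPO objective over preference pairs.
# The log-probabilities are summed token log-likelihoods of each response
# under the trained policy and a frozen reference model.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Implicit rewards are log-prob margins relative to the reference model
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the chosen completion above the rejected one
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()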


Quantization


We apply Quantization-Aware Training (QAT) to our base and instruction fine-tuned models, yielding quantization-ready checkpoints that can either be directly converted to integer data types (with minimal quality loss) or used for QAT on additional data. We release two quantization-ready checkpoints:

  • 4-bit groupwise weight quantization with block size 32, 8-bit dynamic activations, and 8-bit KV-cache quantization, optimized for CPU/GPU backends (XNNPACK).
  • 4-bit channelwise weight quantization without activation quantization and 8-bit KV-cache quantization, designed for edge hardware accelerators such as the Apple Neural Engine (ANE) and Qualcomm's Hexagon Tensor Processor (HTP).

Our QAT approach incorporates long-context awareness (up to 128k tokens) and self-knowledge distillation using the full-precision teacher model. We compared the QAT-trained model to a standard round-to-nearest Post-Training Quantization (PTQ) baseline. In the groupwise pre-training setting, we observe a 34% (absolute) regression in average benchmark score when using PTQ and only a 1.5% (absolute) regression for QAT. For instruction fine-tuning, we observe less than 1% average regression using QAT.
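
To make the PTQ baseline concrete: round-to-nearest post-training quantization simply snaps each weight group to the nearest of the 16 int4 levels using a per-group scale, with no further training. The sketch below shows symmetric round-to-nearest quant-dequant for a single weight tensor at group size 32; it is illustrative and not the exact evaluation code.

# Minimal sketch of symmetric round-to-nearest int4 quant-dequant (group size 32),
# the kind of PTQ baseline compared against QAT above. Illustrative only.
import torch

def rtn_int4_quant_dequant(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Symmetric int4 uses the integer range [-8, 7]
    scale = groups.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7)
    return (q * scale).reshape(out_features, in_features)

w = torch.randn(1280, 1280)
w_hat = rtn_int4_quant_dequant(w)
print(f"mean absolute quantization error: {(w - w_hat).abs().mean():.4f}")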

How to use

Full precision:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

login(token="<HF_TOKEN>")
MODEL_ID = "facebook/MobileLLM-Pro"

def generate(user_input: str, model, tokenizer, chat: bool) -> str:
    if chat:
        user_input = [{"role": "user", "content": user_input}]
        inputs = tokenizer.apply_chat_template(
            user_input, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)
    else:
        inputs = tokenizer(user_input, return_tensors="pt")["input_ids"].to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def main():
    version = "instruct"  # "base" | "instruct"
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    prompt = "Why are open-source on-device language models great?"
    result = generate(prompt, model, tokenizer, chat=(version == "instruct"))
    print(result)

if __name__ == "__main__":
    main()

Quantize Checkpoints

4-bit Groupwise Quantization

import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    QATConfig,
    IntxFakeQuantizeConfig
)

model_id = "facebook/MobileLLM-Pro"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True
)

# Prepare for QAT.
# 8-bit dynamic per-token quantization for activations
activation_config = IntxFakeQuantizeConfig(
    torch.int8, "per_token", is_symmetric=False,
)
# 4-bit symmetric weight fake quantization with group size 32
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
embedding_qat_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
quantize_(
    model,
    QATConfig(
        weight_config=embedding_qat_config,
        step="prepare"
    ),
    embedding_filter_fn
)

# The model is now ready for Quantization aware Training (QAT)
# trainer.train()
model.save_pretrained(
    save_directory=<QAT_save_directory>,
    safe_serialization=False
)

# Convert model after training
from torchao.quantization import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig
)
from torchao.quantization.granularity import PerGroup

qat_convert_config = QATConfig(
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    ),
    step="convert",
)
quantize_(model, qat_convert_config)
embedding_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerGroup(32)
)
quantize_(
    model,
    QATConfig(
        embedding_convert_config,
        step="convert"
    ),
    embedding_filter_fn
)

# Save model after convert
model.save_pretrained(
    save_directory=<quantized_model_directory>,
    safe_serialization=False
)

4-bit Channelwise Quantization

import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis
from torchao.quantization.qat import (
    initialize_fake_quantizers,
    IntxFakeQuantizeConfig,
    QATConfig
)

model_id = "facebook/MobileLLM-Pro"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True
)

# 4-bit per-channel with range_learning=True for weights
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    granularity=PerAxis(0),
    is_symmetric=True,
    is_dynamic=False,
    range_learning=True,
)
qat_config = QATConfig(
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
quantize_(model, qat_config, embedding_filter_fn)

# Initialize the fake quantizers for range-learning
example_inputs = (torch.tensor([[1]], dtype=torch.long),)
initialize_fake_quantizers(model, example_inputs)


# The model is now ready for Quantization aware Training (QAT)
# trainer.train()
model.save_pretrained(
    save_directory=<QAT_save_directory>,
    safe_serialization=False
)

# Convert model after training
from torchao.quantization import IntxWeightOnlyConfig

wt_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerAxis(0)
)
qat_convert_config = QATConfig(
    wt_convert_config,
    step="convert",
)
quantize_(model, qat_convert_config)
quantize_(model, qat_convert_config, embedding_filter_fn)

# Save model after convert
model.save_pretrained(
    save_directory=<quantized_model_directory>,
    safe_serialization=False
)

Latency benchmarking

Latency benchmarking was done on a Samsung Galaxy S25 CPU and a Samsung Galaxy S24 Hexagon Tensor Processor (HTP). Models were exported to ExecuTorch with the XNNPACK backend (for CPU) and the HTP backend (for the accelerator). The CPU model with 4-bit groupwise quantization is 590MB. The following table shows CPU and HTP prefill latency for input prompt lengths of 2k, 4k and 8k tokens, along with decode speed when generating 1k tokens.

| Metric / Prompt length | 2k | 4k | 8k |
|---|---|---|---|
| CPU Prefill Latency (s) | 8.9 | 24.8 | 63.5 |
| CPU Decode Speed (tok/s) | 33.6 | 24.8 | 19.7 |
| HTP Prefill Latency (s) | 1.96 | 3.38 | 9.82 |
| HTP Decode Speed (tok/s) | 31.60 | 28.95 | 22.77 |
| KV Cache Size (MB) | 14 | 23 | 40 |
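
For context, the prefill latencies above translate into the following approximate prefill throughput (prompt tokens processed per second), treating 2k/4k/8k as 2048/4096/8192 tokens:

# Convert the reported prefill latencies into approximate prefill throughput.
prompt_lengths = [2048, 4096, 8192]
cpu_prefill_s = [8.9, 24.8, 63.5]
htp_prefill_s = [1.96, 3.38, 9.82]

for tokens, cpu_s, htp_s in zip(prompt_lengths, cpu_prefill_s, htp_prefill_s):
    print(f"{tokens} tokens: CPU ~{tokens / cpu_s:.0f} tok/s, HTP ~{tokens / htp_s:.0f} tok/s")
# e.g. 8192 tokens: CPU ~129 tok/s, HTP ~834 tok/s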

To validate the benefit of interleaved local-global attention (LGA), we benchmark models across different prompt lengths and measure the speed-up in prefill & decode relative to using global attention at every layer:

[Figure: prefill and decode speed-up of interleaved local-global attention over fully global attention across prompt lengths]

Citation

@misc{mobilellm_pro,
title={MobileLLM-Pro Model Card},
author={Patrick Huber*, Ernie Chang*, Wei Wen*, Igor Fedorov*, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi**, Adithya Sagar**},
year={2025},
month={October},
url = {https://huggingface.co/facebook/MobileLLM-Pro}}

Contact

Patrick Huber, Meta Inc, Reality Labs ([email protected])
Ernie Chang, Meta Inc, Reality Labs ([email protected])
Wei Wen, Meta Inc, Reality Labs ([email protected])
Igor Fedorov, Meta Inc, Reality Labs ([email protected])
Raghuraman Krishnamoorthi, Meta Inc, Reality Labs ([email protected])
Adithya Sagar, Meta Inc, Reality Labs ([email protected])

Acknowledgements

We want to thank the team involved in this project, especially: Kimish Patel, Andrew Or, Min Guo, Shen Xu, Brian Moran, Maho Takahashi, Claire Lesage, Rylan Conway, Karan Chadha, Matthew Grange, Tomasz Wołcyrz, Shiv Desai, Amarlin Anand, Joele Sires, Robert Carrillo, Francisc Bungiu, Jayden Yu, AJ Brush, Yang Li, Samuel Selvan, Anand Sharma, Peng Shan, Anand Dass, Abhishek Sharma

License

MobileLLM-Pro is distributed under the FAIR NC license
