InfLLM-V2-Short-Dense-Base

Project Links: [Paper] [InfLLM-V2 Models] [CUDA Kernel Code]


🚀 Model Description

InfLLM-V2-Short-Dense-Base is the foundational base model for the InfLLM-V2 long-context training pipeline.

This model is pre-trained on a large corpus of short-text data and uses a standard dense attention mechanism. It serves as the starting checkpoint for the continued-training phase that unlocks the long-context capabilities of the final sparse model.

It performs strongly on short-text tasks and provides a solid foundation for further fine-tuning or continued training.

📌 Role in the InfLLM-V2 Ecosystem

This model is the crucial first step in the InfLLM-V2 training workflow. The entire process is designed to be transparent and reproducible:

  • Step 1: Start from this base model.

  • Step 2: Continue training on long-text data (see the sketch after this list).

  • Step 3: Get the final long-context model.
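
As a rough illustration of Step 2, the sketch below shows generic continued pre-training on long documents with the Hugging Face Trainer. It is a minimal sketch, not the paper's recipe: the dataset file, the 32K sequence length, and all hyperparameters are illustrative placeholders, and the actual InfLLM-V2 pipeline additionally switches to the InfLLM-V2 sparse attention during this phase.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "openbmb/InfLLM-V2-Short-Dense-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Hypothetical long-text corpus; substitute your own data files.
dataset = load_dataset("text", data_files={"train": "long_texts.txt"})["train"]

def tokenize(batch):
    # Illustrative 32K context window; the actual training length may differ.
    return tokenizer(batch["text"], truncation=True, max_length=32768)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="infllm-v2-long-ct",   # placeholder output directory
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,               # illustrative, not the paper's value
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard causal (next-token) LM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()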

💻 How to Use

Because this is a standard dense-attention model, you can use it directly with the transformers library without any special configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model_id = "openbmb/InfLLM-V2-Short-Dense-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Load weights directly in bfloat16 to avoid a full-precision copy in memory
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(device)

# Create a prompt
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate text
outputs = model.generate(**inputs, max_new_tokens=10)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
# Expected output: The capital of France is Paris.
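
Greedy decoding (the default) works well for the factual completion above; for open-ended text you may prefer sampling. The settings below are illustrative defaults, not recommendations from the model authors.

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,   # sample from the distribution instead of greedy argmax
    temperature=0.8,  # illustrative value
    top_p=0.95,       # illustrative nucleus-sampling threshold
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))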

Note: This model is optimized for short sequences. For long-context capabilities, please use the final InfLLM-V2-Long-Sparse-Base model.

Citation

If you use our work in your research, please cite our paper:

@misc{zhao2025infllmv2densesparseswitchableattention,
      title={InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation}, 
      author={Weilin Zhao and Zihan Zhou and Zhou Su and Chaojun Xiao and Yuxuan Li and Yanghao Li and Yudi Zhang and Weilun Zhao and Zhen Li and Yuxiang Huang and Ao Sun and Xu Han and Zhiyuan Liu},
      year={2025},
      eprint={2509.24663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24663}, 
}