Light-TLLM-7B
Introduction
Light-TLLM-7B is a machine-translation-focused variant of Qwen2.5-7B developed by 360 AI Research.
This repo contains the translation-specialized 7B model, which has the following features:
- Type: Causal Language Models for Machine Translation
- Training Stage: Continued pretraining, curriculum SFT, and MtPO reinforcement learning
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B (6.53B non-embedding)
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Up to 131,072 tokens
- Vocabulary Size: 180,736 tokens after low-resource vocabulary expansion
Requirements
The code of Light-TLLM-7B is compatible with the latest Hugging Face transformers library, and we recommend using the latest release.
With transformers<4.37.0, you will encounter the following error:
```
KeyError: 'qwen2'
```
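To verify the environment before loading the model, a quick check along these lines can help (a minimal sketch; the version floor comes from the error above, everything else is our own):

```python
import transformers
from packaging import version  # packaging ships as a transformers dependency

# Qwen2-based checkpoints need transformers >= 4.37.0 (see the KeyError above)
if version.parse(transformers.__version__) < version.parse("4.37.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old; "
        "please upgrade with `pip install -U transformers`"
    )
```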
Quickstart
The following code snippet shows how to load the tokenizer and model and run a machine translation prompt.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qihoo360/Light-TLLM-7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example translation prompt
prompt = "Translate the following English text to Chinese: Hello, how are you today?"
messages = [
    {"role": "system", "content": "You are a professional translator. Translate the given text accurately and naturally."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
# Strip the prompt tokens so only the newly generated translation remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
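For benchmark-style runs where reproducibility matters, sampling can be disabled; this deterministic variant is our own suggestion, not an officially recommended configuration:

```python
# Deterministic decoding (hypothetical settings): greedy search instead of
# temperature sampling, useful when comparing outputs across runs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=False
)
```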
Training Pipeline (MtPO)
Training runs in four stages, from tokenizer expansion to reinforcement-learning alignment.
- Stage 1 - Vocabulary expansion: Extend the Qwen2.5 tokenizer with 3k-4k tokens per target language (Khmer, Lao, Mongolian, Myanmar, Tamil, Thai, Tibetan, Uyghur). FLORES-Plus diagnostics show 2.1x-5.4x compression gains, cutting Khmer token counts from 402 to 103 for representative passages.
- Stage 2 - Balanced continued pretraining: Continue training on 200B tokens with a 1:1 mix between English and the expanded low-resource corpus to preserve high-resource coverage while materially improving low-resource fluency.
- Stage 3 - Curriculum SFT: Train on a 7M-sample blend (5:1 general instructions vs. multilingual data) that progresses from base instruction-following to ASEAN translation and mixed-format prompts.
- Stage 4 - MtPO reinforcement learning: Optimize with entropy-tempered policy updates that keep sampling temperature consistent, apply asymmetric ratio clipping, and normalize advantages at the microbatch level to avoid length bias or entropy collapse.
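The exact MtPO objective is not reproduced here; the sketch below only illustrates two of the ingredients named above, asymmetric ratio clipping and microbatch-level advantage normalization, under our own assumptions about shapes and bounds (names such as `mtpo_policy_loss`, `clip_low`, and `clip_high` are illustrative):

```python
import torch

def mtpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     rewards: torch.Tensor,
                     clip_low: float = 0.8,
                     clip_high: float = 1.28) -> torch.Tensor:
    """Sketch of a PPO-style update with asymmetric clipping and
    microbatch-level advantage normalization (illustrative, not the
    published MtPO objective)."""
    # Normalize advantages over the whole microbatch rather than per
    # sequence, which avoids rewarding longer outputs just for length.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current and behavior policies.
    ratio = torch.exp(logp_new - logp_old)

    # Asymmetric clipping: the upper bound is looser than the lower one,
    # so probability mass can grow faster than it can collapse.
    clipped = torch.clamp(ratio, clip_low, clip_high)
    loss = -torch.min(ratio * adv, clipped * adv)
    return loss.mean()
```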
Verifiable Reward Guardrails
Reinforcement Learning with Verifiable Rewards (RLVR) combines the translation reward model with deterministic validators. During RL we sample K candidates per prompt, score them with RLVR, and keep the top-G diverse outputs for gradient updates. Each candidate is checked for:
- Length ratio safety relative to the source (default bounds 0.5-2.0 with soft penalties outside range)
- Structural token preservation for HTML, Markdown, and code blocks using lightweight parsers
- Target-language verification via a confidence-gated language ID classifier
- Code-mixing penalties that suppress unintended language drift
These verifiable rewards are added to the semantic score so bad outputs receive immediate negative credit, while high-quality candidates remain eligible for optimization.
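As an illustration of how such guardrails can be composed, here is a minimal sketch of the length-ratio check plus a combined reward, assuming a per-candidate semantic score is already available (function names, weights, and the character-level ratio are our own, not from the released code):

```python
def length_ratio_penalty(src: str, hyp: str,
                         low: float = 0.5, high: float = 2.0) -> float:
    """Soft penalty outside the 0.5-2.0 length-ratio band described above."""
    ratio = max(len(hyp), 1) / max(len(src), 1)
    if low <= ratio <= high:
        return 0.0
    # Penalty grows with the distance from the nearest bound.
    return -abs(ratio - (low if ratio < low else high))

def verifiable_reward(semantic_score: float, src: str, hyp: str,
                      target_lang_ok: bool, format_ok: bool) -> float:
    """Combine the semantic score with deterministic validator signals
    (illustrative weighting; the paper's exact weights are not public)."""
    reward = semantic_score
    reward += length_ratio_penalty(src, hyp)
    if not target_lang_ok:   # confidence-gated language-ID check failed
        reward -= 1.0
    if not format_ok:        # HTML/Markdown/code structure not preserved
        reward -= 1.0
    return reward
```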
Data and Training Budget
Summary of resources and evaluation suites used during MtPO development.
- Continued pretraining: 200B tokens with adaptive sampling over English, ASEAN, Tibetan, Mongolian, Tamil, and Uyghur corpora
- Reinforcement learning: 60k steps, batch size 128, top-G candidate selection with RLVR filtering (see the sketch after this list)
- Reward model: Preference data spans ten error categories (accuracy, fluency, terminology, formatting, code-mixing, etc.)
- Benchmarks: FLORES-Plus (90 directions), BBH, CMMLU, HellaSwag, MMLU
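The top-G filtering mentioned above can be sketched as follows (a hypothetical helper; the value of G and the deduplication rule are assumptions on our part):

```python
def select_top_g(candidates, scores, g: int = 4):
    """Keep the g highest-scoring candidates, dropping exact duplicates
    so the kept set stays diverse (a simplification of the RLVR filter)."""
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    kept, seen = [], set()
    for score, cand in ranked:
        if cand in seen:
            continue
        seen.add(cand)
        kept.append((cand, score))
        if len(kept) == g:
            break
    return kept
```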
Model Details
- Model Type: Qwen2-based Causal Language Model
- Language(s): Multilingual (English, Chinese, Khmer, Lao, Myanmar, Thai, Tibetan, Mongolian, Tamil, Malay, Indonesian, Filipino, Vietnamese, Uyghur, etc.)
- License: Apache 2.0
- Finetuned from: Qwen/Qwen2.5-7B
- Model Size: 7.61B parameters
- Context Length: 131,072 tokens
Usage
This model is specifically designed for machine translation tasks. It can handle various translation scenarios including:
- English <-> Chinese translation
- Multilingual translation tasks
- Professional document translation
- Conversational translation
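The chat-template format from the Quickstart extends to these scenarios; the prompt wording below is our own, assuming the model accepts free-form direction instructions:

```python
# Hypothetical prompts for other translation scenarios, following the
# Quickstart format (instruction phrasing is an assumption).
prompts = {
    "en-th": "Translate the following English text to Thai: Good morning!",
    "km-en": "Translate the following Khmer text to English: សួស្តី",
    "doc":   "Translate the following English document to Chinese, "
             "preserving the Markdown formatting:\n\n# Title\nSome text.",
}
```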
Evaluation
Translation and General Benchmarks
Light-TLLM-7B is evaluated on FLORES-Plus (90 directions) and standard instruction-following benchmarks. Scores below use sacreBLEU (higher is better) and zero-shot accuracy (percentage).
| Model | Group | xx->en | en->xx | xx->xx | Avg. | BBH | CMMLU | HellaSwag | MMLU |
|---|---|---|---|---|---|---|---|---|---|
| Gemma3-27B-IT | Multilingual chat | 36.8 | 30.7 | 22.3 | 24.7 | 55.9 | 55.9 | 55.9 | 56.0 |
| Qwen3-8B | Multilingual chat | 31.1 | 23.3 | 14.4 | 16.9 | 63.8 | 60.8 | 26.0 | 51.3 |
| Qwen2.5-7B-Instruct | Multilingual chat | 24.8 | 17.4 | 9.2 | 11.6 | 54.4 | 64.1 | 85.2 | 40.9 |
| Apertus-8B-Instruct | Multilingual chat | 32.5 | 25.7 | 15.6 | 18.3 | 49.2 | 45.3 | 64.2 | 45.2 |
| Tower-Plus-9B | Multilingual chat | 28.2 | 18.3 | 9.8 | 12.5 | 40.4 | 57.2 | 73.1 | 42.1 |
| Qwen-MT-Plus | Translation-focused | 34.0 | 29.6 | 19.6 | 22.1 | - | - | - | - |
| Seed-X-PPO-7B | Translation-focused | 25.9 | 22.6 | 10.5 | 13.3 | - | - | - | - |
| Hunyuan-MT-7B | Translation-focused | 24.6 | 23.4 | 14.8 | 16.6 | - | - | - | - |
| Light-TLLM-7B-SFT | Our models | 35.4 | 32.0 | 22.7 | 24.3 | 59.6 | 61.4 | 83.7 | 47.2 |
| Light-TLLM-7B-RL | Our models | 36.1 | 32.7 | 23.1 | 24.9 | 60.9 | 63.2 | 85.2 | 48.5 |
- en->xx directions gain +1.1 BLEU over the next best 7B system while preserving reasoning accuracy (+1.3 MMLU over SFT).
- Average BLEU across all FLORES-Plus directions rises to 24.9 despite the compact 7B footprint.
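For reference, FLORES-style scoring with sacreBLEU can be reproduced along these lines (a minimal sketch; the example strings and tokenizer choice are assumptions):

```python
import sacrebleu

# Hypothetical outputs/references for one en->zh direction; for Chinese,
# sacreBLEU's "zh" tokenizer is the usual choice.
hypotheses = ["你好，你今天怎么样？"]
references = [["你好，你今天好吗？"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(f"sacreBLEU: {bleu.score:.1f}")
```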
Tokenizer Efficiency
Vocabulary expansion provides substantial compression on targeted scripts (higher compression ratio means fewer tokens per sentence).
| Language | Added tokens | Old compression ratio | New compression ratio | Speedup |
|---|---|---|---|---|
| Khmer | 3712 | 0.85 | 3.49 | 4.09x |
| Lao | 3359 | 0.85 | 3.05 | 3.59x |
| Myanmar | 3226 | 0.69 | 2.87 | 4.17x |
| Thai | 2958 | 1.79 | 2.97 | 1.66x |
| Tibetan | 3920 | 0.75 | 4.03 | 5.39x |
- Khmer passages shrink from 402 tokens to 103 tokens in the running example used in the paper.
- Compression gains translate into lower latency and memory cost during decoding for low-resource scripts.
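The ratios in the table can be spot-checked by tokenizing the same text with the base and expanded tokenizers (a minimal sketch; the sample sentence is ours, not the passage from the paper):

```python
from transformers import AutoTokenizer

# Compare token counts between the base and expanded tokenizers.
base = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
expanded = AutoTokenizer.from_pretrained("qihoo360/Light-TLLM-7B")

sample = "សួស្តី តើអ្នកសុខសប្បាយទេ?"  # illustrative Khmer sentence
n_base = len(base.tokenize(sample))
n_expanded = len(expanded.tokenize(sample))
print(f"base: {n_base} tokens, expanded: {n_expanded} tokens, "
      f"speedup: {n_base / n_expanded:.2f}x")
```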
Constraint Reliability (RLVR)
RLVR introduces deterministic checks that reduce failure modes compared with general chat models and MT baselines.
| Model | Language targeting | Length control | Format preservation | Code mixing | Overall |
|---|---|---|---|---|---|
| Light-TLLM-7B-RL | 97.8 | 99.2 | 92.15 | 92.3 | 95.3 |
| Qwen2.5-7B-Instruct | 92.0 | 97.0 | 51.8 | 62.8 | 75.9 |
| Gemma3-27B-IT | 97.4 | 91.6 | 42.1 | 90.9 | 80.5 |
| Qwen-MT-Plus | 97.6 | 99.8 | 82.5 | 94.8 | 93.6 |
| Seed-X-PPO-7B | 97.6 | 79.8 | 79.0 | 90.3 | 86.6 |
| DeepSeek-V3 | 95.4 | 95.7 | 67.6 | 95.0 | 88.4 |
| Hunyuan-MT-7B | 91.8 | 90.7 | 71.1 | 96.2 | 87.4 |
- Format retention jumps to 92.15 percent versus 51.8 percent for Qwen2.5-7B-Instruct, mitigating HTML or Markdown corruption.
- Language targeting stays above 97 percent while MtPO avoids verbosity by normalizing advantages at the microbatch level.
- Overall pass rate reaches 95.3 percent, surpassing Qwen2.5-7B-Instruct by 19.4 points, DeepSeek-V3 by 6.9 points, and Qwen-MT-Plus by 1.7 points despite identical constraint settings.
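A lightweight structure check of the kind used for format preservation might look like this (our own sketch, not the released validator; the regex covers only a few common markup forms):

```python
import re

# HTML tags, code fences, Markdown bold markers, and Markdown links.
MARKUP = re.compile(r"</?\w+[^>]*>|```|\*\*|\[[^\]]*\]\([^)]+\)")

def format_preserved(src: str, hyp: str) -> bool:
    """Pass if the translation keeps the same multiset of structural
    tokens as the source."""
    return sorted(MARKUP.findall(src)) == sorted(MARKUP.findall(hyp))

# Example: a dropped closing tag fails the check.
assert format_preserved("<b>hello</b>", "<b>你好</b>")
assert not format_preserved("<b>hello</b>", "<b>你好")
```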
Per-Language FLORES Highlights
- English->Thai: 34.1 BLEU, +1.5 over Qwen-MT-Plus.
- English->Myanmar: 12.9 BLEU with stable length control.
- English->Filipino: 35.4 BLEU after MtPO, combining instruction fidelity and translation quality.
- Khmer->English: 44.7 BLEU, reflecting gains from tokenizer expansion.
- Vietnamese->English: 37.6 BLEU with consistent improvements across ASEAN language pairs.
Citation
If you find our work helpful, please consider citing it:
```bibtex
@inproceedings{liu2026mtpo,
  title     = {Light-TLLM-7B},
  author    = {Light-MT Team},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://huggingface.co/qihoo360/Light-TLLM-7B}
}
```
Disclaimer
This model is provided for research and educational purposes. Please ensure responsible use and compliance with applicable laws and regulations when using this model.