Light-TLLM-7B
Introduction
Light-TLLM-7B is a machine-translation-focused variant of Qwen2.5-7B developed by 360 AI Research.
This repo contains the translation-specialized 7B model, which has the following features:
- Type: Causal Language Models for Machine Translation
- Training Stage: Continued pretraining, curriculum SFT, and MtPO reinforcement learning
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B (6.53B non-embedding)
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Up to 131,072 tokens
- Vocabulary Size: 180,736 tokens after low-resource vocabulary expansion
Requirements
The code of Light-TLLM-7B is compatible with the latest Hugging Face transformers library, and we recommend using the latest release.
With transformers<4.37.0, you will encounter the following error:
```
KeyError: 'qwen2'
```
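To verify the environment before loading the model, a quick check along these lines can help (a minimal sketch; the version floor comes from the error above, everything else is our own):

```python
import transformers
from packaging import version  # packaging ships as a transformers dependency

# Qwen2-based checkpoints need transformers >= 4.37.0 (see the KeyError above)
if version.parse(transformers.__version__) < version.parse("4.37.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old; "
        "please upgrade with `pip install -U transformers`"
    )
```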
Quickstart
The following code snippet shows how to load the tokenizer and model and run a machine translation prompt.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qihoo360/Light-TLLM-7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example translation prompt
prompt = "Translate the following English text to Chinese: Hello, how are you today?"
messages = [
    {"role": "system", "content": "You are a professional translator. Translate the given text accurately and naturally."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
# Strip the prompt tokens so only the newly generated translation remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
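For benchmark-style runs where reproducibility matters, sampling can be disabled; this deterministic variant is our own suggestion, not an officially recommended configuration:

```python
# Deterministic decoding (hypothetical settings): greedy search instead of
# temperature sampling, useful when comparing outputs across runs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=False
)
```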
Training Pipeline (MtPO)
Training runs in four stages, from tokenizer expansion to reinforcement-learning alignment.
- Stage 1 - Vocabulary expansion: Extend the Qwen2.5 tokenizer with 3k-4k tokens per target language (Khmer, Lao, Mongolian, Myanmar, Tamil, Thai, Tibetan, Uyghur). FLORES-Plus diagnostics show 2.1x-5.4x compression gains, cutting Khmer token counts from 402 to 103 for representative passages.
- Stage 2 - Balanced continued pretraining: Continue training on 200B tokens with a 1:1 mix between English and the expanded low-resource corpus to preserve high-resource coverage while materially improving low-resource fluency.
- Stage 3 - Curriculum SFT: Train on a 7M-sample blend (5:1 general instructions vs. multilingual data) that progresses from base instruction-following to ASEAN translation and mixed-format prompts.
- Stage 4 - MtPO reinforcement learning: Optimize with entropy-tempered policy updates that keep sampling temperature consistent, apply asymmetric ratio clipping, and normalize advantages at the microbatch level to avoid length bias or entropy collapse.
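The exact MtPO objective is not reproduced here; the sketch below only illustrates two of the ingredients named above, asymmetric ratio clipping and microbatch-level advantage normalization, under our own assumptions about shapes and bounds (names such as `mtpo_policy_loss`, `clip_low`, and `clip_high` are illustrative):

```python
import torch

def mtpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     rewards: torch.Tensor,
                     clip_low: float = 0.8,
                     clip_high: float = 1.28) -> torch.Tensor:
    """Sketch of a PPO-style update with asymmetric clipping and
    microbatch-level advantage normalization (illustrative, not the
    published MtPO objective)."""
    # Normalize advantages over the whole microbatch rather than per
    # sequence, which avoids rewarding longer outputs just for length.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current and behavior policies.
    ratio = torch.exp(logp_new - logp_old)

    # Asymmetric clipping: the upper bound is looser than the lower one,
    # so probability mass can grow faster than it can collapse.
    clipped = torch.clamp(ratio, clip_low, clip_high)
    loss = -torch.min(ratio * adv, clipped * adv)
    return loss.mean()
```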
Verifiable Reward Guardrails
Reinforcement Learning with Verifiable Rewards (RLVR) combines the translation reward model with deterministic validators. During RL we sample K candidates per prompt, score them with RLVR, and keep the top-G diverse outputs for gradient updates. Each candidate is checked for:
- Length ratio safety relative to the source (default bounds 0.5-2.0 with soft penalties outside range)
- Structural token preservation for HTML, Markdown, and code blocks using lightweight parsers
- Target-language verification via a confidence-gated language ID classifier
- Code-mixing penalties that suppress unintended language drift
These verifiable rewards are added to the semantic score so bad outputs receive immediate negative credit, while high-quality candidates remain eligible for optimization.
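As an illustration of how such guardrails can be composed, here is a minimal sketch of the length-ratio check plus a combined reward, assuming a per-candidate semantic score is already available (function names, weights, and the character-level ratio are our own, not from the released code):

```python
def length_ratio_penalty(src: str, hyp: str,
                         low: float = 0.5, high: float = 2.0) -> float:
    """Soft penalty outside the 0.5-2.0 length-ratio band described above."""
    ratio = max(len(hyp), 1) / max(len(src), 1)
    if low <= ratio <= high:
        return 0.0
    # Penalty grows with the distance from the nearest bound.
    return -abs(ratio - (low if ratio < low else high))

def verifiable_reward(semantic_score: float, src: str, hyp: str,
                      target_lang_ok: bool, format_ok: bool) -> float:
    """Combine the semantic score with deterministic validator signals
    (illustrative weighting; the paper's exact weights are not public)."""
    reward = semantic_score
    reward += length_ratio_penalty(src, hyp)
    if not target_lang_ok:   # confidence-gated language-ID check failed
        reward -= 1.0
    if not format_ok:        # HTML/Markdown/code structure not preserved
        reward -= 1.0
    return reward
```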
Data and Training Budget
Summary of resources and evaluation suites used during MtPO development.
- Continued pretraining: 200B tokens with adaptive sampling over English, ASEAN, Tibetan, Mongolian, Tamil, and Uyghur corpora
- Reinforcement learning: 60k steps, batch size 128, top-G candidate selection with RLVR filtering (see the sketch after this list)
- Reward model: Preference data spans ten error categories (accuracy, fluency, terminology, formatting, code-mixing, etc.)
- Benchmarks: FLORES-Plus (90 directions), BBH, CMMLU, HellaSwag, MMLU
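The top-G filtering mentioned above can be sketched as follows (a hypothetical helper; the value of G and the deduplication rule are assumptions on our part):

```python
def select_top_g(candidates, scores, g: int = 4):
    """Keep the g highest-scoring candidates, dropping exact duplicates
    so the kept set stays diverse (a simplification of the RLVR filter)."""
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    kept, seen = [], set()
    for score, cand in ranked:
        if cand in seen:
            continue
        seen.add(cand)
        kept.append((cand, score))
        if len(kept) == g:
            break
    return kept
```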
Model Details
- Model Type: Qwen2-based Causal Language Model
- Language(s): Multilingual (English, Chinese, Khmer, Lao, Myanmar, Thai, Tibetan, Mongolian, Tamil, Malay, Indonesian, Filipino, Vietnamese, Uyghur, etc.)
- License: Apache 2.0
- Finetuned from: Qwen/Qwen2.5-7B
- Model Size: 7.61B parameters
- Context Length: 131,072 tokens
Usage
This model is specifically designed for machine translation tasks. It can handle various translation scenarios including:
- English <-> Chinese translation
- Multilingual translation tasks
- Professional document translation
- Conversational translation
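The chat-template format from the Quickstart extends to these scenarios; the prompt wording below is our own, assuming the model accepts free-form direction instructions:

```python
# Hypothetical prompts for other translation scenarios, following the
# Quickstart format (instruction phrasing is an assumption).
prompts = {
    "en-th": "Translate the following English text to Thai: Good morning!",
    "km-en": "Translate the following Khmer text to English: សួស្តី",
    "doc":   "Translate the following English document to Chinese, "
             "preserving the Markdown formatting:\n\n# Title\nSome text.",
}
```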
Evaluation
Translation and General Benchmarks
Light-TLLM-7B is evaluated on FLORES-Plus (90 directions) and standard instruction-following benchmarks. Scores below use sacreBLEU (higher is better) and zero-shot accuracy (percentage).
| Model | Group | xx->en | en->xx | xx->xx | Avg. | BBH | CMMLU | HellaSwag | MMLU |
|---|---|---|---|---|---|---|---|---|---|
| Gemma3-27B-IT | Multilingual chat | 36.8 | 30.7 | 22.3 | 24.7 | 55.9 | 55.9 | 55.9 | 56.0 |
| Qwen3-8B | Multilingual chat | 31.1 | 23.3 | 14.4 | 16.9 | 63.8 | 60.8 | 26.0 | 51.3 |
| Qwen2.5-7B-Instruct | Multilingual chat | 24.8 | 17.4 | 9.2 | 11.6 | 54.4 | 64.1 | 85.2 | 40.9 |
| Apertus-8B-Instruct | Multilingual chat | 32.5 | 25.7 | 15.6 | 18.3 | 49.2 | 45.3 | 64.2 | 45.2 |
| Tower-Plus-9B | Multilingual chat | 28.2 | 18.3 | 9.8 | 12.5 | 40.4 | 57.2 | 73.1 | 42.1 |
| Qwen-MT-Plus | Translation-focused | 34.0 | 29.6 | 19.6 | 22.1 | - | - | - | - |
| Seed-X-PPO-7B | Translation-focused | 25.9 | 22.6 | 10.5 | 13.3 | - | - | - | - |
| Hunyuan-MT-7B | Translation-focused | 24.6 | 23.4 | 14.8 | 16.6 | - | - | - | - |
| Light-TLLM-7B-SFT | Our models | 35.4 | 32.0 | 22.7 | 24.3 | 59.6 | 61.4 | 83.7 | 47.2 |
| Light-TLLM-7B-RL | Our models | 36.1 | 32.7 | 23.1 | 24.9 | 60.9 | 63.2 | 85.2 | 48.5 |
- en->xx directions gain +1.1 BLEU over the next best 7B system while preserving reasoning accuracy (+1.3 MMLU over SFT).
- Average BLEU across all FLORES-Plus directions rises to 24.9 despite the compact 7B footprint.
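For reference, FLORES-style scoring with sacreBLEU can be reproduced along these lines (a minimal sketch; the example strings and tokenizer choice are assumptions):

```python
import sacrebleu

# Hypothetical outputs/references for one en->zh direction; for Chinese,
# sacreBLEU's "zh" tokenizer is the usual choice.
hypotheses = ["你好，你今天怎么样？"]
references = [["你好，你今天好吗？"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(f"sacreBLEU: {bleu.score:.1f}")
```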
Tokenizer Efficiency
Vocabulary expansion provides substantial compression on targeted scripts (higher compression ratio means fewer tokens per sentence).
| Language | Added tokens | Old compression ratio | New compression ratio | Speedup |
|---|---|---|---|---|
| Khmer | 3712 | 0.85 | 3.49 | 4.09x |
| Lao | 3359 | 0.85 | 3.05 | 3.59x |
| Myanmar | 3226 | 0.69 | 2.87 | 4.17x |
| Thai | 2958 | 1.79 | 2.97 | 1.66x |
| Tibetan | 3920 | 0.75 | 4.03 | 5.39x |
- Khmer passages shrink from 402 tokens to 103 tokens in the running example used in the paper.
- Compression gains translate into lower latency and memory cost during decoding for low-resource scripts.
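The ratios in the table can be spot-checked by tokenizing the same text with the base and expanded tokenizers (a minimal sketch; the sample sentence is ours, not the passage from the paper):

```python
from transformers import AutoTokenizer

# Compare token counts between the base and expanded tokenizers.
base = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
expanded = AutoTokenizer.from_pretrained("qihoo360/Light-TLLM-7B")

sample = "សួស្តី តើអ្នកសុខសប្បាយទេ?"  # illustrative Khmer sentence
n_base = len(base.tokenize(sample))
n_expanded = len(expanded.tokenize(sample))
print(f"base: {n_base} tokens, expanded: {n_expanded} tokens, "
      f"speedup: {n_base / n_expanded:.2f}x")
```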
Constraint Reliability (RLVR)
RLVR introduces deterministic checks that reduce failure modes compared with general chat models and MT baselines.
| Model | Language targeting | Length control | Format preservation | Code mixing | Overall |
|---|---|---|---|---|---|
| Light-TLLM-7B-RL | 97.8 | 99.2 | 92.15 | 92.3 | 95.3 |
| Qwen2.5-7B-Instruct | 92.0 | 97.0 | 51.8 | 62.8 | 75.9 |
| Gemma3-27B-IT | 97.4 | 91.6 | 42.1 | 90.9 | 80.5 |
| Qwen-MT-Plus | 97.6 | 99.8 | 82.5 | 94.8 | 93.6 |
| Seed-X-PPO-7B | 97.6 | 79.8 | 79.0 | 90.3 | 86.6 |
| DeepSeek-V3 | 95.4 | 95.7 | 67.6 | 95.0 | 88.4 |
| Hunyuan-MT-7B | 91.8 | 90.7 | 71.1 | 96.2 | 87.4 |
- Format retention jumps to 92.15 percent versus 51.8 percent for Qwen2.5-7B-Instruct, mitigating HTML or Markdown corruption.
- Language targeting stays above 97 percent while MtPO avoids verbosity by normalizing advantages at the microbatch level.
- Overall pass rate reaches 95.3 percent, surpassing Qwen2.5-7B-Instruct by 19.4 points, DeepSeek-V3 by 6.9 points, and Qwen-MT-Plus by 1.7 points despite identical constraint settings.
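A lightweight structure check of the kind used for format preservation might look like this (our own sketch, not the released validator; the regex covers only a few common markup forms):

```python
import re

# HTML tags, code fences, Markdown bold markers, and Markdown links.
MARKUP = re.compile(r"</?\w+[^>]*>|```|\*\*|\[[^\]]*\]\([^)]+\)")

def format_preserved(src: str, hyp: str) -> bool:
    """Pass if the translation keeps the same multiset of structural
    tokens as the source."""
    return sorted(MARKUP.findall(src)) == sorted(MARKUP.findall(hyp))

# Example: a dropped closing tag fails the check.
assert format_preserved("<b>hello</b>", "<b>你好</b>")
assert not format_preserved("<b>hello</b>", "<b>你好")
```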
Per-Language FLORES Highlights
- English->Thai: 34.1 BLEU, +1.5 over Qwen-MT-Plus.
- English->Myanmar: 12.9 BLEU with stable length control.
- English->Filipino: 35.4 BLEU after MtPO, combining instruction fidelity and translation quality.
- Khmer->English: 44.7 BLEU, reflecting gains from tokenizer expansion.
- Vietnamese->English: 37.6 BLEU with consistent improvements across ASEAN language pairs.
Citation
If you find our work helpful, please consider citing it:
```bibtex
@inproceedings{liu2026mtpo,
  title     = {Light-TLLM-7B},
  author    = {Light-MT Team},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://huggingface.co/qihoo360/Light-TLLM-7B}
}
```
Disclaimer
This model is provided for research and educational purposes. Please ensure responsible use and compliance with applicable laws and regulations when using this model.