Light-TLLM-7B


Introduction

Light-TLLM-7B is a machine-translation-focused variant of Qwen2.5-7B developed by 360 AI Research.

This repo contains the translation-specialized 7B model, which has the following features:

  • Type: Causal Language Models for Machine Translation
  • Training Stage: Continued pretraining, curriculum SFT, and MtPO reinforcement learning
  • Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  • Number of Parameters: 7.61B (6.53B non-embedding)
  • Number of Layers: 28
  • Number of Attention Heads (GQA): 28 for Q and 4 for KV
  • Context Length: Up to 131,072 tokens
  • Vocabulary Size: 180,736 tokens, including the vocabulary expansion for low-resource scripts (Stage 1 of the pipeline below)

Requirements

The code of Light-TLLM-7B works with the current Hugging Face transformers library; we recommend installing the latest version (pip install -U transformers).

With transformers<4.37.0, you will encounter the following error:

KeyError: 'qwen2'

Quickstart

The following code snippet shows how to load the tokenizer and model and run a machine translation request.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qihoo360/Light-TLLM-7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example translation prompt
prompt = "Translate the following English text to Chinese: Hello, how are you today?"
messages = [
    {"role": "system", "content": "You are a professional translator. Translate the given text accurately and naturally."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
# Keep only the newly generated tokens, dropping the echoed prompt
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Training Pipeline (MtPO)

Training runs in four stages, from tokenizer expansion to reinforcement-learning alignment.

  • Stage 1 - Vocabulary expansion: Extend the Qwen2.5 tokenizer with 3k-4k tokens per target language (Khmer, Lao, Mongolian, Myanmar, Tamil, Thai, Tibetan, Uyghur). FLORES-Plus diagnostics show 2.1x-5.4x compression gains, cutting Khmer token counts from 402 to 103 for representative passages.
  • Stage 2 - Balanced continued pretraining: Continue training on 200B tokens with a 1:1 mix between English and the expanded low-resource corpus to preserve high-resource coverage while materially improving low-resource fluency.
  • Stage 3 - Curriculum SFT: Train on a 7M-sample blend (5:1 general instructions vs. multilingual data) that progresses from base instruction-following to ASEAN translation and mixed-format prompts.
  • Stage 4 - MtPO reinforcement learning: Optimize with entropy-tempered policy updates that keep the sampling temperature consistent, apply asymmetric ratio clipping, and normalize advantages at the microbatch level to avoid length bias or entropy collapse (a minimal sketch of these mechanics follows this list).
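
The MtPO update itself is not released as code here, so the snippet below is only a minimal PyTorch sketch of the two mechanics named in Stage 4: asymmetric ratio clipping and microbatch-level advantage normalization. The function name and the clip bounds (eps_low, eps_high) are illustrative assumptions, not the released hyperparameters.

import torch

def mtpo_policy_loss(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.3):
    # Microbatch-level advantage normalization: center and scale rewards
    # within the current microbatch so no single long or high-reward
    # sequence dominates the update (guards against length bias).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current policy and the sampling policy.
    ratio = torch.exp(logp_new - logp_old)

    # Asymmetric clipping: separate lower and upper bounds (the exact
    # values here are assumptions for illustration).
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)

    # Pessimistic PPO-style objective over the unclipped/clipped terms.
    return -torch.min(ratio * adv, clipped * adv).mean()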

Verifiable Reward Guardrails

Reinforcement Learning with Verifiable Rewards (RLVR) combines the translation reward model with deterministic validators. During RL we sample K candidates per prompt, score them with RLVR, and keep the top-G diverse outputs for gradient updates. Each candidate is checked for:

  • Length ratio safety relative to the source (default bounds 0.5-2.0 with soft penalties outside range)
  • Structural token preservation for HTML, Markdown, and code blocks using lightweight parsers
  • Target-language verification via a confidence-gated language ID classifier
  • Code-mixing penalties that suppress unintended language drift

These verifiable rewards are added to the semantic score so bad outputs receive immediate negative credit, while high-quality candidates remain eligible for optimization.
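
The validators themselves are not published; the sketch below illustrates the guardrail pattern under stated assumptions: a soft penalty outside the 0.5-2.0 length-ratio bounds listed above, a confidence-gated language-ID check (the classifier is a stand-in argument), and a combined score that adds the verifiable terms to the semantic reward. All function names and penalty weights are hypothetical.

def length_ratio_penalty(src: str, hyp: str, lo: float = 0.5, hi: float = 2.0) -> float:
    # Soft penalty outside the default 0.5-2.0 length-ratio bounds.
    ratio = max(len(hyp), 1) / max(len(src), 1)
    if lo <= ratio <= hi:
        return 0.0
    # Penalty grows with distance from the nearest bound (assumption).
    return -min(abs(ratio - lo), abs(ratio - hi))

def language_gate(hyp: str, target_lang: str, identify=None, min_conf: float = 0.9) -> float:
    # Confidence-gated language ID; `identify` stands in for any classifier
    # that returns a (language, confidence) pair.
    if identify is None:
        return 0.0  # no classifier available: neither reward nor penalty
    lang, conf = identify(hyp)
    return 0.0 if (lang == target_lang and conf >= min_conf) else -1.0

def rlvr_score(semantic: float, src: str, hyp: str, target_lang: str, identify=None) -> float:
    # Verifiable penalties are added to the semantic reward-model score,
    # so constraint violations receive immediate negative credit.
    return semantic + length_ratio_penalty(src, hyp) + language_gate(hyp, target_lang, identify)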

Data and Training Budget

Summary of resources and evaluation suites used during MtPO development.

  • Continued pretraining: 200B tokens with adaptive sampling over English, ASEAN, Tibetan, Mongolian, Tamil, and Uyghur corpora
  • Reinforcement learning: 60k steps, batch size 128, top-G candidate selection with RLVR filtering
  • Reward model: Preference data spans ten error categories (accuracy, fluency, terminology, formatting, code-mixing, etc.)
  • Benchmarks: FLORES-Plus (90 directions), BBH, CMMLU, HellaSwag, MMLU

Model Details

  • Model Type: Qwen2-based Causal Language Model
  • Language(s): Multilingual (English, Chinese, Khmer, Lao, Myanmar, Thai, Tibetan, Mongolian, Tamil, Malay, Indonesian, Filipino, Vietnamese, Uyghur, etc.)
  • License: Apache 2.0
  • Finetuned from: Qwen/Qwen2.5-7B
  • Model Size: 7.61B parameters
  • Context Length: 131,072 tokens

Usage

This model is specifically designed for machine translation tasks. It can handle a range of translation scenarios, including the following (a worked example follows the list):

  • English <-> Chinese translation
  • Multilingual translation tasks
  • Professional document translation
  • Conversational translation
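
Reusing the model and tokenizer loaded in the Quickstart, a format-sensitive request could look like the sketch below; the system prompt wording is illustrative, not a required template.

# Assumes `model` and `tokenizer` from the Quickstart are already loaded.
messages = [
    {"role": "system", "content": "You are a professional translator. Preserve all Markdown formatting in the translation."},
    {"role": "user", "content": "Translate the following English text to Thai:\n## Setup\n1. Install the package.\n2. Run `make test`."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the generated continuation, skipping the echoed prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))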

Evaluation

Translation and General Benchmarks

Light-TLLM-7B is evaluated on FLORES-Plus (90 directions) and standard general-capability benchmarks. Translation scores use sacreBLEU (higher is better); BBH, CMMLU, HellaSwag, and MMLU report zero-shot accuracy (%). A minimal scoring sketch follows the results.

Model                  Group                xx->en  en->xx  xx->xx  Avg.   BBH   CMMLU  HellaSwag  MMLU
Gemma3-27B-IT          Multilingual chat    36.8    30.7    22.3    24.7   55.9  55.9   55.9       56.0
Qwen3-8B               Multilingual chat    31.1    23.3    14.4    16.9   63.8  60.8   26.0       51.3
Qwen2.5-7B-Instruct    Multilingual chat    24.8    17.4     9.2    11.6   54.4  64.1   85.2       40.9
Apertus-8B-Instruct    Multilingual chat    32.5    25.7    15.6    18.3   49.2  45.3   64.2       45.2
Tower-Plus-9B          Multilingual chat    28.2    18.3     9.8    12.5   40.4  57.2   73.1       42.1
Qwen-MT-Plus           Translation-focused  34.0    29.6    19.6    22.1   -     -      -          -
Seed-X-PPO-7B          Translation-focused  25.9    22.6    10.5    13.3   -     -      -          -
Hunyuan-MT-7B          Translation-focused  24.6    23.4    14.8    16.6   -     -      -          -
Light-TLLM-7B-SFT      Our models           35.4    32.0    22.7    24.3   59.6  61.4   83.7       47.2
Light-TLLM-7B-RL       Our models           36.1    32.7    23.1    24.9   60.9  63.2   85.2       48.5

  • en->xx directions gain +1.1 BLEU over the next best 7B system while preserving reasoning accuracy (+1.3 MMLU over SFT).
  • Average BLEU across all FLORES-Plus directions rises to 24.9 despite the compact 7B footprint.
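
For reference, BLEU numbers of this kind can be computed with the sacrebleu package; the file names below are placeholders, and scripts without whitespace (such as Thai) may additionally need one of sacrebleu's language-specific tokenizers.

import sacrebleu  # pip install sacrebleu

# Placeholder files: one segment per line, hypothesis and reference aligned.
hyps = open("hyps.en-th.txt", encoding="utf-8").read().splitlines()
refs = open("refs.en-th.txt", encoding="utf-8").read().splitlines()

# corpus_bleu takes the hypothesis list plus a list of reference streams
# (a single reference stream here).
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.1f}")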

Tokenizer Efficiency

Vocabulary expansion provides substantial compression on targeted scripts (a higher compression ratio means fewer tokens per sentence); a measurement sketch follows the table.

Language  Added tokens  Old compression ratio  New compression ratio  Speedup
Khmer     3712          0.85                   3.49                   4.09x
Lao       3359          0.85                   3.05                   3.59x
Myanmar   3226          0.69                   2.87                   4.17x
Thai      2958          1.79                   2.97                   1.66x
Tibetan   3920          0.75                   4.03                   5.39x

  • Khmer passages shrink from 402 tokens to 103 tokens in the running example used in the paper.
  • Compression gains translate into lower latency and memory cost during decoding for low-resource scripts.
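
The comparison above can be approximated with a simple characters-per-token measurement, assuming that is the metric behind the table (the card does not define it explicitly). The base tokenizer is taken from Qwen/Qwen2.5-7B, and the sample text is a placeholder to be replaced with a passage in the target script.

from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")             # original vocabulary
expanded = AutoTokenizer.from_pretrained("qihoo360/Light-TLLM-7B")  # expanded vocabulary

sample = "..."  # placeholder: substitute a Khmer/Lao/Thai/etc. passage

def chars_per_token(tokenizer, text):
    # Compression ratio as characters per token: higher means fewer tokens.
    return len(text) / max(len(tokenizer(text).input_ids), 1)

old = chars_per_token(base, sample)
new = chars_per_token(expanded, sample)
print(f"old={old:.2f}  new={new:.2f}  speedup={new / old:.2f}x")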

Constraint Reliability (RLVR)

RLVR introduces deterministic checks that reduce failure modes compared with general chat models and MT baselines. All values below are constraint pass rates in percent.

Model                 Language targeting  Length control  Format preservation  Code mixing  Overall
Light-TLLM-7B-RL      97.8                99.2            92.15                92.3         95.3
Qwen2.5-7B-Instruct   92.0                97.0            51.8                 62.8         75.9
Gemma3-27B-IT         97.4                91.6            42.1                 90.9         80.5
Qwen-MT-Plus          97.6                99.8            82.5                 94.8         93.6
Seed-X-PPO-7B         97.6                79.8            79.0                 90.3         86.6
DeepSeek-V3           95.4                95.7            67.6                 95.0         88.4
Hunyuan-MT-7B         91.8                90.7            71.1                 96.2         87.4

  • Format retention jumps to 92.15 percent versus 51.8 percent for Qwen2.5-7B-Instruct, mitigating HTML or Markdown corruption.
  • Language targeting stays above 97 percent while MtPO avoids verbosity by normalizing advantages at the microbatch level.
  • Overall pass rate reaches 95.3 percent, surpassing Qwen2.5-7B-Instruct by 19.4 points, DeepSeek-V3 by 6.9 points, and Qwen-MT-Plus by 1.7 points despite identical constraint settings.

Per-Language FLORES Highlights

  • English->Thai: 34.1 BLEU, +1.5 over Qwen-MT-Plus.
  • English->Myanmar: 12.9 BLEU with stable length control.
  • English->Filipino: 35.4 BLEU after MtPO, combining instruction fidelity and translation quality.
  • Khmer->English: 44.7 BLEU, reflecting gains from tokenizer expansion.
  • Vietnamese->English: 37.6 BLEU with consistent improvements across ASEAN language pairs.

Citation

If you find our work helpful, feel free to cite us.

@inproceedings{liu2026mtpo,
    title = {Light-TLLM-7B},
    author = {Light-MT Team},
    booktitle = {International Conference on Learning Representations},
    year = {2025},
    url = {https://huggingface.co/qihoo360/Light-TLLM-7B}
}

Disclaimer

This model is provided for research and educational purposes. Please ensure responsible use and compliance with applicable laws and regulations when using this model.
