T-pro-it-2.0-FP8

Main BF16 model: t-tech/T-pro-it-2.0

🚨 Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. The responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.

T‑pro‑it‑2.0‑FP8 is a fine‑grained FP8‑quantised version of T‑pro‑it‑2.0 (built on the Qwen‑3 family). It delivers identical capabilities with roughly half the memory footprint and higher inference speed.

Description

T-pro-it-2.0 is a model built upon the Qwen 3 model family and incorporates both continual pre-training and alignment techniques.

📚 Dataset

Instruction Pre-Training: 40B tokens of instruction data, with one-third focused on reasoning tasks.

Supervised Fine-Tuning (SFT): ~500K high-quality and diverse instructions with balanced complexity. Reasoning tasks make up about 20% of the dataset.

Preference Tuning: ~100K carefully selected instructions, filtered by length and type for general tasks and with domain-balanced selection for reasoning tasks.

📊 Benchmarks

TBD

Note on FP8

For convenience and performance, we provide an FP8-quantized checkpoint of T-pro-it-2.0, whose name ends with -FP8. The quantization method is fine-grained FP8 quantization with a block size of 128. You can find more details in the quantization_config field in config.json.
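
A quick way to confirm these settings is to read them back from the model config. The sketch below only fetches and parses config.json, not the weights; the exact keys inside quantization_config depend on how the checkpoint was exported:

from transformers import AutoConfig

# Download and parse config.json only; no model weights are fetched.
config = AutoConfig.from_pretrained("t-tech/T-pro-it-2.0-FP8")

# quantization_config mirrors the field in config.json (quantization method,
# weight block size, etc.).
print(config.quantization_config)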

You can use the T-pro-it-2.0-FP8 model with several inference frameworks, including transformers, sglang, and vllm, just as you would the original bfloat16 model. However, please pay attention to the following known issues:

  • transformers:
    • there are currently issues with the "fine-grained fp8" method in transformers for distributed inference. You may need to set the environment variable CUDA_LAUNCH_BLOCKING=1 when multiple devices are used for inference, as in the sketch below.
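
If you hit hangs or device-side errors during multi-GPU inference with transformers, one workaround is to export the variable before loading the model. This is a minimal sketch, assuming a multi-GPU host and device_map="auto":

import os

# Serialize CUDA kernel launches; works around the known issue with the
# fine-grained FP8 path in transformers when several devices are used.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "t-tech/T-pro-it-2.0-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",  # shard the FP8 checkpoint across all visible GPUs
)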

Switching Between Thinking and Non‑Thinking Modes

To enable or disable reasoning mode in Hugging Face transformers, set the enable_thinking flag in tokenizer.apply_chat_template.
For more details, see the sketch below and the HF Usage example.
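
A minimal sketch of toggling the flag; it only changes how the chat template renders the prompt, while the tokenizer and model stay the same:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t-tech/T-pro-it-2.0-FP8")
messages = [{"role": "user", "content": "Сколько будет 2 + 2?"}]  # "What is 2 + 2?"

# Thinking mode (default): the model may emit its reasoning before the answer.
with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: the template suppresses the reasoning block.
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Compare the rendered prompts to see how the flag changes the template.
print(with_thinking)
print(without_thinking)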


Recommended Generation Parameters

Mode                              Temperature   presence_penalty
No-think (general requests)       ≤ 0.3         1.0
Think mode (standard requests)    ≈ 0.6         1.0
Complex reasoning requests        ≥ 0.8         1.0
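
As an illustration of how these settings map onto an inference stack, here is a sketch using vLLM's offline API; temperature and presence_penalty are standard SamplingParams fields, and the values below correspond to the "Think mode" row. It assumes a GPU with enough memory for the FP8 checkpoint:

from vllm import LLM, SamplingParams

llm = LLM(model="t-tech/T-pro-it-2.0-FP8")

# "Think mode (standard requests)" row from the table above.
sampling = SamplingParams(
    temperature=0.6,
    presence_penalty=1.0,
    max_tokens=1024,
)

messages = [{"role": "user", "content": "Сколько простых чисел меньше 100?"}]  # "How many primes are below 100?"
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)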

👨‍💻 Examples of usage

HF Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
torch.manual_seed(42)

model_name = "t-tech/T-pro-it-2.0-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    # "Please compute the definite integral ∫_0^1 x² eˣ dx, explain the
    # solution step by step, and state the final result."
    "Пожалуйста, вычисли определённый интеграл ∫_0^1 x² eˣ dx, "
    "пошагово объясни решение и укажи окончательный результат."
)
messages = [
    # System prompt: "You are T-pro, a virtual assistant at T-Technologies.
    # Your task is to be a helpful dialogue assistant."
    {"role": "system", "content": "Ты T-pro, виртуальный ассистент в Т-Технологии. Твоя задача - быть полезным диалоговым ассистентом."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
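
When enable_thinking=True, the generated text may contain a reasoning block delimited by <think>...</think> tags, as in the Qwen 3 family. A minimal way to separate it from the final answer is a plain string split, assuming the tags are preserved in the decoded text:

# Split off the reasoning block, if present. Assumes the chat template wraps
# reasoning in <think>...</think>, as in the Qwen 3 family.
if "</think>" in response:
    thinking, final_answer = response.split("</think>", 1)
    thinking = thinking.replace("<think>", "").strip()
    final_answer = final_answer.strip()
else:
    thinking, final_answer = "", response.strip()

print("Reasoning:", thinking)
print("Answer:", final_answer)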

Deployment

For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.8.5 to create an OpenAI-compatible API endpoint; a client-side sketch follows the commands below:

  • SGLang:
    python -m sglang.launch_server --model-path t-tech/T-pro-it-2.0-FP8 --reasoning-parser qwen3
    
  • vLLM:
    vllm serve t-tech/T-pro-it-2.0-FP8 --enable-reasoning --reasoning-parser qwen3
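
Once the server is up, the recommended generation parameters from the table above can be passed through the OpenAI-compatible API. A minimal sketch, assuming a local vLLM deployment on the default port (adjust base_url for SGLang or a non-default port):

from openai import OpenAI

# Placeholder endpoint and key for a local OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="t-tech/T-pro-it-2.0-FP8",
    messages=[
        {"role": "system", "content": "Ты T-pro, виртуальный ассистент в Т-Технологии. Твоя задача - быть полезным диалоговым ассистентом."},
        {"role": "user", "content": "Объясни разницу между процессом и потоком."},  # "Explain the difference between a process and a thread."
    ],
    temperature=0.6,        # "Think mode" row of the recommended parameters
    presence_penalty=1.0,
    max_tokens=1024,
)
print(response.choices[0].message.content)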
    