
# DeepGemma-E4B-Reasoning

A reasoning fine-tune of `google/gemma-4-E4B-IT` via LoRA (rank 32), trained to produce explicit step-by-step thinking before every final answer, similar to o1-style chain-of-thought.

- **Base model:** `google/gemma-4-E4B-IT`
- **Adapter size:** ~339 MB
- **Hardware:** RTX 4090 (24 GB VRAM)
- **Framework:** Unsloth + TRL `SFTTrainer`


## Evaluation results

Evaluated on 50 samples per benchmark against the unmodified `google/gemma-4-E4B-IT` base.

| Benchmark | Base | DeepGemma | Δ |
|---|---:|---:|---:|
| 🔢 GSM8K (Math) | 44.0% | 62.0% | ▲ +18.0 pp |
| 💡 HellaSwag (Commonsense) | 80.0% | 86.0% | ▲ +6.0 pp |
| 🔬 ARC-Challenge (Science) | 92.0% | 96.0% | ▲ +4.0 pp |
| 🧠 TruthfulQA (Facts) | 70.0% | 79.0% | ▲ +9.0 pp |
| 📚 MMLU (Mixed) | 60.0% | 69.0% | ▲ +9.0 pp |
| **Overall (mean)** | **69.2%** | **78.4%** | ▲ **+9.2 pp** |

DeepGemma-E4B-Reasoning outperforms the base model across all five benchmarks.
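
Since every benchmark uses the same 50-sample budget, the Overall row is simply the unweighted mean of the five per-benchmark scores; a quick check:

```python
# Per-benchmark accuracy (%) from the table above.
base  = {"GSM8K": 44.0, "HellaSwag": 80.0, "ARC-Challenge": 92.0,
         "TruthfulQA": 70.0, "MMLU": 60.0}
tuned = {"GSM8K": 62.0, "HellaSwag": 86.0, "ARC-Challenge": 96.0,
         "TruthfulQA": 79.0, "MMLU": 69.0}

mean = lambda scores: sum(scores.values()) / len(scores)
print(f"base {mean(base):.1f}%, tuned {mean(tuned):.1f}%, "
      f"delta {mean(tuned) - mean(base):+.1f} pp")
# base 69.2%, tuned 78.4%, delta +9.2 pp
```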

The largest gain is in mathematical reasoning — GSM8K improves by +18 percentage points (44% → 62%). This is the direct effect of training on chain-of-thought distillation data: the model learns to decompose multi-step word problems into explicit intermediate steps rather than jumping to a final answer. When you force a model to write out its reasoning, arithmetic errors become visible and self-correctable mid-generation.

Commonsense reasoning (HellaSwag, +6 pp) and factual accuracy (TruthfulQA, +9 pp) also benefit: structured thinking helps the model rule out implausible answer choices before committing. The MMLU improvement (+9 pp) across diverse academic subjects suggests the reasoning fine-tune generalizes well beyond pure math, likely because the training corpus included science, philosophy, and logic traces alongside mathematical problems.

ARC-Challenge was already near ceiling at 92%, so the +4 pp gain, which closes half of the remaining 8-point headroom, is meaningful.


## What changed

The adapter teaches the model to route its internal reasoning through a dedicated thinking channel before outputting the final response. Training data was sourced exclusively from high-reasoning traces distilled from frontier models.

### Training data (≈ 30 000 deduplicated pairs)

| Source | Samples |
|---|---:|
| crownelius/Opus-4.6-Reasoning-3000x | ~3 000 |
| Jackrong/Qwen3.5-reasoning-700x | ~700 |
| TeichAI/claude-4.5-opus-high-reasoning-250x | ~250 |
| Roman1111111/gemini-3.1-pro-hard-high-reasoning | full set |
| ianncity/KIMI-K2.5-1000000x (General-Distillation subset) | 15 000 |
| Roman1111111/claude-opus-4.6-10000x | ~10 000 |
| Jackrong/GLM5.1-Reasoning-1M-Cleaned (PHD-Science subset) | 3 000 |
| Jackrong/GLM5.1-Reasoning-1M-Cleaned (Math subset) | 2 000 |
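
A mixture like this can be assembled with the 🤗 `datasets` library. The sketch below is illustrative, not the exact preprocessing used for this adapter: it assumes a `train` split for every source, treats the parenthesized names as dataset configs, and assumes `prompt`/`response` columns (adjust per source).

```python
from datasets import load_dataset, concatenate_datasets

# (repo, config, sample cap); caps follow the table above, None keeps the full set.
SOURCES = [
    ("crownelius/Opus-4.6-Reasoning-3000x", None, 3_000),
    ("Jackrong/Qwen3.5-reasoning-700x", None, 700),
    ("TeichAI/claude-4.5-opus-high-reasoning-250x", None, 250),
    ("Roman1111111/gemini-3.1-pro-hard-high-reasoning", None, None),
    ("ianncity/KIMI-K2.5-1000000x", "General-Distillation", 15_000),
    ("Roman1111111/claude-opus-4.6-10000x", None, 10_000),
    ("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "PHD-Science", 3_000),
    ("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Math", 2_000),
]

parts = []
for repo, config, cap in SOURCES:
    ds = load_dataset(repo, config, split="train")
    # Normalize every source to a shared prompt/response schema
    # (column names here are assumptions; adjust per source).
    ds = ds.map(
        lambda ex: {"prompt": ex["prompt"], "response": ex["response"]},
        remove_columns=ds.column_names,
    )
    if cap is not None:
        ds = ds.shuffle(seed=42).select(range(min(cap, len(ds))))
    parts.append(ds)

mixed = concatenate_datasets(parts)

# Deduplicate on exact prompt text.
seen = set()
mixed = mixed.filter(lambda ex: not (ex["prompt"] in seen or seen.add(ex["prompt"])))
print(len(mixed))  # ≈ 30 000 after dedup
```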

### Training config

| Parameter | Value |
|---|---|
| Max sequence length | 3 072 |
| LoRA rank / alpha | 32 / 32 |
| Epochs | 2 |
| Learning rate | 2e-4 (cosine schedule) |
| Effective batch size | 8 (per-device batch 1 × grad accum 8) |
| Optimizer | adamw_8bit |
| Precision | BF16 |
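
Put together, the table corresponds roughly to the following Unsloth + TRL setup. This is a sketch under assumptions: the `target_modules` list, the chat-template formatting, and the use of the `<|think|>` system trigger at training time are ours, and depending on your TRL version `tokenizer=` may need to be `processing_class=`.

```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "google/gemma-4-E4B-IT",
    max_seq_length=3072,
    load_in_4bit=True,
)

# LoRA rank 32 / alpha 32 on the attention and MLP projections (assumed).
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Render each pair with the chat template; the <|think|> system prompt is
# assumed to mirror the inference-time trigger.
def to_text(ex):
    msgs = [
        {"role": "system", "content": "<|think|>"},
        {"role": "user", "content": ex["prompt"]},
        {"role": "assistant", "content": ex["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

train_ds = mixed.map(to_text)  # `mixed` is the deduplicated mixture from above

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    args=SFTConfig(
        dataset_text_field="text",
        num_train_epochs=2,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,  # effective batch size 8
        optim="adamw_8bit",
        bf16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```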

## Output format

When triggered correctly, the model wraps its reasoning in a thinking channel and then gives the final answer:

```
<|channel>thought
1. First I analyze...
2. Then I consider...
3. Therefore...
<channel|>
Final answer here.
```
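
If you only want to surface the final answer, you can split the decoded text on the closing marker. A minimal sketch, assuming the markers appear verbatim as shown above (the helper name is ours, not part of the model's API):

```python
def split_reasoning(text: str, close_marker: str = "<channel|>") -> tuple[str, str]:
    """Split a decoded generation into (thinking, final_answer)."""
    if close_marker in text:
        thinking, _, answer = text.partition(close_marker)
        return thinking.strip(), answer.strip()
    return "", text.strip()  # channel never closed: treat everything as the answer
```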

## Quick start

> **Critical:** include `<|think|>` in the system prompt; without it, the thinking channel will not activate.

### With PEFT + transformers

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "google/gemma-4-E4B-IT"
adapter = "Zhantas/DeepGemma-E4B-Reasoning"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    torch_dtype=torch.bfloat16,  # the adapter was trained in BF16
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)

messages = [
    {"role": "system", "content": "<|think|>"},  # activates the thinking channel
    {"role": "user", "content": "How many r's are in strawberry? Think step by step."},
]

# return_dict=True also returns the attention mask, which generate() needs.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)
# Keep special tokens so the channel markers stay visible for parsing.
print(tokenizer.decode(out[0], skip_special_tokens=False))
```

### With Unsloth (faster inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "Zhantas/DeepGemma-E4B-Reasoning",
    max_seq_length=3072,  # matches the training sequence length
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

# Unsloth may return a multimodal processor wrapper for Gemma; unwrap to the
# underlying text tokenizer if so.
tok = tokenizer.tokenizer if hasattr(tokenizer, "tokenizer") else tokenizer

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "<|think|>"},
        {"role": "user", "content": "Prove that sqrt(2) is irrational."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)
print(tok.decode(out[0], skip_special_tokens=False))
```

## Limitations

- The adapter was trained on English and Russian data; performance on other languages is untested.
- Very long reasoning chains (over 3 072 tokens in total) may be truncated; see the budget check below.
- As a LoRA adapter, it leaves the base model weights unchanged, so all base-model limitations apply.
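
As a rough guard against truncation, you can check the token budget before generating. A sketch assuming the 3 072-token training limit is the practical ceiling for prompt plus generation (`fits_budget` is our name, not part of the model's API):

```python
MAX_SEQ_LEN = 3072  # training-time sequence limit

def fits_budget(tokenizer, prompt: str, max_new_tokens: int) -> bool:
    """True if the prompt plus the planned generation stays within the trained context."""
    n_prompt = len(tokenizer(prompt)["input_ids"])
    return n_prompt + max_new_tokens <= MAX_SEQ_LEN
```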

## License

This adapter inherits the Gemma license; use is subject to Google's Gemma Terms of Use.


Trained with Unsloth.
