DeepGemma-E4B-Reasoning

A LoRA (rank 32) reasoning fine-tune of google/gemma-4-E4B-IT, trained to produce explicit step-by-step thinking before every final answer, similar to o1-style chain-of-thought.

Base model: google/gemma-4-E4B-IT
Adapter size: ~339 MB
Hardware: RTX 4090 (24 GB VRAM)
Framework: Unsloth + TRL SFTTrainer

Evaluation results

Evaluated on 50 samples per benchmark against the unmodified google/gemma-4-E4B-IT base.

Benchmark	Base	DeepGemma	Δ
🔢 GSM8K (Math)	44.0%	62.0%	▲ +18.0%
💡 HellaSwag (Commonsense)	80.0%	86.0%	▲ +6.0%
🔬 ARC-Challenge (Science)	92.0%	96.0%	▲ +4.0%
🧠 TruthfulQA (Facts)	70.0%	79.0%	▲ +9.0%
📚 MMLU (Mixed)	60.0%	69.0%	▲ +9.0%
Overall (unweighted mean)	69.2%	78.4%	▲ +9.2%

DeepGemma-E4B-Reasoning outperforms the base model across all five benchmarks.

The largest gain is in mathematical reasoning: GSM8K improves by 18 percentage points (44% → 62%). This is consistent with training on chain-of-thought distillation data, where the model learns to decompose multi-step word problems into explicit intermediate steps rather than jumping to a final answer. Writing out the reasoning makes arithmetic errors visible, so the model can correct them mid-generation.

Commonsense reasoning (HellaSwag, +6 points) and factual accuracy (TruthfulQA, +9 points) also improve; structured thinking helps the model rule out implausible answer choices before committing. The +9-point MMLU gain across diverse academic subjects suggests the reasoning fine-tune generalizes beyond pure math, likely because the training corpus included science, philosophy, and logic traces in addition to mathematical problems.

ARC-Challenge was already near ceiling at 92%, so the +4-point gain there (46 → 48 correct out of 50) is notable given how little headroom remained.
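Because each benchmark uses only 50 samples, it can help to put rough error bars on the numbers above. The following is a minimal sketch, assuming a simple binomial model with a normal approximation and treating the base and fine-tuned runs as independent (the card does not say whether both models saw the same 50 items). The function names are illustrative and not part of any evaluation harness used here.

```python
import math

def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

def delta_z_score(base: float, tuned: float, n: int) -> float:
    """z-score of the accuracy gap, assuming the two runs are independent."""
    se_diff = math.sqrt(standard_error(base, n) ** 2 + standard_error(tuned, n) ** 2)
    return (tuned - base) / se_diff

N = 50  # samples per benchmark, as stated in the card

# (base, DeepGemma) accuracies copied from the results table
results = {
    "GSM8K": (0.44, 0.62),
    "HellaSwag": (0.80, 0.86),
    "ARC-Challenge": (0.92, 0.96),
    "TruthfulQA": (0.70, 0.79),
    "MMLU": (0.60, 0.69),
}

for name, (base, tuned) in results.items():
    half_width = 1.96 * standard_error(tuned, N)  # approximate 95% interval
    z = delta_z_score(base, tuned, N)
    print(f"{name}: {tuned:.0%} ± {half_width:.1%}, z = {z:.2f}")
```

Under these assumptions, a larger sample per benchmark would narrow the intervals roughly in proportion to the square root of the sample count.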

What changed

The adapter teaches the model to route its internal reasoning through a dedicated thinking channel before outputting the final response. Training data was sourced exclusively from high-reasoning traces distilled from frontier models.

Training data (≈ 30 000 deduplicated pairs)

Source	Size
crownelius/Opus-4.6-Reasoning-3000x	~3 000
Jackrong/Qwen3.5-reasoning-700x	~700
TeichAI/claude-4.5-opus-high-reasoning-250x	~250
Roman1111111/gemini-3.1-pro-hard-high-reasoning	full
ianncity/KIMI-K2.5-1000000x (General-Distillation)	15 000
Roman1111111/claude-opus-4.6-10000x	~10 000
Jackrong/GLM5.1-Reasoning-1M-Cleaned (PHD-Science)	3 000
Jackrong/GLM5.1-Reasoning-1M-Cleaned (Math)	2 000
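
The card does not describe how the pairs were deduplicated. One common approach is exact-match hashing on lightly normalized text; the sketch below illustrates that idea only, and the `normalize` and `deduplicate` helpers are hypothetical rather than the pipeline actually used.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(pairs):
    """Keep the first occurrence of each (prompt, response) pair."""
    seen = set()
    unique = []
    for prompt, response in pairs:
        # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        joined = normalize(prompt) + "\x00" + normalize(response)
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((prompt, response))
    return unique

sample = [
    ("What is 2 + 2?", "4"),
    ("what is  2 + 2?", "4"),  # same pair after normalization
    ("What is 3 + 3?", "6"),
]
print(len(deduplicate(sample)))
```

Exact-match hashing will not catch paraphrased duplicates across sources; near-duplicate detection would need a fuzzier method.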

Training config

Parameter	Value
Max sequence length	3 072
LoRA rank / alpha	32 / 32
Epochs	2
Learning rate	2e-4 (cosine)
Effective batch size	8 (batch 1 × grad_accum 8)
Optimizer	adamw_8bit
BF16	yes
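
To make the schedule concrete, the sketch below estimates the number of optimizer steps and the learning rate over training from the values in the table. It assumes roughly 30 000 samples with one pair per sequence (no packing) and a plain cosine decay to zero with no warmup; neither packing nor warmup is stated in the card, so treat the step count as approximate.

```python
import math

NUM_PAIRS = 30_000       # approximate dataset size from the card
EPOCHS = 2
EFFECTIVE_BATCH = 1 * 8  # per-device batch 1 x gradient accumulation 8
PEAK_LR = 2e-4

# Optimizer steps per epoch, times the number of epochs.
total_steps = math.ceil(NUM_PAIRS / EFFECTIVE_BATCH) * EPOCHS

def cosine_lr(step: int, total: int = total_steps, peak: float = PEAK_LR) -> float:
    """Cosine decay from peak to 0 (assumes no warmup phase)."""
    progress = min(max(step / total, 0.0), 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

print("total optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step):.2e}")
```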

Output format

When triggered correctly, the model wraps its reasoning in a thinking channel and then gives the final answer:

<|channel>thought
1. First I analyze...
2. Then I consider...
3. Therefore...
<channel|>
Final answer here.
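
Downstream code often needs the reasoning and the final answer separately. The following is a minimal sketch that splits decoded text on the channel markers shown above; `split_reasoning` is a hypothetical helper, and it assumes the markers appear literally in the decoded string (that is, decoding with `skip_special_tokens=False`).

```python
import re

# Matches the thinking channel exactly as shown in the output format above.
THINK_RE = re.compile(r"<\|channel>thought\n?(.*?)<channel\|>\n?", re.DOTALL)

def split_reasoning(text: str):
    """Return (reasoning, final_answer); reasoning is None if no channel is found."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

sample = (
    "<|channel>thought\n"
    "1. First I analyze...\n"
    "2. Then I consider...\n"
    "3. Therefore...\n"
    "<channel|>\n"
    "Final answer here."
)
reasoning, answer = split_reasoning(sample)
print(answer)
```

Any end-of-turn or padding tokens that follow the answer in real decoded output are left in place by this sketch and would need to be stripped separately.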

Quick start

Critical: include <|think|> in the system prompt — without it the thinking channel will not activate.
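
A small guard can keep this requirement from being forgotten. The sketch below uses a hypothetical helper, `with_thinking`, that makes sure the first system message contains the token. It assumes message contents are plain strings, and placing the token at the start of an existing system prompt is an assumption, since the examples in this card use the token as the entire system prompt.

```python
THINK_TOKEN = "<|think|>"

def with_thinking(messages):
    """Return a copy of messages whose system prompt contains the think token."""
    messages = [dict(m) for m in messages]  # shallow copy; do not mutate the input
    if messages and messages[0].get("role") == "system":
        if THINK_TOKEN not in messages[0]["content"]:
            messages[0]["content"] = THINK_TOKEN + "\n" + messages[0]["content"]
    else:
        # No system message yet: add one consisting of the token only.
        messages.insert(0, {"role": "system", "content": THINK_TOKEN})
    return messages

chat = with_thinking([{"role": "user", "content": "What is 17 * 23?"}])
print(chat[0])
```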

With PEFT + transformers

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "google/gemma-4-E4B-IT"
adapter = "Zhantas/DeepGemma-E4B-Reasoning"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

messages = [
    {"role": "system",    "content": "<|think|>"},
    {"role": "user",      "content": "How many r's are in strawberry? Think step by step."},
]

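# Tokenize with the chat template; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)  # follow device_map="auto" instead of hard-coding "cuda"

out = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)
# Keep special tokens so the thinking-channel markers stay visible.
print(tokenizer.decode(out[0], skip_special_tokens=False))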

With Unsloth (faster inference)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "Zhantas/DeepGemma-E4B-Reasoning",
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

tok = tokenizer.tokenizer if hasattr(tokenizer, "tokenizer") else tokenizer

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "<|think|>"},
        {"role": "user",   "content": "Prove that sqrt(2) is irrational."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

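# The chat template normally emits the BOS token already, so do not add a second one.
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, top_p=0.95, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=False))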

Limitations

The adapter was trained on English and Russian data; performance on other languages is untested.
Training sequences were capped at 3 072 tokens, so very long reasoning chains (> 3 072 tokens total) may be truncated.
Because this is a LoRA adapter, the base model weights are unchanged; all base model limitations apply.

License

This adapter inherits the Gemma license. Use is subject to Google's Gemma terms of service.

Trained with Unsloth.

Evaluation results (self-reported)

Metric	Dataset	Split	Value
accuracy	GSM8K	test	62.0
accuracy	HellaSwag	validation	86.0
accuracy	ARC-Challenge	test	96.0
accuracy	TruthfulQA	validation	79.0
accuracy	MMLU	test	69.0