DeepGemma-E4B-Reasoning
A reasoning fine-tune of google/gemma-4-E4B-IT via LoRA (rank 32), trained to produce
explicit step-by-step thinking before every final answer — similar to o1-style chain-of-thought.
Base model: google/gemma-4-E4B-IT
Adapter size: ~339 MB
Hardware: RTX 4090 (24 GB VRAM)
Framework: Unsloth + TRL SFTTrainer
Evaluation results
Evaluated on 50 samples per benchmark against the unmodified google/gemma-4-E4B-IT base.
| Benchmark | Base | DeepGemma | Δ |
|---|---|---|---|
| 🔢 GSM8K (Math) | 44.0% | 62.0% | ▲ +18.0% |
| 💡 HellaSwag (Commonsense) | 80.0% | 86.0% | ▲ +6.0% |
| 🔬 ARC-Challenge (Science) | 92.0% | 96.0% | ▲ +4.0% |
| 🧠 TruthfulQA (Facts) | 70.0% | 79.0% | ▲ +9.0% |
| 📚 MMLU (Mixed) | 60.0% | 69.0% | ▲ +9.0% |
| Overall | 70.0% | 79.2% | ▲ +9.2% |
DeepGemma-E4B-Reasoning outperforms the base model across all five benchmarks.
The largest gain is in mathematical reasoning — GSM8K improves by +18 percentage points (44% → 62%). This is the direct effect of training on chain-of-thought distillation data: the model learns to decompose multi-step word problems into explicit intermediate steps rather than jumping to a final answer. When you force a model to write out its reasoning, arithmetic errors become visible and self-correctable mid-generation.
Commonsense reasoning (HellaSwag +6%) and factual accuracy (TruthfulQA +9%) also benefit significantly — structured thinking helps the model rule out implausible answer choices before committing. The MMLU improvement (+9%) across diverse academic subjects suggests the reasoning fine-tune generalizes well beyond pure math, likely because the training corpus included science, philosophy, and logic traces in addition to mathematical problems.
ARC-Challenge was already near ceiling at 92%, so the +4% gain there is meaningful given how little headroom remained.
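The GSM8K row is accuracy over 50 sampled problems. Below is a rough sketch of one way such a final-answer check can be scored; the extraction regex and the `generate_fn` wrapper are illustrative assumptions, not the exact harness used here.

```python
# Rough sketch: final-answer accuracy on 50 GSM8K test questions.
# `generate_fn` is assumed to wrap model.generate and return the model's text output.
import re
from datasets import load_dataset

def last_number(text: str):
    """Return the last number appearing in a string, or None."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_accuracy(generate_fn, n_samples: int = 50) -> float:
    ds = load_dataset("gsm8k", "main", split=f"test[:{n_samples}]")
    correct = 0
    for ex in ds:
        gold = ex["answer"].split("####")[-1].strip().replace(",", "")  # GSM8K gold answer
        pred = last_number(generate_fn(ex["question"]))                 # model's final number
        correct += int(pred is not None and float(pred) == float(gold))
    return correct / n_samples
```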
What changed
The adapter teaches the model to route its internal reasoning through a dedicated thinking channel before outputting the final response. Training data was sourced exclusively from high-reasoning traces distilled from frontier models.
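For illustration, a single training pair might be rendered roughly like this; the control tokens follow the Output format section below, and the question and trace are made up for the example.

```python
# Hypothetical training example (control tokens follow the Output format section
# below; the question and reasoning trace are invented for illustration).
example = {
    "messages": [
        {"role": "system", "content": "<|think|>"},
        {"role": "user", "content": "If 3 pens cost $4.50, how much do 7 pens cost?"},
        {
            "role": "assistant",
            "content": (
                "<|channel>thought\n"
                "1. One pen costs 4.50 / 3 = 1.50 dollars.\n"
                "2. Seven pens cost 7 * 1.50 = 10.50 dollars.\n"
                "3. Therefore the answer is $10.50.\n"
                "<channel|>\n"
                "$10.50"
            ),
        },
    ]
}
```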
Training data (≈ 30 000 deduplicated pairs)
| Source | Samples |
|---|---|
| crownelius/Opus-4.6-Reasoning-3000x | ~3 000 |
| Jackrong/Qwen3.5-reasoning-700x | ~700 |
| TeichAI/claude-4.5-opus-high-reasoning-250x | ~250 |
| Roman1111111/gemini-3.1-pro-hard-high-reasoning | full |
| ianncity/KIMI-K2.5-1000000x (General-Distillation) | 15 000 |
| Roman1111111/claude-opus-4.6-10000x | ~10 000 |
| Jackrong/GLM5.1-Reasoning-1M-Cleaned (PHD-Science) | 3 000 |
| Jackrong/GLM5.1-Reasoning-1M-Cleaned (Math) | 2 000 |
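A rough sketch of how these sources could be pulled together and deduplicated with the datasets library; the shared prompt/response schema and the split name are placeholders, since every source dataset uses its own layout.

```python
# Illustrative sketch only: combine the reasoning sources listed above,
# normalize them to a shared schema, and drop duplicate prompts.
from datasets import load_dataset, concatenate_datasets

REPOS = [
    "crownelius/Opus-4.6-Reasoning-3000x",
    "Jackrong/Qwen3.5-reasoning-700x",
    "TeichAI/claude-4.5-opus-high-reasoning-250x",
    # ... remaining sources from the table above
]

def to_common_schema(ds):
    # Placeholder: map each source's own columns onto "prompt" / "response"
    return ds

parts = [to_common_schema(load_dataset(repo, split="train")) for repo in REPOS]
merged = concatenate_datasets(parts)

seen = set()
def first_occurrence(example):
    prompt = example["prompt"]
    if prompt in seen:
        return False
    seen.add(prompt)
    return True

deduped = merged.filter(first_occurrence)  # roughly 30 000 pairs after deduplication
print(len(deduped))
```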
Training config
| Parameter | Value |
|---|---|
| Max sequence length | 3 072 |
| LoRA rank / alpha | 32 / 32 |
| Epochs | 2 |
| Learning rate | 2e-4 (cosine) |
| Effective batch size | 8 (batch 1 × grad_accum 8) |
| Optimizer | adamw_8bit |
| BF16 | yes |
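The table maps onto a fairly standard Unsloth + TRL SFTTrainer run. A minimal sketch follows; `train_dataset` and the `text` field are placeholders for the formatted reasoning pairs, the LoRA target-module list is a common default rather than a confirmed choice, and argument names shift slightly between TRL versions.

```python
# Sketch of a training run matching the config table above.
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "google/gemma-4-E4B-IT",
    max_seq_length=3072,
    load_in_4bit=True,
)

# LoRA rank / alpha = 32 / 32; the target-module list is a common default, not confirmed.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,        # placeholder: ~30k formatted reasoning pairs
    dataset_text_field="text",          # placeholder column with the rendered chat
    max_seq_length=3072,
    args=TrainingArguments(
        per_device_train_batch_size=1,  # effective batch 8 = 1 x grad_accum 8
        gradient_accumulation_steps=8,
        num_train_epochs=2,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        optim="adamw_8bit",
        bf16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```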
Output format
When triggered correctly, the model wraps its reasoning in a thinking channel and then gives the final answer:
<|channel>thought
1. First I analyze...
2. Then I consider...
3. Therefore...
<channel|>
Final answer here.
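If a downstream application should show only the final answer, the reasoning block can be stripped with a small helper like the one below; the delimiter is copied verbatim from the example above and may need adjusting to the exact tokens your deployment emits.

```python
# Hypothetical helper: return only the text after the closing channel marker.
# The default delimiter is copied from the example above and is an assumption.
def strip_reasoning(text: str, end_marker: str = "<channel|>") -> str:
    _, sep, tail = text.partition(end_marker)
    return tail.strip() if sep else text.strip()
```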
Quick start
Critical: include <|think|> in the system prompt; without it the thinking channel will not activate.
With PEFT + transformers
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "google/gemma-4-E4B-IT"
adapter = "Zhantas/DeepGemma-E4B-Reasoning"

tokenizer = AutoTokenizer.from_pretrained(base)

# Load the base model, then attach the LoRA adapter on top of it
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

messages = [
    # <|think|> in the system prompt activates the thinking channel
    {"role": "system", "content": "<|think|>"},
    {"role": "user", "content": "How many r's are in strawberry? Think step by step."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

out = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(out[0], skip_special_tokens=False))
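Optionally, the LoRA weights can be folded into the base model so inference no longer needs PEFT; this uses standard PEFT functionality and is not specific to this adapter.

```python
# Merge the adapter into the base weights and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("deepgemma-e4b-merged")
tokenizer.save_pretrained("deepgemma-e4b-merged")
```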
With Unsloth (faster inference)
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "Zhantas/DeepGemma-E4B-Reasoning",
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# If a processor wrapper was returned, unwrap the underlying tokenizer
tok = tokenizer.tokenizer if hasattr(tokenizer, "tokenizer") else tokenizer

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "<|think|>"},
        {"role": "user", "content": "Prove that sqrt(2) is irrational."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, top_p=0.95, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=False))
Limitations
- The adapter was trained on English and Russian data; performance on other languages is untested.
- Very long reasoning chains (> 3 072 tokens total) may be truncated.
- As a LoRA adapter the base model weights are unchanged; all base model limitations apply.
License
This adapter inherits the Gemma license. Use is subject to Google's Gemma Terms of Use.
Trained with Unsloth.
