DeepGemma-E4B-Reasoning

A reasoning fine-tune of google/gemma-4-E4B-IT via LoRA (rank 32), trained to produce explicit step-by-step thinking before every final answer — similar to o1-style chain-of-thought.

Base model: google/gemma-4-E4B-IT
Adapter size: ~339 MB
Hardware: RTX 4090 (24 GB VRAM)
Framework: Unsloth + TRL SFTTrainer


Evaluation results

Evaluated on 50 samples per benchmark against the unmodified google/gemma-4-E4B-IT base.

| Benchmark | Base | DeepGemma | Δ (percentage points) |
|---|---|---|---|
| 🔢 GSM8K (Math) | 44.0% | 62.0% | ▲ +18.0 |
| 💡 HellaSwag (Commonsense) | 80.0% | 86.0% | ▲ +6.0 |
| 🔬 ARC-Challenge (Science) | 92.0% | 96.0% | ▲ +4.0 |
| 🧠 TruthfulQA (Facts) | 70.0% | 79.0% | ▲ +9.0 |
| 📚 MMLU (Mixed) | 60.0% | 69.0% | ▲ +9.0 |
| Overall | 70.0% | 79.2% | ▲ +9.2 |

DeepGemma-E4B-Reasoning outperforms the base model across all five benchmarks.
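
The per-benchmark deltas can be recomputed from the reported scores. The sketch below copies the five benchmark rows from the table above and also computes a plain unweighted mean; the function name `summarize` is illustrative, and the unweighted mean is an assumption about aggregation, not necessarily how the Overall row was produced.

```python
# Scores copied from the evaluation table: (base, DeepGemma), in percent.
scores = {
    "GSM8K": (44.0, 62.0),
    "HellaSwag": (80.0, 86.0),
    "ARC-Challenge": (92.0, 96.0),
    "TruthfulQA": (70.0, 79.0),
    "MMLU": (60.0, 69.0),
}


def summarize(scores):
    """Return per-benchmark deltas (percentage points) and unweighted means."""
    deltas = {name: round(tuned - base, 1) for name, (base, tuned) in scores.items()}
    n = len(scores)
    mean_base = round(sum(base for base, _ in scores.values()) / n, 1)
    mean_tuned = round(sum(tuned for _, tuned in scores.values()) / n, 1)
    return deltas, mean_base, mean_tuned


deltas, mean_base, mean_tuned = summarize(scores)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} pp")
print(f"unweighted mean: {mean_base:.1f}% -> {mean_tuned:.1f}%")
```

The unweighted mean of the five rows comes out to 69.2% for the base and 78.4% for DeepGemma. The 9.2-point difference matches the Overall delta, while the absolute Overall values in the table are slightly higher, so that row may use a different aggregation.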

The largest gain is in mathematical reasoning: GSM8K improves by 18 percentage points (44% → 62%). This most likely reflects the chain-of-thought distillation data: the model learns to decompose multi-step word problems into explicit intermediate steps rather than jumping to a final answer. When a model writes out its reasoning, arithmetic errors become visible and can be corrected mid-generation.

Commonsense reasoning (HellaSwag, +6 points) and factual accuracy (TruthfulQA, +9 points) also improve; structured thinking helps the model rule out implausible answer choices before committing. The 9-point MMLU improvement across diverse academic subjects suggests the reasoning fine-tune generalizes beyond pure math, likely because the training corpus included science, philosophy, and logic traces in addition to mathematical problems.

ARC-Challenge was already near ceiling at 92%, so the 4-point gain there (46 → 48 correct out of 50) is notable given how little headroom remained.
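
Because each benchmark uses only 50 samples, it can help to see how wide the uncertainty on a single score is. The sketch below computes a 95% Wilson score interval for the two GSM8K results (22/50 and 31/50 correct, derived from 44% and 62%). The choice of the Wilson interval and the function name `wilson_interval` are assumptions made for illustration; the original evaluation does not report intervals.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# GSM8K: 44% and 62% of 50 samples correspond to 22 and 31 correct answers.
base_lo, base_hi = wilson_interval(22, 50)
tuned_lo, tuned_hi = wilson_interval(31, 50)
print(f"base:      {base_lo:.3f} to {base_hi:.3f}")
print(f"DeepGemma: {tuned_lo:.3f} to {tuned_hi:.3f}")
```

At this sample size each interval spans roughly 26 percentage points, so larger evaluation sets would give a tighter estimate of the gains.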


What changed

The adapter teaches the model to route its internal reasoning through a dedicated thinking channel before outputting the final response. Training data was sourced exclusively from high-reasoning traces distilled from frontier models.

Training data (≈ 30 000 deduplicated pairs)

| Source | Size (examples) |
|---|---|
| crownelius/Opus-4.6-Reasoning-3000x | ~3 000 |
| Jackrong/Qwen3.5-reasoning-700x | ~700 |
| TeichAI/claude-4.5-opus-high-reasoning-250x | ~250 |
| Roman1111111/gemini-3.1-pro-hard-high-reasoning | full |
| ianncity/KIMI-K2.5-1000000x (General-Distillation) | 15 000 |
| Roman1111111/claude-opus-4.6-10000x | ~10 000 |
| Jackrong/GLM5.1-Reasoning-1M-Cleaned (PHD-Science) | 3 000 |
| Jackrong/GLM5.1-Reasoning-1M-Cleaned (Math) | 2 000 |

Training config

| Parameter | Value |
|---|---|
| Max sequence length | 3 072 tokens |
| LoRA rank / alpha | 32 / 32 |
| Epochs | 2 |
| Learning rate | 2e-4 (cosine schedule) |
| Effective batch size | 8 (batch 1 × grad_accum 8) |
| Optimizer | adamw_8bit |
| BF16 | yes |
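
As a rough worked example of what these settings imply, the sketch below derives the effective batch size, an approximate optimizer step count, and the LoRA scaling factor. It assumes roughly 30 000 training pairs (the approximate figure given above), a single GPU, no sequence packing, and the standard LoRA scaling of alpha / rank; the dictionary keys are illustrative names, not the actual training script's arguments.

```python
import math

# Values copied from the training config table; key names are illustrative.
config = {
    "per_device_batch_size": 1,
    "grad_accum_steps": 8,
    "epochs": 2,
    "lora_rank": 32,
    "lora_alpha": 32,
}
num_pairs = 30_000  # approximate dataset size, so step counts are estimates

# One optimizer step consumes batch_size * grad_accum examples.
effective_batch = config["per_device_batch_size"] * config["grad_accum_steps"]
steps_per_epoch = math.ceil(num_pairs / effective_batch)
total_steps = steps_per_epoch * config["epochs"]

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scale = config["lora_alpha"] / config["lora_rank"]

print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps: {total_steps}")
print(f"LoRA scale (alpha / rank): {lora_scale}")
```

Under these assumptions the run is on the order of 7 500 optimizer steps, and with alpha equal to rank the adapter update is applied at a scale of 1.0.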

Output format

When triggered correctly, the model wraps its reasoning in a thinking channel and then gives the final answer:

<|channel>thought
1. First I analyze...
2. Then I consider...
3. Therefore...
<channel|>
Final answer here.
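
For downstream use it is often convenient to separate the reasoning from the final answer. The sketch below is one way to do that with a regular expression. It assumes the markers appear exactly as shown above (`<|channel>thought` and `<channel|>`); the function name `split_reasoning` is hypothetical, and real decoded output may also contain the prompt and end-of-turn tokens that would need stripping first.

```python
import re

# Assumed markers, copied from the output format shown above.
THINK_OPEN = "<|channel>thought"
THINK_CLOSE = "<channel|>"


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is None if no thinking block is found."""
    pattern = re.escape(THINK_OPEN) + r"(.*?)" + re.escape(THINK_CLOSE)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


sample = (
    "<|channel>thought\n"
    "1. First I analyze...\n"
    "2. Then I consider...\n"
    "3. Therefore...\n"
    "<channel|>\n"
    "Final answer here."
)
reasoning, answer = split_reasoning(sample)
print(answer)
```

Returning `None` for the reasoning makes it easy to detect generations where the thinking channel did not activate.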

Quick start

Critical: include <|think|> in the system prompt — without it the thinking channel will not activate.
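
To avoid forgetting the tag, a small helper can add it to any message list before the chat template is applied. This is a minimal sketch: the helper name `with_thinking` is hypothetical, message contents are assumed to be plain strings, and prepending the tag to an existing system prompt with a newline separator is an assumption that has not been verified against the model's chat template.

```python
THINK_TAG = "<|think|>"


def with_thinking(messages):
    """Return a copy of `messages` whose system prompt contains THINK_TAG."""
    messages = [dict(m) for m in messages]  # copy so the caller's list is untouched
    if messages and messages[0]["role"] == "system":
        if THINK_TAG not in messages[0]["content"]:
            # Assumption: the tag may be combined with other system text.
            messages[0]["content"] = THINK_TAG + "\n" + messages[0]["content"]
        return messages
    # No system message yet: add one that contains only the tag.
    return [{"role": "system", "content": THINK_TAG}] + messages


chat = with_thinking([{"role": "user", "content": "What is 17 * 23?"}])
print(chat[0])
```

The returned list can be passed to `apply_chat_template` in place of a hand-written `messages` list.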

With PEFT + transformers

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "google/gemma-4-E4B-IT"
adapter = "Zhantas/DeepGemma-E4B-Reasoning"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

messages = [
    {"role": "system",    "content": "<|think|>"},
    {"role": "user",      "content": "How many r's are in strawberry? Think step by step."},
]

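# return_dict=True yields input_ids and attention_mask together, which works
# whether or not the installed transformers version returns a bare tensor by default.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)  # follow wherever device_map="auto" placed the model

out = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)
# Keep special tokens so the thinking-channel markers stay visible.
print(tokenizer.decode(out[0], skip_special_tokens=False))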

With Unsloth (faster inference)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "Zhantas/DeepGemma-E4B-Reasoning",
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Unsloth may return a processor that wraps the tokenizer; unwrap it if so.
tok = tokenizer.tokenizer if hasattr(tokenizer, "tokenizer") else tokenizer

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "<|think|>"},
        {"role": "user",   "content": "Prove that sqrt(2) is irrational."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, top_p=0.95, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=False))

Limitations

  • The adapter was trained on English and Russian data; performance on other languages is untested.
  • Very long reasoning chains (> 3 072 tokens total) may be truncated.
  • Because this is a LoRA adapter, the base model weights are unchanged; all base model limitations apply.
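
One way to stay within the sequence-length limit mentioned above is to cap `max_new_tokens` based on the prompt length. The sketch below treats the training-time maximum of 3 072 tokens as the total budget, which is an assumption; the helper name `capped_new_tokens` is hypothetical.

```python
MAX_SEQ_LEN = 3072  # training max sequence length from the config table


def capped_new_tokens(prompt_tokens, requested_new_tokens, max_seq_len=MAX_SEQ_LEN):
    """Limit generation so prompt + new tokens stays within max_seq_len."""
    remaining = max_seq_len - prompt_tokens
    return max(0, min(requested_new_tokens, remaining))


# Example: a 2 500-token prompt leaves room for 572 new tokens.
print(capped_new_tokens(2500, 1024))
```

In the quick-start snippets, `prompt_tokens` would be the length of the tokenized prompt, and the result would be passed as `max_new_tokens` to `model.generate`.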

License

This adapter inherits the Gemma license. Use is subject to Google's Gemma terms of service.


Trained with Unsloth.
