Instructions to use mlx-community/gemma-4-e4b-it-OptiQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use mlx-community/gemma-4-e4b-it-OptiQ-4bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use mlx-community/gemma-4-e4b-it-OptiQ-4bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/gemma-4-e4b-it-OptiQ-4bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "mlx-community/gemma-4-e4b-it-OptiQ-4bit" }
      ]
    }
  }
}

Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use mlx-community/gemma-4-e4b-it-OptiQ-4bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/gemma-4-e4b-it-OptiQ-4bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mlx-community/gemma-4-e4b-it-OptiQ-4bit
Run Hermes
hermes
- MLX LM
How to use mlx-community/gemma-4-e4b-it-OptiQ-4bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "mlx-community/gemma-4-e4b-it-OptiQ-4bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "mlx-community/gemma-4-e4b-it-OptiQ-4bit"

# Call the OpenAI-compatible server with curl (default port 8080)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mlx-community/gemma-4-e4b-it-OptiQ-4bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
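The same endpoint can also be called from Python; a minimal sketch, assuming the openai client package is installed and the server is running on its default port 8080:

```python
# pip install openai
from openai import OpenAI

# mlx_lm.server does not check API keys, so any placeholder value works
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="mlx-community/gemma-4-e4b-it-OptiQ-4bit",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```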
mlx-community/gemma-4-e4b-it-OptiQ-4bit
A 4-bit mixed-precision MLX quant produced by mlx-optiq, the sensitivity-aware quantization toolkit for Apple Silicon.
A 4-bit mixed-precision MLX quant of google/gemma-4-e4b-it. Per-layer bit-widths come from a KL-divergence sensitivity pass on a six-domain calibration mix (prose · reasoning · code · agent · tool-call · constraint-bearing instructions). Sensitive layers go to 8-bit; robust ones stay at 4-bit. The on-disk size is within ~5 % of a stock uniform 4-bit MLX quant.
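For intuition only, here is a minimal sketch of this kind of sensitivity-driven allocation. The helper names and the random sensitivities are hypothetical; this is not the mlx-optiq implementation.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    # KL(P || Q) between bf16 reference logits and quantized logits, averaged over tokens
    p = np.exp(p_logits - p_logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    q = np.exp(q_logits - q_logits.max(-1, keepdims=True))
    q /= q.sum(-1, keepdims=True)
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1).mean())

def allocate_bits(sensitivity_by_layer, n_8bit):
    # The most KL-sensitive layers are promoted to 8-bit; the rest stay at 4-bit.
    ranked = sorted(sensitivity_by_layer, key=sensitivity_by_layer.get, reverse=True)
    promoted = set(ranked[:n_8bit])
    return {name: (8 if name in promoted else 4) for name in ranked}

# In the real pipeline, per-layer sensitivities would come from kl_divergence()
# over the calibration mix; here they are random placeholders.
sensitivities = {f"layer_{i}": float(s) for i, s in enumerate(np.random.rand(379))}
bits = allocate_bits(sensitivities, n_8bit=155)
print(sum(b == 8 for b in bits.values()), "layers at 8-bit")
```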
Quantization details
| Property | Value |
|---|---|
| Predominant precision | 4-bit |
| Layers at 8-bit (sensitive) | 155 |
| Layers at 4-bit (robust) | 224 |
| Total quantized layers | 379 |
| Group size | 64 |
| Calibration mix | six-domain mix (40 samples × 6 domains) |
| Reference for sensitivity | bf16 (auto-resolved; falls back to uniform-4-bit if bf16 doesn't fit) |
We follow the same naming convention llama.cpp uses for Q4_K_M and similar mixed-precision quants: the "4-bit" label is for the predominant precision, not the weighted average. The mixed allocation is what lets this build beat stock uniform-4-bit at the same disk size. Benchmark deltas are below.
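To see how the allocation in the table above maps onto the shipped weights, you can inspect the repo's config.json. This is a sketch and assumes the per-layer bit-widths are recorded under the quantization section of the config, as mlx-lm does for mixed-precision quants; adjust the keys if the layout differs.

```python
import json
from collections import Counter

from huggingface_hub import hf_hub_download

# Download only the config and tally per-layer bit-widths.
config_path = hf_hub_download("mlx-community/gemma-4-e4b-it-OptiQ-4bit", "config.json")
with open(config_path) as f:
    quant_cfg = json.load(f).get("quantization", {})

counts = Counter(
    v["bits"] for v in quant_cfg.values() if isinstance(v, dict) and "bits" in v
)
print(counts)  # expected to roughly match the table: 8-bit vs 4-bit layer counts
```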
Usage
Load it with mlx-lm and use it as usual:
pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")

response = generate(
    model, tokenizer,
    prompt="Explain quantum computing in simple terms.",
    max_tokens=200,
)
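Because this is an instruction-tuned checkpoint, wrapping the prompt in the tokenizer's chat template (the same pattern as the MLX snippet near the top of this page) generally gives better results:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```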
For more features (mixed-precision KV-cache serving, sensitivity-aware LoRA fine-tuning, an OpenAI- and Anthropic-compatible inference server, hot-swappable mounted adapters, and sandboxed Python execution for agent workflows), install mlx-optiq:
pip install mlx-optiq
See the Gemma-4 family guide on mlx-optiq.com for sampling defaults, training recipes, and family-specific caveats.
Benchmarks
Five-metric suite that drives the Capability Score:
| Metric | Score |
|---|---|
| MMLU (5-shot, 1000 samples) | 58.8% |
| GSM8K (1000 samples, 3-shot CoT) | 77.8% |
| IFEval (full set, strict) | 70.6% |
| BFCL-V3 simple (200 single-turn calls) | 69.0% |
| HumanEval (164 problems, pass@1) | 76.8% |
| Capability Score (mean of the 5 benchmarks above) | 70.6 |
| KL vs bf16 reference (mean / p95) | 0.2755 / 1.3460 |
| On-disk size | 6.1 GB |
The Capability Score is the simple unweighted mean of the five benchmarks. Every metric gets one equal vote. Disk size is reported next to it as an honest second axis instead of being folded into the score. See the eval-framework writeup for the full methodology.
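As a quick sanity check, the reported score is exactly that unweighted mean:

```python
scores = {"MMLU": 58.8, "GSM8K": 77.8, "IFEval": 70.6, "BFCL-V3": 69.0, "HumanEval": 76.8}
capability_score = sum(scores.values()) / len(scores)
print(round(capability_score, 1))  # 70.6
```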
Links
- Project website: mlx-optiq.com
- Gemma-4 family guide: mlx-optiq.com/docs/gemma-4
- PyPI: pypi.org/project/mlx-optiq
- Calibration mix: mlx-optiq.com/blog/calibration-mix
- Eval framework: mlx-optiq.com/blog/eval-framework
- Base model: google/gemma-4-e4b-it
License
Gemma license (inherits from base model). See https://ai.google.dev/gemma/terms for the terms of use.