Antonio Gemma3 Evo Q4: Self-Learning AI for Raspberry Pi
Antonio Gemma3 Evo Q4 is not just another quantized LLM. It is a self-learning micro-intelligence with EvoMemory™, RAG-Lite, and auto-evolution capabilities, optimized for Raspberry Pi 4 and 5 and validated for continuous operation in a 60-minute soak test.
- Version: v0.5.0 (new: Adaptive Prompting)
- Author: Antonio Consales (antconsales)
- Base Model: Google Gemma 3 1B IT
Production-Ready for Raspberry Pi
- Tested on Raspberry Pi 4 (4GB): 3.32 t/s sustained (100% reliable over 60 minutes)
- Fully offline: no external APIs, no internet required
- Self-learning: EvoMemory™ saves neurons from every conversation
- Bilingual: seamlessly switches between Italian and English
- 24/7 deployment tested: zero failures in a 60-minute soak test
What Makes It Special
Unlike traditional quantized models, Antonio Gemma3 Evo Q4 learns and evolves:
- EvoMemory™: Saves "neurons" with input, output, confidence, and mood
- RAG-Lite: Retrieves past experiences using BM25 (no FAISS)
- Self-evaluation: Assigns confidence scores (0-1) to every response
- Auto-evolution: Generates new reasoning rules from accumulated neurons
- 100% Offline: Runs completely local on Raspberry Pi 4 (4GB RAM)
- Bilingual: Auto-detects IT/EN and responds in the same language
- Fast: 3.32 tokens/s sustained on Pi 4 with Q4_K_M quantization
- Adaptive Prompting (new): Smart question classification (SIMPLE/COMPLEX/CODE/CREATIVE) for a 3.6x speedup on simple queries
"The little brain that grows with you"
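The Adaptive Prompting feature listed above classifies each question before answering it. The following is a minimal sketch of how such a classifier could look, assuming simple keyword and length heuristics. The function name `classify_question`, the keyword lists, and the `MAX_TOKENS` budgets are hypothetical and are not taken from the project's code.

```python
import re

# Hypothetical keyword lists; the project's real classifier may use other signals.
CODE_HINTS = re.compile(r"\b(code|function|python|script|bug|regex|compile)\b", re.I)
CREATIVE_HINTS = re.compile(r"\b(story|poem|song|imagine|invent)\b", re.I)
COMPLEX_HINTS = re.compile(r"\b(why|explain|compare|difference|analy[sz]e)\b", re.I)


def classify_question(text: str) -> str:
    """Return one of SIMPLE, COMPLEX, CODE, CREATIVE for a user question."""
    if CODE_HINTS.search(text):
        return "CODE"
    if CREATIVE_HINTS.search(text):
        return "CREATIVE"
    # Long questions or explicit reasoning requests are treated as complex.
    if COMPLEX_HINTS.search(text) or len(text.split()) > 25:
        return "COMPLEX"
    return "SIMPLE"


# Assumed generation budgets per class: shorter answers for simple questions
# are one plausible source of a speedup on a slow CPU.
MAX_TOKENS = {"SIMPLE": 64, "COMPLEX": 256, "CODE": 256, "CREATIVE": 192}
```

A caller would pick `MAX_TOKENS[classify_question(user_text)]` as the generation limit before sending the prompt to the model.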
Benchmark Results (Updated Oct 21, 2025)
Complete 60-minute soak test on Raspberry Pi 4 (4GB RAM) with Ollama.
Production Metrics
| Metric | Value | Status |
|---|---|---|
| Sustained throughput | 3.32 t/s (256 tokens) | Production-ready |
| Reliability | 100% (455/455 requests) | Perfect |
| Avg response time | 7.92s | Consistent |
| Thermal stability | 70.2°C avg (max 73.5°C) | No throttling |
| Memory usage | 42% (1.6 GB) | No leaks |
| Uptime tested | 60+ minutes continuous | 24/7 ready |
Performance by Token Count
| Tokens | Run 1 | Run 2 | Run 3 | Average* |
|---|---|---|---|---|
| 128 | 0.24 | 3.44 | 3.45 | 3.45 t/s |
| 256 | 3.43 | 3.09 | 3.43 | 3.32 t/s |
| 512 | 2.76 | 2.77 | 2.11 | 2.55 t/s |
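*All values are in tokens/s. The 128-token average excludes the cold-start first run (0.24 t/s); the 256- and 512-token averages include all three runs.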
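The averages in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a run counts as a cold start when it is less than half the mean of the later runs; that threshold (`cold_start_ratio`) is an illustrative heuristic, not a rule stated by the benchmark. Under this assumption only the 128-token row drops its first run.

```python
from decimal import Decimal, ROUND_HALF_UP

# Throughput in tokens/s for each run, copied from the table above.
RUNS = {
    128: ["0.24", "3.44", "3.45"],
    256: ["3.43", "3.09", "3.43"],
    512: ["2.76", "2.77", "2.11"],
}


def average(values):
    """Arithmetic mean of a list of Decimal values."""
    return sum(values) / len(values)


def summarize(raw, cold_start_ratio=Decimal("0.5")):
    """Average the runs, dropping the first one if it looks like a cold start."""
    values = [Decimal(v) for v in raw]
    first, rest = values[0], values[1:]
    # Assumed heuristic: a first run far below the later runs is a cold start.
    warm = rest if first < cold_start_ratio * average(rest) else values
    return average(warm).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for tokens, raw in RUNS.items():
    print(f"{tokens} tokens: {summarize(raw)} t/s")
```

Decimal arithmetic is used so that the half-way case in the 128-token row rounds the same way a reader would round it by hand.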
Model Comparison
| Model | Size | Sustained Speed | Recommended For |
|---|---|---|---|
| Q4_K_M (recommended) | 769 MB | 3.32 t/s | Production (tested 60min, 100% reliable) |
| Q4_0 | 687 MB | 3.45 t/s | Development (faster but less stable) |
View Complete Benchmark Report: full performance, reliability, and stability analysis
Benchmark Methodology
- Duration: 60.1 minutes (3,603 seconds)
- Total requests: 455 (2-second interval)
- Platform: Raspberry Pi 4 (4GB RAM, ARM Cortex-A72 @ 1.5GHz)
- Runtime: Ollama 0.3.x + llama.cpp backend
- Monitoring: CPU temp, RAM usage, load average (sampled every 5s)
- Tasks: Performance (128/256/512 tokens), Quality (HellaSwag, ARC, TruthfulQA), Robustness (soak test)
Recommendation: Use Q4_K_M for production deployments (proven 100% reliability over 60 minutes). Use Q4_0 for development/testing if you need slightly faster inference.
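The soak-test part of the methodology (a fixed number of requests at a fixed interval, with reliability and average response time reported at the end) can be sketched as follows. This is not the benchmark script used for the results above. `send` is a hypothetical callable that performs one inference request and returns `True` on success, and the `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SoakResult:
    total: int
    succeeded: int
    avg_seconds: float

    @property
    def reliability(self) -> float:
        """Share of successful requests, in percent."""
        return 100.0 * self.succeeded / self.total if self.total else 0.0


def soak_test(send: Callable[[], bool], requests: int, interval_s: float = 2.0,
              sleep: Callable[[float], None] = time.sleep) -> SoakResult:
    """Call `send` repeatedly, pausing `interval_s` seconds between calls."""
    durations: List[float] = []
    succeeded = 0
    for _ in range(requests):
        start = time.perf_counter()
        try:
            ok = bool(send())
        except Exception:
            ok = False  # any exception counts as a failed request
        durations.append(time.perf_counter() - start)
        succeeded += int(ok)
        sleep(interval_s)
    avg = sum(durations) / len(durations) if durations else 0.0
    return SoakResult(total=requests, succeeded=succeeded, avg_seconds=avg)
```

With the figures reported above, a run of 455 requests with 455 successes yields `reliability == 100.0`.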
Available Models
This repository contains two quantization variants:
- gemma3-1b-q4_0.gguf (≈687 MB): Faster, about 4% higher throughput (3.45 vs 3.32 t/s), suitable for development
- gemma3-1b-q4_k_m.gguf (≈769 MB): Better quality, production-tested for 60+ minutes
Important: Two Usage Modes
Mode 1: Ollama Only (Simple Inference)
Download the GGUF model and run with Ollama:
ollama pull antconsales/antonio-gemma3-evo-q4
ollama run antconsales/antonio-gemma3-evo-q4
What you get:
- Fast inference (3.32 t/s on Pi 4)
- Bilingual chat (IT/EN)
- Offline, privacy-first
- No EvoMemory (conversations are not saved)
- No RAG (past experiences are not retrieved)
- No auto-evolution (no rules are generated)
Best for: Quick tests, one-off questions, simple chatbot
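Once the model is pulled, it can also be queried from Python through Ollama's local REST API. The sketch below uses only the standard library and assumes Ollama is listening on its default port 11434. The helper names `build_request`, `ask`, and `tokens_per_second` are illustrative, not part of this project.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "antconsales/antonio-gemma3-evo-q4"


def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(reply: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / reply["eval_duration"] * 1e9


def ask(prompt: str) -> dict:
    """Send the prompt and return the decoded JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `reply = ask("Ciao! Chi sei?")` returns a dictionary whose `response` field holds the generated text, and `tokens_per_second(reply)` gives a figure comparable to the throughput numbers above.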
Mode 2: Full Evolution Stack (Self-Learning)
For EvoMemory™, RAG-Lite, and auto-evolution, use the full Python stack from GitHub:
git clone https://github.com/antconsales/antonio-gemma3-evo-q4.git
cd antonio-gemma3-evo-q4
bash scripts/install.sh
uvicorn api.server:app --host 0.0.0.0 --port 8000
What you get:
- EvoMemory™: Saves neurons from every conversation
- RAG-Lite: Retrieves past experiences (BM25)
- Auto-evolution: Generates reasoning rules over time
- Confidence scoring: Knows when it's uncertain
- FastAPI server: REST + WebSocket endpoints
Comparison:
| Feature | Ollama Only | Full Stack |
|---|---|---|
| Inference speed | 3.32 t/s | 3.32 t/s |
| Learns from chats | No | Yes (EvoMemory) |
| Retrieves memories | No | Yes (RAG-Lite) |
| Generates rules | No | Yes (Auto-evolution) |
| API endpoints | No | Yes (FastAPI) |
| Setup time | 1 min | 5 min |
Quick Start Options
Option 1: Ollama Only (see Mode 1 above)
Option 2: Load Directly from GGUF
# Download model from HuggingFace
wget https://huggingface.co/chill123/antonio-gemma3-evo-q4/resolve/main/gemma3-1b-q4_k_m.gguf
# Create Modelfile
cat > Modelfile <<'EOF'
FROM ./gemma3-1b-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER repeat_penalty 1.05
PARAMETER stop "<end_of_turn>"
PARAMETER stop "</s>"
SYSTEM """You are Antonio, an offline AI assistant running on a Raspberry Pi. You MUST detect the user's language and respond in the SAME language:
- If the user writes in Italian, respond ONLY in Italian
- If the user writes in English, respond ONLY in English
You are helpful, friendly, and concise. When you're uncertain, you admit it instead of guessing."""
EOF
# Create and run model
ollama create antonio-evo -f Modelfile
ollama run antonio-evo
Quick Start with Full Evolution Stack
For the complete self-learning system with EvoMemory™, RAG-Lite, and auto-evolution:
# Clone the full project
git clone https://github.com/antconsales/antonio-gemma3-evo-q4.git
cd antonio-gemma3-evo-q4
# Install and run
bash scripts/install.sh
python -m api.server
Visit: http://localhost:8000/docs for interactive API documentation
Features of the full stack:
- EvoMemory™ SQLite database
- RAG-Lite with BM25 search
- Confidence auto-evaluation
- Rule regeneration (auto-evolution)
- FastAPI server with WebSocket support
- MCP-compatible tool system
Key Features
1. EvoMemory™: Living Memory
Every conversation creates a neuron:
{
"id": 123,
"input_text": "Accendi il LED rosso",
"output_text": "OK, attivo GPIO 17 su HIGH",
"confidence": 0.85,
"mood": "positive",
"user_feedback": 1,
"skill_id": "gpio_control",
"timestamp": "2025-10-21T14:30:00Z"
}
Features:
- Auto-pruning of low-confidence old neurons
- Neuron compression (similar patterns → meta-neurons)
- Context-aware retrieval via hash matching
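To make the neuron record and the pruning idea concrete, here is a minimal sketch of an SQLite store whose columns mirror the JSON fields shown above. The table name, the helper functions, and the pruning rule (delete rows that are both below a confidence threshold and older than a cutoff) are assumptions for illustration. The project's actual schema and pruning policy may differ.

```python
import sqlite3

# Assumed schema: one column per field of the example neuron.
SCHEMA = """
CREATE TABLE IF NOT EXISTS neurons (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    input_text TEXT NOT NULL,
    output_text TEXT NOT NULL,
    confidence REAL NOT NULL,
    mood TEXT,
    user_feedback INTEGER DEFAULT 0,
    skill_id TEXT,
    timestamp TEXT NOT NULL
)
"""


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the neuron database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_neuron(conn: sqlite3.Connection, neuron: dict) -> int:
    """Insert one neuron and return its row id; an 'id' key in the dict is ignored."""
    cur = conn.execute(
        "INSERT INTO neurons (input_text, output_text, confidence, mood,"
        " user_feedback, skill_id, timestamp)"
        " VALUES (:input_text, :output_text, :confidence, :mood,"
        " :user_feedback, :skill_id, :timestamp)",
        neuron,
    )
    conn.commit()
    return cur.lastrowid


def prune(conn: sqlite3.Connection, min_confidence: float, older_than: str) -> int:
    """Delete neurons that are both low-confidence and older than the cutoff.

    Timestamps are compared as ISO 8601 strings, which sort chronologically
    when they share the same format and time zone.
    """
    cur = conn.execute(
        "DELETE FROM neurons WHERE confidence < ? AND timestamp < ?",
        (min_confidence, older_than),
    )
    conn.commit()
    return cur.rowcount
```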
2. RAG-Lite: Pure Python BM25
No FAISS, no ChromaDB. Just:
- BM25 scoring (Okapi formula)
- SQLite full-text search (FTS5)
- Top-K retrieval with confidence threshold
- Zero external dependencies
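For readers unfamiliar with BM25, the following is a small, dependency-free sketch of Okapi BM25 scoring with top-K retrieval. It is an illustration of the technique, not the project's RAG-Lite code. The class name `BM25`, the defaults `k1=1.5` and `b=0.75`, and the `min_score` cutoff (standing in for the threshold mentioned above) are assumptions.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class BM25:
    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / max(len(self.docs), 1)
        n = len(self.docs)
        # Document frequency: in how many documents each term appears.
        df = Counter(term for d in self.docs for term in set(d))
        # IDF with +1 inside the log so scores stay non-negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, index: int) -> float:
        """BM25 score of document `index` for the query."""
        tf, dl = self.tf[index], len(self.docs[index])
        length_ratio = dl / self.avgdl if self.avgdl else 0.0
        norm = self.k1 * (1 - self.b + self.b * length_ratio)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query: str, k: int = 3, min_score: float = 0.0) -> List[Tuple[int, float]]:
        """Return up to k (doc_index, score) pairs scoring above min_score."""
        scored = [(i, self.score(query, i)) for i in range(len(self.docs))]
        hits = [(i, s) for i, s in scored if s > min_score]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)[:k]
```

Indexing the `input_text` of stored neurons and calling `top_k` on a new user message would return the most similar past conversations, which can then be added to the prompt.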
3. Auto-Evolution: Rule Regeneration
Every N conversations (configurable), Antonio:
- Analyzes high-confidence neurons
- Extracts reasoning patterns
- Generates new rules (e.g., "If user asks time-sensitive question → check recency")
- Saves rules to instinct.json
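The loop described above can be sketched as follows. This is only an illustration under stated assumptions: rules are derived by grouping high-confidence neurons by `skill_id`, and the layout written to `instinct.json` is invented for the example. The project's pattern extraction (which may involve the model itself) and its real file format are not documented here. `extract_rules`, `maybe_evolve`, `every_n`, and `min_support` are hypothetical names.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Optional


def extract_rules(neurons: List[dict], min_confidence: float = 0.8,
                  min_support: int = 3) -> List[dict]:
    """Turn groups of high-confidence neurons into simple per-skill rules."""
    by_skill: Dict[str, List[dict]] = defaultdict(list)
    for neuron in neurons:
        if neuron.get("confidence", 0.0) >= min_confidence and neuron.get("skill_id"):
            by_skill[neuron["skill_id"]].append(neuron)
    rules = []
    for skill, group in sorted(by_skill.items()):
        if len(group) >= min_support:  # require repeated evidence before making a rule
            rules.append({
                "skill_id": skill,
                "support": len(group),
                "rule": f"Prefer the '{skill}' skill for inputs similar to its stored examples",
            })
    return rules


def maybe_evolve(neurons: List[dict], conversation_count: int, every_n: int = 50,
                 path: str = "instinct.json") -> Optional[List[dict]]:
    """Regenerate rules every `every_n` conversations and write them to `path`."""
    if conversation_count == 0 or conversation_count % every_n != 0:
        return None  # not time to evolve yet
    rules = extract_rules(neurons)
    # Assumed file layout: a single object with a "rules" list.
    Path(path).write_text(json.dumps({"rules": rules}, indent=2), encoding="utf-8")
    return rules
```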
Architecture
┌─────────────────────────────────────────────┐
│  Antonio Gemma3 Evo Q4 - Evolution Layer    │
├─────────────────────────────────────────────┤
│                                             │
│  ┌──────────────┐      ┌─────────────────┐  │
│  │  EvoMemory™  │◄────►│    RAG-Lite     │  │
│  │   (SQLite)   │      │   BM25 Search   │  │
│  └──────────────┘      └─────────────────┘  │
│         ▲                      ▲            │
│         │                      │            │
│  ┌──────┴──────────────────────┴──────────┐ │
│  │     Inference Engine (llama.cpp)       │ │
│  │     • Q4_0 / Q4_K_M                    │ │
│  │     • Optimized for Pi 4               │ │
│  └────────────────────────────────────────┘ │
│         ▲                      │            │
│         │                      ▼            │
│  ┌──────┴─────────┐    ┌──────────────────┐ │
│  │ Action Broker  │    │   Confidence     │ │
│  │  (MCP-ready)   │    │   Auto-Eval      │ │
│  └────────────────┘    └──────────────────┘ │
│                                             │
│      FastAPI Server (REST + WebSocket)      │
└─────────────────────────────────────────────┘
Use Cases
Recommended for:
- Home AI assistants (24/7 operation)
- IoT edge inference (low power budget)
- Offline chatbots (privacy-first)
- Educational projects (affordable hardware)
- Voice assistants (bilingual IT/EN)
- Self-learning experiments (neuron/rule evolution)
Not recommended for:
- Real-time applications (<500ms latency)
- Batch processing (CPU-bound, single-threaded)
- High concurrency (>5 simultaneous users)
Links
- GitHub (Full Stack): https://github.com/antconsales/antonio-gemma3-evo-q4
- Ollama: https://ollama.com/antconsales/antonio-gemma3-evo-q4
- HuggingFace: https://huggingface.co/chill123/antonio-gemma3-evo-q4
- Donate: https://www.paypal.com/donate/?business=58ML44FNPK66Y&currency_code=EUR
License
This model is licensed under the Gemma License Agreement (inherited from the base model).
The project as a whole uses two licenses:
- Gemma License (model weights)
- MIT License (Python code of the evolution stack: EvoMemory™, RAG-Lite, etc.)
See LICENSE for details.
Built with ❤️ for offline AI and edge computing
"Il piccolo cervello che cresce insieme a te" ("The little brain that grows with you"), Antonio Gemma3 Evo
Support ethical, local, and independent AI. Every donation helps Antonio Gemma grow and evolve.