Instructions for using prism-ml/Bonsai-8B-gguf with libraries, notebooks, and local apps.

Libraries

How to use prism-ml/Bonsai-8B-gguf with llama-cpp-python:

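# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hugging Face Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
    repo_id="prism-ml/Bonsai-8B-gguf",
    filename="Bonsai-8B-Q1_0.gguf",
)

# Send a single-turn chat request and print the assistant's reply.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
print(response["choices"][0]["message"]["content"])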

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use prism-ml/Bonsai-8B-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf prism-ml/Bonsai-8B-gguf:Q1_0
# Run inference directly in the terminal:
llama-cli -hf prism-ml/Bonsai-8B-gguf:Q1_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf prism-ml/Bonsai-8B-gguf:Q1_0
# Run inference directly in the terminal:
llama-cli -hf prism-ml/Bonsai-8B-gguf:Q1_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf prism-ml/Bonsai-8B-gguf:Q1_0
# Run inference directly in the terminal:
./llama-cli -hf prism-ml/Bonsai-8B-gguf:Q1_0

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf prism-ml/Bonsai-8B-gguf:Q1_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf prism-ml/Bonsai-8B-gguf:Q1_0

Use Docker

docker model run hf.co/prism-ml/Bonsai-8B-gguf:Q1_0

LM Studio
Jan

vLLM

How to use prism-ml/Bonsai-8B-gguf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prism-ml/Bonsai-8B-gguf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prism-ml/Bonsai-8B-gguf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/prism-ml/Bonsai-8B-gguf:Q1_0

Ollama
How to use prism-ml/Bonsai-8B-gguf with Ollama:
```
ollama run hf.co/prism-ml/Bonsai-8B-gguf:Q1_0
```

Unsloth Studio

How to use prism-ml/Bonsai-8B-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for prism-ml/Bonsai-8B-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for prism-ml/Bonsai-8B-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for prism-ml/Bonsai-8B-gguf to start chatting

Pi

How to use prism-ml/Bonsai-8B-gguf with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf prism-ml/Bonsai-8B-gguf:Q1_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "prism-ml/Bonsai-8B-gguf:Q1_0"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use prism-ml/Bonsai-8B-gguf with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf prism-ml/Bonsai-8B-gguf:Q1_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default prism-ml/Bonsai-8B-gguf:Q1_0

Run Hermes

hermes

Docker Model Runner
How to use prism-ml/Bonsai-8B-gguf with Docker Model Runner:
```
docker model run hf.co/prism-ml/Bonsai-8B-gguf:Q1_0
```

Lemonade

How to use prism-ml/Bonsai-8B-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull prism-ml/Bonsai-8B-gguf:Q1_0

Run and chat with the model

lemonade run user.Bonsai-8B-gguf-Q1_0

List all available models

lemonade list


Bonsai-8B-GGUF-1bit

End-to-end 1-bit language model for llama.cpp (CUDA, Metal, CPU)

14.2x smaller than FP16 | 6.2x faster on RTX 4090 | 4-5x lower energy/token

Highlights

1.15 GB parameter memory (down from 16.38 GB FP16) — fits on virtually any device with a GPU
End-to-end 1-bit weights across embeddings, attention projections, MLP projections, and LM head
GGUF Q1_0 (g128) format with inline dequantization kernels — no FP16 materialization
Cross-platform: CUDA (RTX/datacenter), Metal (Mac), Android, CPU
Competitive benchmarks: 70.5 avg score across 6 categories, matching full-precision 8B models at 1/14th the size
MLX companion: also available as MLX 1-bit g128 for native Apple Silicon inference

Frontier Efficiency

Resources

Google Colab — try Bonsai in your browser, no setup required
Whitepaper — for more details on Bonsai, check out our whitepaper
Demo repo — comprehensive examples for serving, benchmarking, and integrating Bonsai
Discord — join the community for support, discussion, and updates
1-bit kernels: llama.cpp fork (CUDA + Metal) · MLX fork (Apple Silicon) · mlx-swift fork (iOS/macOS)
Locally AI — we have partnered with Locally AI for iPhone support

Model Overview

Item	Specification
Parameters	8.19B (~6.95B non-embedding)
Architecture	Qwen3-8B dense: GQA (32 query / 8 KV heads), SwiGLU MLP, RoPE, RMSNorm
Layers	36 Transformer decoder blocks
Context length	65,536 tokens
Vocab size	151,936
Weight format	GGUF Q1_0
Deployed size	1.15 GB (14.2x smaller than FP16)
1-bit coverage	Embeddings, attention projections, MLP projections, LM head
License	Apache 2.0

Quantization Format: Q1_0

Each weight is a single bit: 0 maps to −scale, 1 maps to +scale. Every group of 128 weights shares one FP16 scale factor.

Effective bits per weight: 1.125 (1 sign bit + 16-bit scale amortized over 128 weights).
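The mapping described above can be sketched in a few lines of Python. This illustrates the arithmetic only, not the GGUF block layout or the fork's kernels: the helper names are hypothetical, and taking the scale as the mean absolute weight of the group is an assumption, since the text does not say how scales are fitted.

```python
import struct

GROUP_SIZE = 128  # weights that share one FP16 scale (the "g128" in the format name)


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_group(weights):
    """Quantize one group to sign bits plus a shared FP16 scale.

    Assumption: the scale is the mean absolute weight of the group.
    """
    if len(weights) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} weights, got {len(weights)}")
    scale = to_fp16(sum(abs(w) for w in weights) / GROUP_SIZE)
    bits = [1 if w >= 0 else 0 for w in weights]
    return bits, scale


def dequantize_group(bits, scale):
    """Map bit 0 to -scale and bit 1 to +scale."""
    return [scale if b else -scale for b in bits]


def effective_bits_per_weight(group_size=GROUP_SIZE, scale_bits=16):
    """One sign bit per weight plus the scale amortized over the group."""
    return 1 + scale_bits / group_size


weights = [0.5, -0.25] * (GROUP_SIZE // 2)
bits, scale = quantize_group(weights)
restored = dequantize_group(bits, scale)
print(scale, restored[:2], effective_bits_per_weight())
```

Every restored weight has the same magnitude within a group; only the sign survives per weight, which is why the per-group scale carries all of the magnitude information.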

Memory Requirement

Parameter memory only (weights and scales loaded into memory):

Format	Size	Reduction	Ratio
FP16	16.38 GB	—	1.0x
GGUF Q1_0	1.15 GB	93.0%	14.2x
MLX 1-bit g128	1.28 GB	92.2%	12.8x

The GGUF file on disk is 1.16 GB (~6.6 MB larger) because the format embeds the tokenizer, chat template, and model metadata alongside the weights.
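The FP16 and GGUF Q1_0 rows of the table follow from the parameter count and the bits per weight. A quick check, assuming 1 GB = 10^9 bytes and the rounded 8.19B parameter count from the Model Overview table (the function name is hypothetical):

```python
PARAMS = 8.19e9  # parameter count from the Model Overview table (rounded)


def param_memory_gb(params: float, bits_per_weight: float) -> float:
    """Parameter memory in GB, assuming 1 GB = 1e9 bytes."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = param_memory_gb(PARAMS, 16)
q1_0_gb = param_memory_gb(PARAMS, 1.125)  # 1 sign bit + 16/128 scale bits
ratio = fp16_gb / q1_0_gb
reduction_pct = 100 * (1 - q1_0_gb / fp16_gb)

print(f"FP16: {fp16_gb:.2f} GB, Q1_0: {q1_0_gb:.2f} GB")
print(f"ratio: {ratio:.1f}x, reduction: {reduction_pct:.1f}%")
```

This covers weights and scales only; KV cache and activation memory come on top and depend on context length and batch size.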

Best Practices

Generation Parameters

Parameter	Default	Suggested range
Temperature	0.5	0.5 -- 0.7
Top-k	20	20 -- 40
Top-p	0.9	0.85 -- 0.95
Repetition penalty	1.0
Presence penalty	0.0

System Prompt

You can use a simple system prompt such as:

You are a helpful assistant
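
One way to keep these settings in a single place is a small helper like the sketch below. The helper names are hypothetical; the keyword names (`temperature`, `top_k`, `top_p`, `repeat_penalty`, `presence_penalty`) were chosen to match the parameters of llama-cpp-python's `create_chat_completion`, with "Repetition penalty" mapped to `repeat_penalty`.

```python
# Defaults from the Generation Parameters table.
DEFAULT_SAMPLING = {
    "temperature": 0.5,
    "top_k": 20,
    "top_p": 0.9,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
}

# Suggested ranges from the same table (inclusive).
SUGGESTED_RANGES = {
    "temperature": (0.5, 0.7),
    "top_k": (20, 40),
    "top_p": (0.85, 0.95),
}


def sampling_kwargs(**overrides):
    """Return the defaults with any overrides applied."""
    unknown = set(overrides) - set(DEFAULT_SAMPLING)
    if unknown:
        raise ValueError(f"unknown sampling parameters: {sorted(unknown)}")
    return {**DEFAULT_SAMPLING, **overrides}


def outside_suggested_range(params):
    """List the parameter names whose values fall outside the suggested range."""
    return [
        name
        for name, (low, high) in SUGGESTED_RANGES.items()
        if not low <= params[name] <= high
    ]


def build_messages(user_prompt, system_prompt="You are a helpful assistant"):
    """Build a chat message list using the simple system prompt from the text."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


params = sampling_kwargs(temperature=0.6, top_p=0.99)
print(params)
print(outside_suggested_range(params))
```

With a loaded `llm` object, these could be passed as `llm.create_chat_completion(messages=build_messages("..."), **sampling_kwargs())`.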

Quickstart

llama.cpp (CUDA)

# Clone the PrismML fork of llama.cpp (includes Q1_0 kernels)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# Run inference
./build/bin/llama-cli \
    -m Bonsai-8B-Q1_0.gguf \
    -p "Explain quantum computing in simple terms." \
    -n 256 \
    --temp 0.5 \
    --top-p 0.85 \
    --top-k 20 \
    -ngl 99

llama.cpp (Metal / macOS)

# Clone the PrismML fork of llama.cpp (includes Q1_0 kernels)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build with Metal support (default on macOS)
cmake -B build && cmake --build build -j

# Run inference
./build/bin/llama-cli \
    -m Bonsai-8B-Q1_0.gguf \
    -p "Explain quantum computing in simple terms." \
    -n 256 \
    --temp 0.5 \
    --top-p 0.85 \
    --top-k 20 \
    -ngl 99

llama.cpp Server

./build/bin/llama-server \
    -m Bonsai-8B-Q1_0.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99

Open the web UI at http://127.0.0.1:8080, or see our llama.cpp fork for more examples.
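
Besides the web UI, the server can be called programmatically. The following is a minimal client sketch using only the Python standard library. It assumes the server started above is reachable at 127.0.0.1:8080 and exposes the OpenAI-compatible `/v1/chat/completions` route; the function names are hypothetical, and omitting the `model` field is an assumption based on the server having a single model loaded.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # host and port from the llama-server command


def build_request(prompt, base_url=BASE_URL):
    """Build a POST request for the server's OpenAI-compatible chat endpoint."""
    payload = {
        # Assumption: no "model" field is sent because the server has one model loaded.
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url=BASE_URL, timeout=120):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, base_url), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_request("Explain quantum computing in simple terms.")
print(request.get_method(), request.full_url)
# With the server running: print(chat("Explain quantum computing in simple terms."))
```

Building the request separately from sending it makes the payload easy to inspect without a live server.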

Cross-Platform Throughput

Platform	Backend	TG128 (tok/s)	FP16 TG (tok/s)	TG vs FP16	PP512 (tok/s)	FP16 PP512 (tok/s)
RTX 4090	llama.cpp CUDA	368	59	6.2x	11,809	10,453
RTX L40S	llama.cpp CUDA	327	52	6.3x	9,592	8,325
RTX 3060 Laptop	llama.cpp CUDA	81	3.5¹	23x¹	1,871	94¹
M4 Pro 48 GB	llama.cpp Metal	85	16	5.4x	498	490
Samsung S25 Ultra	llama.cpp OpenCL	19.6	—	—	30.4	—

¹ The FP16 model fits only partially in this GPU's 6 GB of VRAM; the 1-bit model fits entirely.
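
To make the token-generation columns concrete, the sketch below recomputes the speedup and a rough decode time for a 256-token reply. It assumes a steady decode rate and ignores prompt processing. The function names are hypothetical, and because the table inputs are rounded, a recomputed ratio can differ from the table in the last digit (the M4 Pro row gives 5.3x here against 5.4x in the table).

```python
# (1-bit TG128 tok/s, FP16 TG tok/s), copied from the throughput table.
TOKEN_GENERATION = {
    "RTX 4090": (368, 59),
    "RTX L40S": (327, 52),
    "RTX 3060 Laptop": (81, 3.5),
    "M4 Pro 48 GB": (85, 16),
}


def speedup(platform: str) -> float:
    """Ratio of 1-bit to FP16 token-generation throughput."""
    one_bit, fp16 = TOKEN_GENERATION[platform]
    return one_bit / fp16


def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to generate `tokens` tokens at a constant decode rate."""
    return tokens / tok_per_s


for platform, (one_bit, fp16) in TOKEN_GENERATION.items():
    print(
        f"{platform}: {speedup(platform):.1f}x, 256 tokens in "
        f"{decode_seconds(256, one_bit):.2f} s vs {decode_seconds(256, fp16):.2f} s"
    )
```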


Energy Efficiency

Platform	Bonsai E_tg (mWh/tok)	Baseline E_tg	Advantage
RTX 4090 (CUDA)	0.276	1.134 (FP16)	4.1x
Mac M4 Pro (Metal)	0.091	0.471 (FP16)	5.1x


Benchmarks

Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on NVIDIA H100 under identical infrastructure, generation parameters, and scoring. All models are in the 6B–9B parameter range.

Model	Company	Size	Avg	MMLU-R	MuSR	GSM8K	HE+	IFEval	BFCL
Qwen 3 8B	Alibaba	16 GB	79.3	83	55	93	82.3	84.2	81
RNJ 8B	EssentialAI	16 GB	73.1	75.5	50.4	93.7	84.2	73.8	61.1
Mistral3 8B	Mistral	16 GB	71.0	73.9	53.8	87.2	67.4	75.4	45.4
Olmo 3 7B	Allen Inst	14 GB	70.9	72	56.1	92.5	79.3	37.1	38.4
1-bit Bonsai 8B	PrismML	1.15 GB	70.5	65.7	50	88	73.8	79.8	65.7
LFM2 8B	LiquidAI	16 GB	69.6	72.7	49.5	90.1	81	82.2	62.0
Llama 3.1 8B	Meta	16 GB	67.1	72.9	51.3	87.9	75	51.5	—
GLM v6 9B	ZhipuAI	16 GB	65.7	61.9	43.2	93.4	78.7	69.3	21.9
Hermes 8B	Nous Research	16 GB	65.4	67.4	52.2	82.9	51.2	65	73.5
Trinity Nano 6B	Arcee	12 GB	61.2	68.8	52.6	81.1	54	50	62.5
Marin 8B	Stanford CRFM	16 GB	56.6	64.8	42.6	86.4	51	50	—
R1-D 7B	DeepSeek	14 GB	55.1	62.5	29.1	92.7	81.7	48.8	15.4

Despite being 1/14th the size, 1-bit Bonsai 8B is competitive with leading full-precision 8B instruct models.

Intelligence Density

Intelligence density captures the ratio of a model's capability to its deployed size:

alpha = -ln(1 - score/100) / size_GB

Model	Size	Intelligence Density (1/GB)
1-bit Bonsai 8B	1.15 GB	1.062
Qwen 3 8B	16 GB	0.098
Llama 3.1 8B	16 GB	0.074
Mistral3 8B	16 GB	0.077

Bonsai 8B achieves 10.8x higher intelligence density than full-precision Qwen 3 8B.
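
The formula is easy to evaluate directly. The sketch below assumes `score` is the Avg column of the Benchmarks table and `size_GB` is the Size column, which reproduces the listed densities for the three rows used here; the function name is hypothetical.

```python
import math


def intelligence_density(score: float, size_gb: float) -> float:
    """alpha = -ln(1 - score/100) / size_GB, with score on a 0-100 scale."""
    if not 0 <= score < 100:
        raise ValueError("score must be in [0, 100)")
    return -math.log(1 - score / 100) / size_gb


# (average benchmark score, size in GB) from the Benchmarks table.
MODELS = {
    "1-bit Bonsai 8B": (70.5, 1.15),
    "Qwen 3 8B": (79.3, 16),
    "Mistral3 8B": (71.0, 16),
}

density = {name: intelligence_density(*row) for name, row in MODELS.items()}
for name, alpha in density.items():
    print(f"{name}: {alpha:.3f} per GB")

ratio = density["1-bit Bonsai 8B"] / density["Qwen 3 8B"]
print(f"Bonsai vs Qwen: {ratio:.1f}x")
```

The logarithm makes each additional point of score count for more as the score approaches 100, so the metric is not a plain score-per-gigabyte ratio.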


Use Cases

On-device assistants: interactive AI on laptops and phones with low latency
Mobile deployment: runs on a wide variety of phones due to low memory footprint
Edge robotics and autonomy: compact deployment on devices with thermal, memory, or connectivity constraints
Cost-sensitive GPU serving: higher throughput and lower energy per token on RTX-class and datacenter GPUs
Enterprise and private inference: local or controlled-environment inference for data residency requirements

Limitations

No native 1-bit hardware exists yet — current gains are software-kernel optimizations on general-purpose hardware
Mobile power measurement is estimated rather than hardware-metered
The full-precision benchmark frontier continues to advance; the 1-bit methodology is architecture-agnostic and will be applied to newer bases

Citation

If you use 1-bit Bonsai 8B, please cite:

@techreport{bonsai8b,
    title   = {1-bit Bonsai 8B: End-to-End 1-bit Language Model Deployment
               Across Apple, GPU, and Mobile Runtimes},
    author  = {Prism ML},
    year    = {2026},
    month   = {March},
    url     = {https://prismml.com}
}