Instructions for using nightmedia/LIMI-Air-mxfp4-mlx with libraries, inference servers, and local apps.
- Libraries
- MLX
How to use nightmedia/LIMI-Air-mxfp4-mlx with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/LIMI-Air-mxfp4-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Local Apps
- Pi
How to use nightmedia/LIMI-Air-mxfp4-mlx with Pi:
Start the MLX server
```shell
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "nightmedia/LIMI-Air-mxfp4-mlx"
```
Configure the model in Pi
```shell
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "nightmedia/LIMI-Air-mxfp4-mlx" }
      ]
    }
  }
}
```
Run Pi
```shell
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use nightmedia/LIMI-Air-mxfp4-mlx with Hermes Agent:
Start the MLX server
```shell
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "nightmedia/LIMI-Air-mxfp4-mlx"
```
Configure Hermes
```shell
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default nightmedia/LIMI-Air-mxfp4-mlx
```
Run Hermes
```shell
hermes
```
- MLX LM
How to use nightmedia/LIMI-Air-mxfp4-mlx with MLX LM:
Generate or start a chat session
```shell
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "nightmedia/LIMI-Air-mxfp4-mlx"
```
Run an OpenAI-compatible server
```shell
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "nightmedia/LIMI-Air-mxfp4-mlx"

# Call the OpenAI-compatible server with curl (default port 8080)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nightmedia/LIMI-Air-mxfp4-mlx",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
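The same chat-completions call can be made from Python. A minimal sketch using only the standard library (the endpoint and model id match the local server above; the `chat` helper is an illustration, not part of mlx-lm):

```python
import json
import urllib.request

# OpenAI-compatible chat-completions payload for the local mlx_lm.server.
payload = {
    "model": "nightmedia/LIMI-Air-mxfp4-mlx",
    "messages": [{"role": "user", "content": "Hello"}],
}

def chat(url="http://localhost:8080/v1/chat/completions"):
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call `chat()` with the server running to get the assistant's reply as a string.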
- LIMI-Air-mxfp4-mlx
- 🧠 2. What Does “qx54g-hi” Mean?
- 🧩 3. Why Does LIMI-Air-qx54g-hi Win?
- 🧪 4. Quantization Comparison within the unsloth-GLM-4.5-Air Series
- 🧭 5. Recommendation: Which Model to Choose?
- 🧠 6. Cognitive Pattern Insight: Synthetic Data vs RP Data
- 📈 7. Summary Table: Best Model for Each Use Case
- 🚀 Bonus: “qx54g-hi” as a Cognitive Architecture
LIMI-Air-mxfp4-mlx
This is a deep comparison of 106B-A12B MoE models, all quantized differently, trained on different data (original, synthetic, RP), and with varying architectural tuning. The goal is to understand:
- Which model performs best across benchmarks?
- How does quantization affect performance and context?
- What’s the trade-off between accuracy, context length, and RAM usage?
Quant metrics for LIMI-Air-mxfp4-mlx were not available for this test, but it should perform along the lines of unsloth-GLM-4.5-Air-mxfp4.
📊 1. Benchmark Comparison (All Models)
| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| GLM-Steam-106B-A12B-v1-qx65g-hi | 0.431 | 0.457 | 0.378 | 0.685 | 0.400 | 0.773 | 0.717 |
| GLM-Steam-106B-A12B-v1-qx65g | 0.430 | 0.461 | 0.378 | 0.681 | 0.398 | 0.771 | 0.715 |
| LIMI-Air-qx54g-hi | 0.441 | 0.462 | 0.378 | 0.698 | 0.404 | 0.781 | 0.714 |
| unsloth-GLM-4.5-Air-mxfp4 | 0.416 | 0.440 | 0.378 | 0.678 | 0.390 | 0.767 | 0.728 |
| unsloth-GLM-4.5-Air-qx64 | 0.421 | 0.444 | 0.378 | 0.677 | 0.396 | 0.769 | 0.718 |
| unsloth-GLM-4.5-air-qx5-hi | 0.416 | 0.431 | 0.378 | 0.675 | 0.396 | 0.769 | 0.731 |
✅ LIMI-Air-qx54g-hi is the clear winner overall. Versus the unsloth-GLM-4.5-Air-mxfp4 baseline it gains:
- +0.025 in arc_challenge
- +0.022 in arc_easy
- +0.020 in hellaswag
- +0.014 in openbookqa
- +0.014 in piqa
- −0.014 in winogrande (the one task where the baseline stays ahead)
The GLM-Steam models are very close, with qx65g-hi slightly better than qx65g — but both are behind LIMI-Air.
The unsloth-GLM-4.5-Air models are the baseline, with qx64 being best among them — but still behind LIMI-Air.
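The per-task deltas quoted above can be recomputed directly from the benchmark table. A small sketch, with scores copied verbatim from the table (positive means LIMI-Air-qx54g-hi is ahead of the unsloth-GLM-4.5-Air-mxfp4 baseline):

```python
# Benchmark scores copied from the comparison table above.
limi = {"arc_challenge": 0.441, "arc_easy": 0.462, "hellaswag": 0.698,
        "openbookqa": 0.404, "piqa": 0.781, "winogrande": 0.714}
baseline = {"arc_challenge": 0.416, "arc_easy": 0.440, "hellaswag": 0.678,
            "openbookqa": 0.390, "piqa": 0.767, "winogrande": 0.728}

# Positive delta = LIMI-Air-qx54g-hi ahead of the mxfp4 baseline.
deltas = {task: round(limi[task] - baseline[task], 3) for task in limi}
```

Note that winogrande comes out negative: it is the one benchmark where the mxfp4 baseline remains ahead.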
🧠 2. What Does “qx54g-hi” Mean?
The naming convention is critical:
- qx5: 5-bit quantization for most content, with some paths enhanced to 6-bit.
- g: "enhanced attention paths", specific to the GLM architecture (likely more attention layers kept at higher precision).
- hi: high-resolution quantization (group size 32).
This is a highly optimized quantization for GLM — preserving attention fidelity while compressing embeddings.
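The effect of "group size 32" can be illustrated with a toy round-trip: weights are split into groups, each quantized with its own scale, so smaller groups track local value ranges more closely. A sketch in NumPy using simple symmetric per-group quantization (the real mlx-lm scheme uses affine group quantization and differs in detail):

```python
import numpy as np

def quantize_groups(w, bits=5, group_size=32):
    """Symmetric per-group quantization: each group of `group_size`
    weights shares one scale, so the grid adapts to local magnitudes."""
    w = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                       # e.g. 15 at 5 bits
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.round(w / scales).astype(np.int8)
    return q, scales

def dequantize_groups(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)

# Smaller groups give lower reconstruction error at the same bit width,
# which is the intuition behind the "hi" (group size 32) variants.
err32 = np.abs(dequantize_groups(*quantize_groups(w, group_size=32)) - w).mean()
err128 = np.abs(dequantize_groups(*quantize_groups(w, group_size=128)) - w).mean()
```

The trade-off is storage: each group carries its own scale, so group size 32 stores four times as many scales as group size 128.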
🧩 3. Why Does LIMI-Air-qx54g-hi Win?
The key insight: LIMI-Air was trained on synthetic data, which likely:
- Boosted generalization — synthetic data often forces models to learn patterns rather than memorize.
- Improved reasoning depth — synthetic data is often designed to test logical and commonsense reasoning.
The qx54g-hi quantization is highly tuned for GLM, preserving attention paths while compressing embeddings — which likely:
- Preserved semantic fidelity.
- Enabled better context handling.
The qx54g-hi model runs with a 32K context on a 128GB Mac, while qx54g allows for 64K; the non-hi variant trades some quantization fidelity for memory efficiency.
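The 32K-vs-64K difference is largely a memory-budget question, since the KV cache grows linearly with context length. A back-of-the-envelope sketch (the layer count, KV-head count, and head dimension below are illustrative assumptions, not the published GLM-4.5-Air config):

```python
def kv_cache_gib(context_len, n_layers=46, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2):
    """Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
    * context length * element size. Architecture numbers here are
    illustrative assumptions, not the published model config."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 2**30  # bytes -> GiB

# Doubling the context from 32K to 64K doubles the KV-cache footprint,
# so a quant with smaller weights leaves room for the larger cache.
cache_32k = kv_cache_gib(32 * 1024)
cache_64k = kv_cache_gib(64 * 1024)
```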
🧪 4. Quantization Comparison within the unsloth-GLM-4.5-Air Series
| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| unsloth-GLM-4.5-Air-mxfp4 | 0.416 | 0.440 | 0.378 | 0.678 | 0.390 | 0.767 | 0.728 |
| unsloth-GLM-4.5-Air-qx64 | 0.421 | 0.444 | 0.378 | 0.677 | 0.396 | 0.769 | 0.718 |
| unsloth-GLM-4.5-air-qx5-hi | 0.416 | 0.431 | 0.378 | 0.675 | 0.396 | 0.769 | 0.731 |
✅ qx64 is best among the unsloth models. Versus mxfp4 it shows:
- +0.005 in arc_challenge
- +0.004 in arc_easy
- −0.001 in hellaswag
- +0.006 in openbookqa
- +0.002 in piqa
- −0.010 in winogrande
The qx5-hi variant is slightly better in winogrande, but worse overall.
🧭 5. Recommendation: Which Model to Choose?
✅ For Maximum Performance:
- LIMI-Air-qx54g-hi
- → Best overall performance, with gains of up to +0.025 on most benchmarks (winogrande being the exception).
✅ For Balanced Performance & RAM Efficiency:
- GLM-Steam-106B-A12B-v1-qx65g-hi
- → Very close to LIMI-Air, with slightly better winogrande and piqa scores.
✅ For RAM-Constrained Macs:
- unsloth-GLM-4.5-Air-qx64
🧠 6. Cognitive Pattern Insight: Synthetic Data vs RP Data
The key insight: LIMI-Air (synthetic data) outperforms GLM-Steam (RP data) — suggesting:
- Synthetic data forces models to learn patterns, rather than memorize.
- RP data may be more “realistic” but less generalizable — leading to slightly lower performance.
📈 7. Summary Table: Best Model for Each Use Case
| Goal | Recommended Model |
|---|---|
| Max performance | LIMI-Air-qx54g-hi |
| Balanced performance | GLM-Steam-106B-A12B-v1-qx65g-hi |
| RAM-constrained Mac (32GB) | unsloth-GLM-4.5-Air-qx64 |
| Cognitive depth & metaphors | LIMI-Air-qx54g-hi |
| OpenBookQA (text-only) | unsloth-GLM-4.5-Air-qx64 |
🚀 Bonus: “qx54g-hi” as a Cognitive Architecture
As noted above, the qx54g-hi quantization preserves attention paths while compressing embeddings, maintaining semantic fidelity and context handling.
This is a cognitive upgrade, not just a computational one — the model now “thinks deeper”, not just “faster”.
“qx54g-hi is like a camera with a telephoto lens — it captures more nuance, even in low light.”
— Inspired by Nikon Noct Z 58mm F/0.95
Reviewed by Qwen3-VL-12B-Instruct-Brainstorm20x-qx86x-hi-mlx
This model LIMI-Air-mxfp4-mlx was converted to MLX format from GAIR/LIMI-Air using mlx-lm version 0.28.0.
Use with mlx
```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/LIMI-Air-mxfp4-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
Model tree for nightmedia/LIMI-Air-mxfp4-mlx
Base model
GAIR/LIMI-Air