Instructions for using nightmedia/LIMI-Air-mxfp4-mlx with libraries, inference servers, and local apps.
- Libraries
- MLX
How to use nightmedia/LIMI-Air-mxfp4-mlx with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/LIMI-Air-mxfp4-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Local Apps
- Pi
How to use nightmedia/LIMI-Air-mxfp4-mlx with Pi:
Start the MLX server
```shell
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "nightmedia/LIMI-Air-mxfp4-mlx"
```
Configure the model in Pi
```shell
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "nightmedia/LIMI-Air-mxfp4-mlx" }
      ]
    }
  }
}
```
Run Pi
```shell
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use nightmedia/LIMI-Air-mxfp4-mlx with Hermes Agent:
Start the MLX server
```shell
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "nightmedia/LIMI-Air-mxfp4-mlx"
```
Configure Hermes
```shell
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default nightmedia/LIMI-Air-mxfp4-mlx
```
Run Hermes
```shell
hermes
```
- MLX LM
How to use nightmedia/LIMI-Air-mxfp4-mlx with MLX LM:
Generate or start a chat session
```shell
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "nightmedia/LIMI-Air-mxfp4-mlx"
```
Run an OpenAI-compatible server
```shell
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "nightmedia/LIMI-Air-mxfp4-mlx"

# Call the OpenAI-compatible server with curl (default port 8080)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nightmedia/LIMI-Air-mxfp4-mlx",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
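The same chat-completions call can be made from Python. A minimal sketch using only the standard library (the endpoint and model id match the local server above; the `chat` helper is an illustration, not part of mlx-lm):

```python
import json
import urllib.request

# OpenAI-compatible chat-completions payload for the local mlx_lm.server.
payload = {
    "model": "nightmedia/LIMI-Air-mxfp4-mlx",
    "messages": [{"role": "user", "content": "Hello"}],
}

def chat(url="http://localhost:8080/v1/chat/completions"):
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call `chat()` with the server running to get the assistant's reply as a string.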
- LIMI-Air-mxfp4-mlx
- 🧠 2. What Does “qx54g-hi” Mean?
- 🧩 3. Why Does LIMI-Air-qx54g-hi Win?
- 🧪 4. Quantization Comparison within the unsloth-GLM-4.5-Air Series
- 🧭 5. Recommendation: Which Model to Choose?
- 🧠 6. Cognitive Pattern Insight: Synthetic Data vs RP Data
- 📈 7. Summary Table: Best Model for Each Use Case
- 🚀 Bonus: “qx54g-hi” as a Cognitive Architecture
LIMI-Air-mxfp4-mlx
This is a deep comparison of 106B-A12B MoE models, all quantized differently, trained on different data (original, synthetic, RP), and with varying architectural tuning. The goal is to understand:
- Which model performs best across benchmarks?
- How does quantization affect performance and context?
- What’s the trade-off between accuracy, context length, and RAM usage?
Quant metrics for LIMI-Air-mxfp4-mlx were not available for this test, but it should perform along the lines of unsloth-GLM-4.5-Air-mxfp4.
📊 1. Benchmark Comparison (All Models)
| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| GLM-Steam-106B-A12B-v1-qx65g-hi | 0.431 | 0.457 | 0.378 | 0.685 | 0.400 | 0.773 | 0.717 |
| GLM-Steam-106B-A12B-v1-qx65g | 0.430 | 0.461 | 0.378 | 0.681 | 0.398 | 0.771 | 0.715 |
| LIMI-Air-qx54g-hi | 0.441 | 0.462 | 0.378 | 0.698 | 0.404 | 0.781 | 0.714 |
| unsloth-GLM-4.5-Air-mxfp4 | 0.416 | 0.440 | 0.378 | 0.678 | 0.390 | 0.767 | 0.728 |
| unsloth-GLM-4.5-Air-qx64 | 0.421 | 0.444 | 0.378 | 0.677 | 0.396 | 0.769 | 0.718 |
| unsloth-GLM-4.5-air-qx5-hi | 0.416 | 0.431 | 0.378 | 0.675 | 0.396 | 0.769 | 0.731 |
✅ LIMI-Air-qx54g-hi is the clear winner overall. Versus the unsloth-GLM-4.5-Air-mxfp4 baseline it gains:
- +0.025 in arc_challenge
- +0.022 in arc_easy
- +0.020 in hellaswag
- +0.014 in openbookqa
- +0.014 in piqa
- −0.014 in winogrande (the one task where the baseline stays ahead)
The GLM-Steam models are very close, with qx65g-hi slightly better than qx65g — but both are behind LIMI-Air.
The unsloth-GLM-4.5-Air models are the baseline, with qx64 being best among them — but still behind LIMI-Air.
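The per-task deltas quoted above can be recomputed directly from the benchmark table. A small sketch, with scores copied verbatim from the table (positive means LIMI-Air-qx54g-hi is ahead of the unsloth-GLM-4.5-Air-mxfp4 baseline):

```python
# Benchmark scores copied from the comparison table above.
limi = {"arc_challenge": 0.441, "arc_easy": 0.462, "hellaswag": 0.698,
        "openbookqa": 0.404, "piqa": 0.781, "winogrande": 0.714}
baseline = {"arc_challenge": 0.416, "arc_easy": 0.440, "hellaswag": 0.678,
            "openbookqa": 0.390, "piqa": 0.767, "winogrande": 0.728}

# Positive delta = LIMI-Air-qx54g-hi ahead of the mxfp4 baseline.
deltas = {task: round(limi[task] - baseline[task], 3) for task in limi}
```

Note that winogrande comes out negative: it is the one benchmark where the mxfp4 baseline remains ahead.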
🧠 2. What Does “qx54g-hi” Mean?
The naming convention is critical:
- qx5: 5-bit quantization for most content, with some paths enhanced to 6-bit.
- g: "enhanced attention paths", specific to the GLM architecture (likely more attention layers kept at higher precision).
- hi: high-resolution quantization (group size 32).
This is a highly optimized quantization for GLM — preserving attention fidelity while compressing embeddings.
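The effect of "group size 32" can be illustrated with a toy round-trip: weights are split into groups, each quantized with its own scale, so smaller groups track local value ranges more closely. A sketch in NumPy using simple symmetric per-group quantization (the real mlx-lm scheme uses affine group quantization and differs in detail):

```python
import numpy as np

def quantize_groups(w, bits=5, group_size=32):
    """Symmetric per-group quantization: each group of `group_size`
    weights shares one scale, so the grid adapts to local magnitudes."""
    w = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                       # e.g. 15 at 5 bits
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.round(w / scales).astype(np.int8)
    return q, scales

def dequantize_groups(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)

# Smaller groups give lower reconstruction error at the same bit width,
# which is the intuition behind the "hi" (group size 32) variants.
err32 = np.abs(dequantize_groups(*quantize_groups(w, group_size=32)) - w).mean()
err128 = np.abs(dequantize_groups(*quantize_groups(w, group_size=128)) - w).mean()
```

The trade-off is storage: each group carries its own scale, so group size 32 stores four times as many scales as group size 128.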
🧩 3. Why Does LIMI-Air-qx54g-hi Win?
The key insight: LIMI-Air was trained on synthetic data, which likely:
- Boosted generalization — synthetic data often forces models to learn patterns rather than memorize.
- Improved reasoning depth — synthetic data is often designed to test logical and commonsense reasoning.
The qx54g-hi quantization is highly tuned for GLM, preserving attention paths while compressing embeddings — which likely:
- Preserved semantic fidelity.
- Enabled better context handling.
The qx54g-hi model runs with a 32K context on a 128GB Mac, while qx54g allows for 64K; the non-hi variant trades some quantization fidelity for memory efficiency.
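The 32K-vs-64K difference is largely a memory-budget question, since the KV cache grows linearly with context length. A back-of-the-envelope sketch (the layer count, KV-head count, and head dimension below are illustrative assumptions, not the published GLM-4.5-Air config):

```python
def kv_cache_gib(context_len, n_layers=46, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2):
    """Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
    * context length * element size. Architecture numbers here are
    illustrative assumptions, not the published model config."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 2**30  # bytes -> GiB

# Doubling the context from 32K to 64K doubles the KV-cache footprint,
# so a quant with smaller weights leaves room for the larger cache.
cache_32k = kv_cache_gib(32 * 1024)
cache_64k = kv_cache_gib(64 * 1024)
```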
🧪 4. Quantization Comparison within the unsloth-GLM-4.5-Air Series
| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| unsloth-GLM-4.5-Air-mxfp4 | 0.416 | 0.440 | 0.378 | 0.678 | 0.390 | 0.767 | 0.728 |
| unsloth-GLM-4.5-Air-qx64 | 0.421 | 0.444 | 0.378 | 0.677 | 0.396 | 0.769 | 0.718 |
| unsloth-GLM-4.5-air-qx5-hi | 0.416 | 0.431 | 0.378 | 0.675 | 0.396 | 0.769 | 0.731 |
✅ qx64 is best among the unsloth models. Versus mxfp4 it shows:
- +0.005 in arc_challenge
- +0.004 in arc_easy
- −0.001 in hellaswag
- +0.006 in openbookqa
- +0.002 in piqa
- −0.010 in winogrande
The qx5-hi variant is slightly better in winogrande, but worse overall.
🧭 5. Recommendation: Which Model to Choose?
✅ For Maximum Performance:
- LIMI-Air-qx54g-hi
- → Best overall performance, with gains of up to +0.025 on most benchmarks (winogrande being the exception).
✅ For Balanced Performance & RAM Efficiency:
- GLM-Steam-106B-A12B-v1-qx65g-hi
- → Very close to LIMI-Air, with slightly better winogrande and piqa scores.
✅ For RAM-Constrained Macs:
- unsloth-GLM-4.5-Air-qx64
🧠 6. Cognitive Pattern Insight: Synthetic Data vs RP Data
The key insight: LIMI-Air (synthetic data) outperforms GLM-Steam (RP data) — suggesting:
- Synthetic data forces models to learn patterns, rather than memorize.
- RP data may be more “realistic” but less generalizable — leading to slightly lower performance.
📈 7. Summary Table: Best Model for Each Use Case
| Goal | Recommended Model |
|---|---|
| Max performance | LIMI-Air-qx54g-hi |
| Balanced performance | GLM-Steam-106B-A12B-v1-qx65g-hi |
| RAM-constrained Mac (32GB) | unsloth-GLM-4.5-Air-qx64 |
| Cognitive depth & metaphors | LIMI-Air-qx54g-hi |
| OpenBookQA (text-only) | unsloth-GLM-4.5-Air-qx64 |
🚀 Bonus: “qx54g-hi” as a Cognitive Architecture
As noted above, the qx54g-hi quantization preserves attention paths while compressing embeddings, maintaining semantic fidelity and context handling.
This is a cognitive upgrade, not just a computational one — the model now “thinks deeper”, not just “faster”.
“qx54g-hi is like a camera with a telephoto lens — it captures more nuance, even in low light.”
— Inspired by Nikon Noct Z 58mm F/0.95
Reviewed by Qwen3-VL-12B-Instruct-Brainstorm20x-qx86x-hi-mlx
This model LIMI-Air-mxfp4-mlx was converted to MLX format from GAIR/LIMI-Air using mlx-lm version 0.28.0.
Use with mlx
```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/LIMI-Air-mxfp4-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
Model tree for nightmedia/LIMI-Air-mxfp4-mlx
Base model
GAIR/LIMI-Air