DeepSeek-V4-Flash REAP25 REAPDataset10K-Balanced DS4 GGUF

Experimental DS4 compact GGUF made by applying 25% REAP expert pruning to a DeepSeek-V4-Flash DS4 GGUF, calibrated on 10,000 language-balanced prompts drawn from 8 domains of the REAP dataset.

Model file:

DeepSeek-V4-Flash-REAP25-REAPDataset10K-Balanced-DS4-compact-IQ2XXS.gguf

Bundled runtime:

ds4_reap_runtime/

Expert observation results:

reap_dataset_10k_balanced_seed42_reap25_experts.csv

Compatibility

This model needs the bundled REAP-aware DS4 runtime, or another DS4 build that supports ds4-compact-v1.

It is not expected to run with stock DS4, llama.cpp, Ollama, LM Studio, or other generic GGUF loaders. The routed expert tensors are physically compacted, so the runtime must read the REAP metadata and route into compact expert ids.

Expected DS4 runtime line:

REAP runtime metadata enabled: hash_preserved=3 router_masked=40 moe_disabled=0 layout=ds4-compact-v1

How It Was Made

Source GGUF

DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf

Calibration Dataset

Category Source Dataset Samples EN KO
mixture/code open-r1/codeforces-cots 2,000 1,000 1,000
mixture/math open-r1/OpenR1-Math-220k 2,000 1,000 1,000
mixture/science nvidia/Llama-Nemotron-Post-Training-Dataset 2,000 1,000 1,000
xlam/function-calling Salesforce/xlam-function-calling-60k 2,000 1,000 1,000
SWE/tool SWE-bench style (tool-use split) 500 250 250
SWE/xml SWE-bench style (XML format split) 500 250 250
SWE/ticks SWE-bench style (tick-format split) 500 250 250
SWE/train SWE-bench style (training split) 500 250 250
Total 10,000 5,000 5,000
  • Sampling: random with seed 42
  • Language balance: --balance-language enforced 50% English / 50% Korean per source category
  • Total token coverage: 27,592,731 observed prompt tokens
  • Observed expert route selections: 7,118,924,598

Observation

  • Seed: 42
  • Context length: 4,096
  • Chunk size: 100 prompts per chunk (100 chunks total, resumable)
  • Score metric: activation_energy_sum2

Pruning

  • Layers 0–2: preserved, hash-routed
  • Layers 3–42: REAP-pruned
  • Compression ratio: 0.25
  • Experts per pruned layer: 256 → 192 (64 pruned per layer)
  • Top-k remains 6
  • Layout: ds4-compact-v1
  • Expert tensor bytes are copied directly, preserving source quantization

Size

source file: 80.76 GiB / 86.72 GB
REAP25 file: 63.87 GiB / 68.58 GB

Local Metal mapping at --ctx 512:

source mapped: 82697.67 MiB
REAP25 mapped: 65397.66 MiB
saved: ~17300 MiB, about 16.9 GiB

Expert CSV

reap_dataset_10k_balanced_seed42_reap25_experts.csv contains per-expert statistics for all 43 MoE layers. Columns:

Column Description
layer Layer index (0–42)
expert_id Original expert ID in source GGUF
new_expert_id Compacted expert ID after pruning (-1 if pruned)
activation_policy hash_preserved (layers 0–2) / router_mask_pruned
kept Whether this expert is kept in the pruned GGUF
pruned Whether this expert was removed
total_tokens Total observed tokens (shared per layer)
expert_frequency How many times this expert was selected
selection_rate_per_token expert_frequency / total_tokens
selection_share Fraction of all expert selections for this layer
reap Composite REAP score (activation_energy_sum2)
gate_up_energy Gate/up projection energy contribution
down_energy Down projection energy contribution

Run With Bundled Runtime

The Metal runtime loads shader source files from metal/*.metal, so run from inside the bundled runtime directory:

cd ds4_reap_runtime

./ds4 \
  -m ../DeepSeek-V4-Flash-REAP25-REAPDataset10K-Balanced-DS4-compact-IQ2XXS.gguf \
  --ctx 512 --nothink --temp 0 -n 64 \
  -p 'Hello!'

For OpenAI-compatible local serving:

cd ds4_reap_runtime

./ds4-server \
  -m ../DeepSeek-V4-Flash-REAP25-REAPDataset10K-Balanced-DS4-compact-IQ2XXS.gguf \
  --ctx 32768 --tokens 1024 \
  --host 127.0.0.1 --port 8000

Comparison with LCB50 Model

Property REAP25-LCB50 REAP25-REAPDataset10K-Balanced (this)
Calibration dataset LiveCodeBench REAP dataset (8 domains)
Sample count 50 10,000
Language balance English only 50% EN / 50% KO
Domain coverage Competitive coding Code, Math, Science, Function-calling, SWE
Prompt tokens observed 26,386 27,592,731
Expert route selections 6,807,588 7,118,924,598
Compression REAP25 (256→192 experts) REAP25 (256→192 experts)
Output size 63.87 GiB 63.87 GiB

Notes

This is a broader calibration artifact than the LCB50 model. The 10K balanced dataset covers coding, math, science, function-calling, and software engineering domains, with equal Korean and English coverage, providing more representative expert activation statistics.

The REAP pruning removes the 64 least-activated routed experts per layer (layers 3–42) and physically compacts the remaining 192 into a smaller GGUF, so the runtime must read the REAP routing metadata rather than using the original expert slot layout.

Downloads last month
123
GGUF
Model size
220B params
Architecture
deepseek4
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support