DeepSeek-V4-Flash REAP25 REAPDataset10K-Balanced DS4 GGUF

Experimental DS4 compact GGUF made by applying 25% REAP expert pruning to a DeepSeek-V4-Flash DS4 GGUF, calibrated on 10,000 language-balanced prompts drawn from 8 domains of the REAP dataset.

Model file:

DeepSeek-V4-Flash-REAP25-REAPDataset10K-Balanced-DS4-compact-IQ2XXS.gguf

Bundled runtime:

ds4_reap_runtime/

Expert observation results:

reap_dataset_10k_balanced_seed42_reap25_experts.csv

Compatibility

This model needs the bundled REAP-aware DS4 runtime, or another DS4 build that supports ds4-compact-v1.

It is not expected to run with stock DS4, llama.cpp, Ollama, LM Studio, or other generic GGUF loaders. The routed expert tensors are physically compacted, so the runtime must read the REAP metadata and route into compact expert ids.
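The routing requirement above can be illustrated with a minimal sketch. It assumes the runtime holds, per layer, a table from original expert ids to compact ids (with -1 for pruned experts), masks pruned experts before top-k selection, and then remaps the winners. The function name and data layout are hypothetical and are not taken from the DS4 source.

```python
import math

def route_compact(router_logits, old_to_new, top_k=6):
    """Pick top_k surviving experts and return their compact ids.

    router_logits: one score per ORIGINAL expert id.
    old_to_new:    compact id per original id, -1 where the expert was pruned.
    """
    # Mask pruned experts so they can never win the top-k selection.
    masked = [
        score if old_to_new[old_id] >= 0 else -math.inf
        for old_id, score in enumerate(router_logits)
    ]
    # Rank original ids by masked score, highest first.
    ranked = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    # Translate the winners into indices of the physically compacted tensors.
    return [old_to_new[i] for i in ranked[:top_k]]

# Toy layer: 8 original experts, experts 1 and 5 pruned, top-2 routing.
old_to_new = [0, -1, 1, 2, 3, -1, 4, 5]
logits = [0.1, 9.0, 0.3, 2.0, 0.2, 8.0, 1.5, 0.0]
print(route_compact(logits, old_to_new, top_k=2))  # [2, 4]
```

A loader that skips the remapping step would index the compacted tensors with original ids, which is why generic GGUF loaders are not expected to work.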

Expected DS4 runtime line:

REAP runtime metadata enabled: hash_preserved=3 router_masked=40 moe_disabled=0 layout=ds4-compact-v1
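A launch script could check for this line before trusting the output. The sketch below parses the `key=value` fields and assumes the three counters are layer counts (3 hash-routed layers, 40 router-masked layers), which matches the layer ranges given in this card but is not confirmed by the runtime itself. `parse_reap_line` is a hypothetical helper.

```python
import re

def parse_reap_line(line):
    """Return the key=value fields of the REAP metadata log line as a dict."""
    prefix = "REAP runtime metadata enabled:"
    if not line.startswith(prefix):
        raise ValueError("not a REAP runtime metadata line")
    fields = dict(re.findall(r"(\w+)=(\S+)", line[len(prefix):]))
    # Convert numeric fields to int; keep the layout name as text.
    return {k: int(v) if v.isdigit() else v for k, v in fields.items()}

line = ("REAP runtime metadata enabled: hash_preserved=3 "
        "router_masked=40 moe_disabled=0 layout=ds4-compact-v1")
info = parse_reap_line(line)
assert info["layout"] == "ds4-compact-v1"
# Assumption: the counters are layer counts and should cover all 43 MoE layers.
assert info["hash_preserved"] + info["router_masked"] + info["moe_disabled"] == 43
print(info)
```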

How It Was Made

Source GGUF

DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf

Calibration Dataset

Category	Source Dataset	Samples	EN	KO
mixture/code	open-r1/codeforces-cots	2,000	1,000	1,000
mixture/math	open-r1/OpenR1-Math-220k	2,000	1,000	1,000
mixture/science	nvidia/Llama-Nemotron-Post-Training-Dataset	2,000	1,000	1,000
xlam/function-calling	Salesforce/xlam-function-calling-60k	2,000	1,000	1,000
SWE/tool	SWE-bench style (tool-use split)	500	250	250
SWE/xml	SWE-bench style (XML format split)	500	250	250
SWE/ticks	SWE-bench style (tick-format split)	500	250	250
SWE/train	SWE-bench style (training split)	500	250	250
Total		10,000	5,000	5,000

Sampling: random with seed 42
Language balance: --balance-language enforces a 50% English / 50% Korean split within each source category
Total token coverage: 27,592,731 observed prompt tokens
Observed expert route selections: 7,118,924,598
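The seeded, language-balanced draw described above can be sketched as follows. This is not the script used to build the calibration set. The record keys (`category`, `lang`) and the function name are hypothetical, and the toy pool stands in for the real source datasets.

```python
import random

def balanced_sample(pool, per_category, seed=42):
    """Draw per_category[c] records per category, split evenly between EN and KO.

    pool: list of dicts with hypothetical keys "category" and "lang".
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    picked = []
    for category, total in per_category.items():
        for lang in ("en", "ko"):
            candidates = [r for r in pool
                          if r["category"] == category and r["lang"] == lang]
            picked.extend(rng.sample(candidates, total // 2))
    return picked

# Toy pool: 20 records per (category, language) pair.
pool = [{"category": c, "lang": l, "id": i}
        for c in ("mixture/code", "SWE/tool")
        for l in ("en", "ko")
        for i in range(20)]
sample = balanced_sample(pool, {"mixture/code": 8, "SWE/tool": 2})
print(len(sample), sum(r["lang"] == "ko" for r in sample))  # 10 5
```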

Observation

Seed: 42
Context length: 4,096
Chunk size: 100 prompts per chunk (100 chunks total, resumable)
Score metric: activation_energy_sum2

Pruning

Layers 0–2: preserved, hash-routed
Layers 3–42: REAP-pruned
Compression ratio: 0.25 (fraction of routed experts removed per pruned layer)
Experts per pruned layer: 256 → 192 (64 pruned per layer)
Top-k remains 6
Layout: ds4-compact-v1
Expert tensor bytes are copied directly, preserving source quantization
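The per-layer bookkeeping implied by this list can be sketched as below: rank experts by score, drop the lowest 25%, and assign compact ids to the survivors. The scores here are random placeholders, and keeping survivors in their original relative order is an assumption about the ds4-compact-v1 layout, not something this card states.

```python
import random

def plan_layer(scores, prune_ratio=0.25):
    """Map original expert ids to compact ids, -1 for pruned experts.

    scores: one saliency score per original expert (higher = keep).
    """
    n_prune = int(len(scores) * prune_ratio)
    # Original ids ordered from lowest to highest score.
    by_score = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(by_score[:n_prune])
    old_to_new, next_id = [], 0
    for old_id in range(len(scores)):
        if old_id in pruned:
            old_to_new.append(-1)
        else:
            # Assumption: survivors keep their relative order in the compact layout.
            old_to_new.append(next_id)
            next_id += 1
    return old_to_new

rng = random.Random(0)
mapping = plan_layer([rng.random() for _ in range(256)])
print(mapping.count(-1), max(mapping) + 1)  # 64 192
```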

Size

source file: 80.76 GiB / 86.72 GB
REAP25 file: 63.87 GiB / 68.58 GB

Local Metal mapping at --ctx 512:

source mapped: 82697.67 MiB
REAP25 mapped: 65397.66 MiB
saved: ~17300 MiB, about 16.9 GiB
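The size figures above can be cross-checked with a few lines of arithmetic. The sketch only uses numbers quoted in this section.

```python
GIB = 2**30
MIB = 2**20

def gib_to_gb(gib):
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * GIB / 1e9

source_gib, reap_gib = 80.76, 63.87
print(f"source: {gib_to_gb(source_gib):.2f} GB")
print(f"REAP25: {gib_to_gb(reap_gib):.2f} GB")
print(f"file size reduction: {1 - reap_gib / source_gib:.1%}")

saved_mib = 82697.67 - 65397.66
print(f"mapped memory saved: {saved_mib:.2f} MiB = {saved_mib * MIB / GIB:.2f} GiB")
```

The file shrinks by roughly 21% rather than 25%, presumably because only the routed experts in layers 3–42 are pruned while all other tensors are carried over unchanged.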

Expert CSV

reap_dataset_10k_balanced_seed42_reap25_experts.csv contains per-expert statistics for all 43 MoE layers. Columns:

Column	Description
`layer`	Layer index (0–42)
`expert_id`	Original expert ID in source GGUF
`new_expert_id`	Compacted expert ID after pruning (-1 if pruned)
`activation_policy`	`hash_preserved` (layers 0–2) / `router_mask_pruned` (layers 3–42)
`kept`	Whether this expert is kept in the pruned GGUF
`pruned`	Whether this expert was removed
`total_tokens`	Total observed tokens (shared per layer)
`expert_frequency`	How many times this expert was selected
`selection_rate_per_token`	expert_frequency / total_tokens
`selection_share`	Fraction of all expert selections for this layer
`reap`	Composite REAP score (activation_energy_sum2)
`gate_up_energy`	Gate/up projection energy contribution
`down_energy`	Down projection energy contribution
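As a usage sketch, the CSV can be turned into a per-layer lookup from original to compact expert ids. The code assumes a comma-delimited file with a header row that uses the column names listed above. The sample rows are made up for illustration and contain only the columns the function reads.

```python
import csv
import io
from collections import defaultdict

def load_expert_map(csv_file):
    """Return {layer: {expert_id: new_expert_id}} for kept experts only."""
    mapping = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        new_id = int(row["new_expert_id"])
        if new_id >= 0:  # -1 marks a pruned expert
            mapping[int(row["layer"])][int(row["expert_id"])] = new_id
    return dict(mapping)

# Made-up rows with only the columns used here; the real file has more columns.
sample = io.StringIO(
    "layer,expert_id,new_expert_id\n"
    "3,0,0\n"
    "3,1,-1\n"
    "3,2,1\n"
)
print(load_expert_map(sample))  # {3: {0: 0, 2: 1}}

# With the real file (path relative to the model repository):
# with open("reap_dataset_10k_balanced_seed42_reap25_experts.csv", newline="") as f:
#     expert_map = load_expert_map(f)
```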

Run With Bundled Runtime

The Metal runtime loads shader source files from metal/*.metal, so run from inside the bundled runtime directory:

cd ds4_reap_runtime

./ds4 \
  -m ../DeepSeek-V4-Flash-REAP25-REAPDataset10K-Balanced-DS4-compact-IQ2XXS.gguf \
  --ctx 512 --nothink --temp 0 -n 64 \
  -p 'Hello!'

For OpenAI-compatible local serving:

cd ds4_reap_runtime

./ds4-server \
  -m ../DeepSeek-V4-Flash-REAP25-REAPDataset10K-Balanced-DS4-compact-IQ2XXS.gguf \
  --ctx 32768 --tokens 1024 \
  --host 127.0.0.1 --port 8000
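Once the server is up, a client can talk to it over HTTP. The sketch below builds a chat request using only the Python standard library. The `/v1/chat/completions` path and the response shape follow the usual OpenAI API convention and are assumptions here, not details taken from the ds4-server documentation. Some servers also require a `model` field, so it is left optional.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://127.0.0.1:8000", model=None):
    """Build a POST request for an OpenAI-style chat completion.

    The endpoint path is assumed from the OpenAI API convention.
    """
    body = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 64}
    if model is not None:
        body["model"] = model
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_chat_request("Hello!")
print(request.full_url)

# Uncomment once ds4-server is running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```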

Comparison with LCB50 Model

Property	REAP25-LCB50	REAP25-REAPDataset10K-Balanced (this)
Calibration dataset	LiveCodeBench	REAP dataset (8 domains)
Sample count	50	10,000
Language balance	English only	50% EN / 50% KO
Domain coverage	Competitive coding	Code, Math, Science, Function-calling, SWE
Prompt tokens observed	26,386	27,592,731
Expert route selections	6,807,588	7,118,924,598
Compression	REAP25 (256→192 experts)	REAP25 (256→192 experts)
Output size	63.87 GiB	63.87 GiB
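The two observation rows in the table are related by simple arithmetic: each prompt token selects 6 experts in each of the 43 MoE layers. The check below uses only figures from this card and assumes the route-selection counts include the hash-routed layers 0–2, which is what the numbers imply.

```python
MOE_LAYERS = 43   # layers 0-42
TOP_K = 6         # experts selected per token per layer

def expected_selections(prompt_tokens):
    """Route selections expected if every token picks TOP_K experts in every layer."""
    return prompt_tokens * MOE_LAYERS * TOP_K

for name, tokens, reported in [
    ("REAP25-LCB50", 26_386, 6_807_588),
    ("REAP25-REAPDataset10K-Balanced", 27_592_731, 7_118_924_598),
]:
    print(name, expected_selections(tokens) == reported)
```

Both comparisons print `True`, so the reported selection counts are consistent with the reported token counts.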

Notes

This model is calibrated on a broader dataset than the LCB50 model. The 10K balanced dataset covers coding, math, science, function-calling, and software engineering, split evenly between Korean and English, so its expert activation statistics should be more representative of general use.

REAP pruning removes the 64 lowest-scoring routed experts in each of layers 3–42, ranked by the activation_energy_sum2 REAP score, and physically compacts the remaining 192 into a smaller GGUF. The runtime must therefore read the REAP routing metadata instead of relying on the original expert slot layout.

Format: GGUF
Model size: 220B params
Architecture: deepseek4
