Instructions to use yeonseok-zeticai/QWEN_2.5_omni_decoder with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use yeonseok-zeticai/QWEN_2.5_omni_decoder with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="yeonseok-zeticai/QWEN_2.5_omni_decoder",
	filename="Qwen2.5-Omni-7B-decoder-Q8_0.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use yeonseok-zeticai/QWEN_2.5_omni_decoder with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0
# Run inference directly in the terminal:
llama-cli -hf yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0
# Run inference directly in the terminal:
llama-cli -hf yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0

Use Docker

docker model run hf.co/yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0

LM Studio
Jan
Ollama
How to use yeonseok-zeticai/QWEN_2.5_omni_decoder with Ollama:
```
ollama run hf.co/yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0
```

Unsloth Studio new

How to use yeonseok-zeticai/QWEN_2.5_omni_decoder with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for yeonseok-zeticai/QWEN_2.5_omni_decoder to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for yeonseok-zeticai/QWEN_2.5_omni_decoder to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for yeonseok-zeticai/QWEN_2.5_omni_decoder to start chatting

Docker Model Runner
How to use yeonseok-zeticai/QWEN_2.5_omni_decoder with Docker Model Runner:
```
docker model run hf.co/yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0
```

Lemonade

How to use yeonseok-zeticai/QWEN_2.5_omni_decoder with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull yeonseok-zeticai/QWEN_2.5_omni_decoder:Q8_0

Run and chat with the model

lemonade run user.QWEN_2.5_omni_decoder-Q8_0

List all available models

lemonade list

Qwen2.5-Omni-7B Decoder-Only (GGUF)

Text decoder extracted from Qwen/Qwen2.5-Omni-7B — all vision, audio, talker, and token2wav components removed.

This is a pure text LLM (7.62B params) that runs standalone in llama.cpp without any multimodal dependencies.

Model Details

Parameter	Value
Architecture	Qwen2VL (text decoder only)
Parameters	7.62B
Hidden size	3584
Layers	28
Attention heads	28 (4 KV heads, GQA)
FFN intermediate	18944
Vocab size	152064
Max context	32768
RoPE base	1000000
Tokenizer	GPT2-style BPE

Files

File	Size	BPW	Description
`Qwen2.5-Omni-7B-decoder-Q8_0.gguf`	7.6 GB	8.50	Q8_0 quantized

How It Was Made

Extracted using llama.cpp's convert_hf_to_gguf.py which automatically:

Strips thinker. prefix from weight names
Drops all visual.*, audio.*, talker.*, token2wav.* tensors
Outputs a standard Qwen2.5 text decoder GGUF

# Step 1: Extract decoder to F16
python convert_hf_to_gguf.py Qwen/Qwen2.5-Omni-7B \
    --outfile Qwen2.5-Omni-7B-decoder-F16.gguf --outtype f16

# Step 2: Quantize to Q8_0
llama-quantize Qwen2.5-Omni-7B-decoder-F16.gguf \
    Qwen2.5-Omni-7B-decoder-Q8_0.gguf Q8_0

Usage with llama.cpp

# Benchmark
./llama-bench -m Qwen2.5-Omni-7B-decoder-Q8_0.gguf -t 6 -p 512 -n 128 -fa 1

# Text generation
./llama-cli -m Qwen2.5-Omni-7B-decoder-Q8_0.gguf -p "Hello" -n 200

# Further quantize locally
llama-quantize Qwen2.5-Omni-7B-decoder-Q8_0.gguf \
    Qwen2.5-Omni-7B-decoder-Q4_0.gguf Q4_0

Component Breakdown (Full Omni Model)

The full Qwen2.5-Omni-7B (10.73B params) consists of:

Component	Params	Description
Decoder (this repo)	7.62B	Text LLM
Vision Encoder	0.68B	ViT (32 layers)
Audio Encoder	0.64B	Whisper-style (32 layers)
Talker	1.35B	Speech decoder (24 layers)
Token2Wav	0.45B	DiT + BigVGAN vocoder

License

Apache 2.0 (same as base model)

Downloads last month: 7

GGUF

Model size

8B params

Architecture

qwen2vl

Hardware compatibility

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for yeonseok-zeticai/QWEN_2.5_omni_decoder

Base model

Qwen/Qwen2.5-Omni-7B

Quantized

(21)

this model