Instructions for using deepseek-ai/DeepSeek-V4-Flash with libraries, inference providers, notebooks, and local apps.

Libraries

How to use deepseek-ai/DeepSeek-V4-Flash with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-V4-Flash")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("deepseek-ai/DeepSeek-V4-Flash", dtype="auto")


vLLM

How to use deepseek-ai/DeepSeek-V4-Flash with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepseek-ai/DeepSeek-V4-Flash"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
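
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `ask` are hypothetical, and it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       base_url="http://localhost:8000",
                       model="deepseek-ai/DeepSeek-V4-Flash"):
    """Build (but do not send) a POST request for /v1/chat/completions.

    Hypothetical helper: it mirrors the JSON body of the curl command above.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response body; requires a running server.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (needs the server to be up):
# print(ask("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should target a server listening on port 30000 instead, as in the SGLang commands further down.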

Use Docker

docker model run hf.co/deepseek-ai/DeepSeek-V4-Flash

SGLang

How to use deepseek-ai/DeepSeek-V4-Flash with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepseek-ai/DeepSeek-V4-Flash" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepseek-ai/DeepSeek-V4-Flash" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use deepseek-ai/DeepSeek-V4-Flash with Docker Model Runner:
```
docker model run hf.co/deepseek-ai/DeepSeek-V4-Flash
```

How to run DeepSeek on Ada GPUs? Mine is an L20.

#25

by XiaoZaiyi - opened Apr 29

Discussion

XiaoZaiyi

Apr 29

Does the L20 card not support this model? I am using vLLM.

mattduerrmeier

May 1

The L20 GPU has 48 GB of memory, so you don't have enough space to load the DeepSeek-V4 models. From my understanding, you need at least ~158 GB of memory for V4-Flash.

XiaoZaiyi

May 1

> The L20 GPU has 48 GB of memory, so you don't have enough space to load the DeepSeek-V4 models. From my understanding, you need at least ~158 GB of memory for V4-Flash.

I have 8×L20, so GPU memory is enough. The architecture simply doesn't support running it.
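
The memory argument in the two posts above comes down to simple arithmetic, sketched below. The 48 GB per card and the roughly 158 GB requirement are the figures quoted in this thread, not measured values, and the calculation ignores KV cache, activations, and framework overhead, so treat the result as a lower bound. The function names are hypothetical.

```python
import math

GPU_MEMORY_GB = 48      # per L20 card, as stated in the thread
MODEL_MEMORY_GB = 158   # approximate V4-Flash requirement quoted in the thread


def min_gpus(model_gb, gpu_gb):
    """Smallest number of cards whose combined memory covers the model."""
    return math.ceil(model_gb / gpu_gb)


def total_memory_gb(num_gpus, gpu_gb):
    """Combined memory of num_gpus identical cards."""
    return num_gpus * gpu_gb


print(min_gpus(MODEL_MEMORY_GB, GPU_MEMORY_GB))   # 4
print(total_memory_gb(8, GPU_MEMORY_GB))          # 384
```

Under these assumptions a single L20 falls short, while eight cards offer 384 GB in total, which is consistent with the point that memory is not the limiting factor on an 8×L20 machine.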

S1quence

May 3

> > The L20 GPU has 48 GB of memory, so you don't have enough space to load the DeepSeek-V4 models. From my understanding, you need at least ~158 GB of memory for V4-Flash.
>
> I have 8×L20, so GPU memory is enough. The architecture simply doesn't support running it.

This PR may help you, although I have not tried it yet: https://github.com/vllm-project/vllm/pull/40906 It seems the decoding speed is not satisfactory, though.

mattduerrmeier

May 4

Unfortunately, the L20 is SM89, so it will not be officially supported by vLLM. From https://github.com/vllm-project/vllm/issues/40902:

> We don't plan to support hardwares under SM90 in the official repo since that will introduce significant maintenance overhead.

The PR is your best bet. Alternatively, start from the inference code provided with DeepSeek-V4.
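
To check where a given machine stands relative to the SM90 cutoff mentioned above, the compute capability can be compared as a `(major, minor)` tuple. The sketch below is an illustration only: `meets_minimum` and `local_capabilities` are hypothetical helpers, and the `(9, 0)` default simply encodes the SM90 threshold from the quoted issue. The probe uses PyTorch's `torch.cuda.get_device_capability` and returns an empty list when PyTorch or CUDA is unavailable.

```python
def meets_minimum(capability, minimum=(9, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`.

    SM89 corresponds to (8, 9) and SM90 to (9, 0).
    """
    return tuple(capability) >= tuple(minimum)


def local_capabilities():
    """List the compute capability of each visible CUDA device.

    Returns an empty list if PyTorch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())]


for index, cap in enumerate(local_capabilities()):
    status = "meets" if meets_minimum(cap) else "is below"
    print(f"GPU {index}: SM{cap[0]}{cap[1]} {status} the SM90 threshold")
```

On an L20 this should report SM89 as below the threshold, matching the explanation above.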

vimacs-hacks

16 days ago

KTransformers can run DeepSeek-V4 on Ada cards; I've already tried it with an RTX Ada card and an L20. However, I don't know how to configure the chat template to enable thinking.

https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepSeek-V4-Flash.md
