How to run DeepSeek on Ada GPUs? Mine is an L20.
Does the L20 card not support this model? I am using vLLM.
The L20 GPU has 48 GB of memory, so a single card does not have enough space to load the DeepSeek-V4 models. As I understand it, you need at least roughly 158 GB of memory for V4-Flash.
I have 8×L20, so GPU memory is sufficient. The architecture simply doesn't support running it.
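The memory figures in this exchange can be checked with quick arithmetic. This is a minimal sketch, assuming the roughly 158 GB estimate given earlier in the thread (not an official figure) and counting weights only, with no allowance for KV cache or activations. The helper names `gpus_needed` and `total_memory_gb` are hypothetical, introduced only for illustration.

```python
import math

L20_MEMORY_GB = 48          # per-card memory quoted in this thread
V4_FLASH_ESTIMATE_GB = 158  # rough estimate from this thread, not an official figure


def gpus_needed(model_gb: float, per_gpu_gb: float) -> int:
    """Smallest number of cards whose combined memory covers model_gb."""
    return math.ceil(model_gb / per_gpu_gb)


def total_memory_gb(num_gpus: int, per_gpu_gb: float) -> float:
    """Combined memory across num_gpus identical cards."""
    return num_gpus * per_gpu_gb


if __name__ == "__main__":
    # One L20 is far short of the estimate; at least four are needed for weights alone.
    print(gpus_needed(V4_FLASH_ESTIMATE_GB, L20_MEMORY_GB))  # 4
    # Eight L20 cards together exceed the estimate.
    print(total_memory_gb(8, L20_MEMORY_GB))                 # 384
```

So on memory alone, 8×L20 clears the estimate with room to spare, which is consistent with the point that the blocker here is architecture support rather than capacity.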
This PR may help, though I have not tried it yet: https://github.com/vllm-project/vllm/pull/40906. The decoding speed seems unsatisfactory, however.
Unfortunately, the L20 is SM89, so it will not be officially supported by vLLM. From https://github.com/vllm-project/vllm/issues/40902:
"We don't plan to support hardwares under SM90 in the official repo since that will introduce significant maintenance overhead."
The PR is your best bet. Alternatively, start from the inference code provided with DeepSeek-V4.
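To check which side of the SM90 line a given machine falls on, something like the following may help. It is a minimal sketch, assuming the SM90 threshold from the vLLM issue quoted above. `meets_minimum` and `local_capabilities` are hypothetical helper names, and the device query relies on PyTorch's `torch.cuda.get_device_capability`, returning an empty list if PyTorch or CUDA is unavailable.

```python
MIN_SUPPORTED = (9, 0)  # SM90, the threshold stated in the quoted vLLM issue


def meets_minimum(capability, minimum=MIN_SUPPORTED):
    """Compare (major, minor) compute capability tuples, e.g. (8, 9) for SM89."""
    return tuple(capability) >= tuple(minimum)


def local_capabilities():
    """Return (major, minor) for each visible CUDA device, or [] if none can be queried."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    print(meets_minimum((8, 9)))  # SM89 (Ada, e.g. L20): False
    print(meets_minimum((9, 0)))  # SM90: True
    for index, cap in enumerate(local_capabilities()):
        print(f"GPU {index}: SM{cap[0]}{cap[1]} meets minimum: {meets_minimum(cap)}")
```

On an L20 the reported capability should be `(8, 9)`, which falls below the threshold no matter how many cards are installed.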
KTransformers can run DeepSeek-V4 on Ada cards; I've already tried it with RTX Ada and L20 cards. However, I don't know how to configure the chat template to enable thinking.
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepSeek-V4-Flash.md