Unable to load gpt-oss-20b on dual L40 (48GB) GPUs with vLLM
I am trying to serve gpt-oss-20b using vLLM on a server equipped with 2 × NVIDIA L40 (48GB, PCIe) GPUs. According to the documentation, the model should fit on much smaller GPUs (≈16GB VRAM required with MXFP4 weights), so 2 × 48GB should be more than enough.
However, the model fails to load properly in my environment. Some details:
- Hardware: 2 × NVIDIA L40 (48GB each, PCIe, no NVLink)
- Software stack:
  - CUDA 12.x, driver version [insert here]
  - Python 3.12
  - vLLM 0.10.1+gptoss (installed via the official wheels.vllm.ai index)
Questions:
- Has anyone successfully loaded gpt-oss-20b with vLLM on dual L40 GPUs?
- Are there known issues with L40 (Ada Lovelace, SM 8.9) and the prebuilt vLLM wheels (e.g., missing arch flags)?
- Could PCIe-only topology (no NVLink) and ACS settings cause NCCL initialization failures in this setup?
- Is there any recommended configuration or workaround to ensure stable loading?
Any guidance or confirmation from others who tried a similar setup would be greatly appreciated. Thanks!
Load it on a single card. There is no benefit to loading it across 2 cards; in fact, you will degrade performance by doing so. CUDA cores are NOT additive in terms of "speed" if your model fits within the confines of a single card: the trip over PCIe or NVLink kills any gain you'd get in most scenarios. The only time you want to stripe a model across multiple cards is when it needs multiple cards to fit all layers in VRAM. You can load two instances of the model, one instance on each card, and run simultaneous streams of inference, but you cannot speed up a single instance by adding more cards.

Anything Ada is going to work with any Ada-generation card. There are some exceptions among much older, pre-Ampere cards, where not all of them implement all functionality, but Ampere, Ada, and Blackwell are feature-equivalent across the line, short of datacenter configurations that add NVLink/NVSwitch.
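A minimal sketch of the single-card setup described above, assuming the stock `vllm serve` CLI; model name, ports, and flags are illustrative and should be adjusted to your install:

```shell
# Pin vLLM to one L40 via CUDA_VISIBLE_DEVICES so nothing crosses PCIe.
CUDA_VISIBLE_DEVICES=0 vllm serve openai/gpt-oss-20b --port 8000

# Optionally run a second, fully independent instance on the other card.
# Two instances serve two concurrent request streams; they do not speed
# up a single request.
CUDA_VISIBLE_DEVICES=1 vllm serve openai/gpt-oss-20b --port 8001
```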
Are you able to run other models outside of this one? When you installed vLLM, were your environment variables set correctly to expose CUDA? What is the actual error message you are receiving? What have you done to troubleshoot?
I do not believe Ada is supported yet (see https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html), but I hope I am wrong.
It feels like vLLM itself is buggy; the 20b model won't even start for me on an A100 (Tesla). I wonder, was OpenAI joking when they said it can be deployed on a 16GB card, or what?
There's no need to run it anyway, it's useless...
@Denis1981 The A100 is Ampere architecture, which does not support MXFP4 natively. Use the Triton kernel attention backend:
https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#a100
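A sketch of what that looks like in practice. The backend is selected via the `VLLM_ATTENTION_BACKEND` environment variable; the exact backend string varies across vLLM versions, so verify it against the recipe linked above for your install:

```shell
# Select the Triton attention kernels (needed on Ampere, which lacks
# native MXFP4 support). Backend name is taken from the GPT-OSS recipe;
# confirm the exact value for your vLLM version.
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
  vllm serve openai/gpt-oss-20b --port 8000
```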
Hi, is the NVIDIA L40S still not supported?
Hi!
I didn't deploy gpt-oss-20b, but the 120b version on dual L40S. It took some tries, but it now works with no problems (32k max context length).
Works perfectly on L40S, honestly, with default options: https://devforth.io/insights/self-hosted-gpt-real-response-time-token-throughput-and-cost-on-l4-l40s-and-h100-for-gpt-oss-20b/
Hi, are you using vLLM directly or Triton Inference Server with the vLLM backend? Is there a way to turn off reasoning mode and disable returning reasoning tokens?
Also, did you notice a difference in latency between enable_prefix_caching=True and enable_prefix_caching=False?
Hey,
> are you using vllm
For all experiments I was using the latest version of vLLM from Docker Hub.
> is there a way to turn off reasoning mode and disable returning reasoning tokens?
Well, you can set reasoning_effort to 'low' and there will be far fewer thoughts and fewer reasoning tokens. vLLM already strips all reasoning tokens for you if you are using /v1/chat/completions, and does it pretty much perfectly for most models; for GPT 20B it works perfectly. (Important: this works only with the chat API, not with /v1/completions, which is a low-level API method where you have to strip the tokens yourself for very advanced cases.)
In the experiments I kept reasoning_effort=medium. Reasoning is awesome for most tasks, and according to AAAI the models give much better results, so I would not recommend trying to disable it at all. I would recommend reading up on reasoning vs non-reasoning models.
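A minimal sketch of setting the reasoning effort through the OpenAI-compatible chat API, using only the standard library. The endpoint URL and model name are assumptions for a local vLLM server; the `reasoning_effort` field is the gpt-oss chat parameter discussed above (check your vLLM version's docs for the exact field name):

```python
import json
from urllib import request

def build_chat_request(prompt: str, effort: str = "low") -> dict:
    """Build a /v1/chat/completions payload with reduced reasoning.

    `reasoning_effort` accepts "low" | "medium" | "high"; lower values
    produce fewer reasoning tokens before the final answer.
    """
    return {
        "model": "openai/gpt-oss-20b",          # assumed model id
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }

def send(payload: dict,
         url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload to a locally running vLLM server."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Summarize PCIe vs NVLink in one sentence.",
                             effort="low")
# response = send(payload)  # uncomment with a server running
```

With the chat endpoint, the reasoning tokens are already stripped from the returned message content, so no client-side cropping is needed.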
> also did you notice a difference in latency if you set enable_prefix_caching=True and enable_prefix_caching=False
In all benchmarks in the post I "killed" prefix caching at the root by prepending a random number to the prompt of every experiment, so there was no point in manually toggling the enable_prefix_caching option. In other words, I always benchmarked the worst case, where the prefix cache never hits. And yes, when I tried it with the prefix cache, the boost was very significant. But I did not mention it in the post because the worst case is more interesting: prefix caching generally helps only with shared system prompts and is not really predictable, so it is a nice performance bonus but not a guarantee.
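The cache-busting trick described above can be sketched in a few lines; the function name is hypothetical, but the idea is exactly what the post did, i.e. prepend a random number so no two prompts share a root prefix:

```python
import random

def bust_prefix_cache(prompt: str) -> str:
    """Prepend a random number so prompts never share a common prefix.

    vLLM's prefix cache reuses KV blocks only for a shared *leading*
    token sequence, so a random first token forces a cache miss and
    yields worst-case (no-reuse) benchmark numbers.
    """
    return f"{random.randrange(10**9)} {prompt}"

a = bust_prefix_cache("Explain KV-cache reuse.")
b = bust_prefix_cache("Explain KV-cache reuse.")
# a and b almost certainly start with different numbers, so the prefix
# cache cannot reuse KV blocks between the two requests.
```

Note that this only defeats caching at the root; if you instead want to measure the best case, send identical prompts with enable_prefix_caching=True.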