Instructions to use unsloth/gpt-oss-20b-BF16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use unsloth/gpt-oss-20b-BF16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="unsloth/gpt-oss-20b-BF16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b-BF16")
model = AutoModelForCausalLM.from_pretrained("unsloth/gpt-oss-20b-BF16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use unsloth/gpt-oss-20b-BF16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "unsloth/gpt-oss-20b-BF16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/gpt-oss-20b-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/unsloth/gpt-oss-20b-BF16

SGLang

How to use unsloth/gpt-oss-20b-BF16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "unsloth/gpt-oss-20b-BF16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/gpt-oss-20b-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "unsloth/gpt-oss-20b-BF16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/gpt-oss-20b-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio new

How to use unsloth/gpt-oss-20b-BF16 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/gpt-oss-20b-BF16 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/gpt-oss-20b-BF16 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/gpt-oss-20b-BF16 to start chatting

Load model with FastModel

pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-BF16",
    max_seq_length=2048,
)

Docker Model Runner
How to use unsloth/gpt-oss-20b-BF16 with Docker Model Runner:
```
docker model run hf.co/unsloth/gpt-oss-20b-BF16
```

Chat template differs from OpenAI's. Is it expected?

by gentry1337 - opened Aug 28, 2025

Discussion

gentry1337

Aug 28, 2025

Repro:

from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "Let's sing!"},
    {"role": "user", "content": "Because maybe"},
    {"role": "assistant", "content": "You're gonna be the one that saves me"},
    {"role": "user", "content": "And after all"},
    {"role": "assistant", "content": "You're my wonderwall"},
]

openai_tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
unsloth_tokenizer = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b-BF16")

openai_result = openai_tokenizer.apply_chat_template(chat, tokenize=False)
unsloth_result = unsloth_tokenizer.apply_chat_template(chat, tokenize=False)
if openai_result != unsloth_result:
    print("Results differ:")
    print("OpenAI result: ", openai_result)
    print("Unsloth result: ", unsloth_result)

Output:

Results differ:
OpenAI result:  <|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-28

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

Let's sing!

<|end|><|start|>user<|message|>Because maybe<|end|><|start|>assistant<|channel|>final<|message|>You're gonna be the one that saves me<|end|><|start|>user<|message|>And after all<|end|><|start|>assistant<|channel|>final<|message|>You're my wonderwall<|return|>
Unsloth result:  <|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-28

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

Let's sing!<|end|><|start|>user<|message|>Because maybe<|end|><|start|>assistant<|message|>You're gonna be the one that saves me<|end|><|start|>user<|message|>And after all<|end|><|start|>assistant<|message|>You're my wonderwall<|return|>

gentry1337

Aug 28, 2025

I've also run this test for unsloth 20b vs unsloth 120b and openai 20b vs openai 120b - unsloth tokenization even differs between 20B and 120B tokenizers while OpenAI's matches between different model sizes.

gentry1337

Aug 28, 2025

•

edited Aug 28, 2025

Looks like I've found some explanations of difference of chat templates:
https://docs.unsloth.ai/basics/gpt-oss-how-to-run-and-fine-tune#unsloth-fixes-for-gpt-oss
While this indeed explains the difference between OpenAI's chat template and Unsloth's one, it doesn't explain the difference between Unsloth 20B and Unsloth 120B GPT-OSS:

Results differ:
Unsloth 20B result:  <|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-28

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

Let's sing!<|end|><|start|>user<|message|>Because maybe<|end|><|start|>assistant<|message|>You're gonna be the one that saves me<|end|><|start|>user<|message|>And after all<|end|><|start|>assistant<|message|>You're my wonderwall<|return|>
Unsloth 120B result:  <|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-28

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

Let's sing!<|end|><|start|>user<|message|>Because maybe<|end|><|start|>assistant<|channel|>final<|message|>You're gonna be the one that saves me<|end|><|start|>user<|message|>And after all<|end|><|start|>assistant<|channel|>final<|message|>You're my wonderwall<|return|>

It seems like GPT-OSS 120B misses Calls to these tools must go to the commentary channel: instruction

GauravEA

Jan 29

I am currently fine-tuning the GPT-OSS 20B model using Unsloth with HuggingFace TRL (SFTTrainer).

Long-term goal

Serve the model in production using Triton with either vLLM or TensorRT-LLM as the backend

Short-term / initial deployment using Ollama (GGUF)

Current challenge
GPT-OSS uses a Harmony-style chat template, which includes:

developer role

Explicit EOS handling

thinking / analysis channels

Tool / function calling structure

When converting the fine-tuned model to GGUF and deploying it in Ollama using the default GPT-OSS Modelfile, I am running into ambiguity around:

Whether the default Jinja chat template provided by GPT-OSS should be modified for Ollama compatibility

How to correctly handle:

EOS token behavior

Internal reasoning / analysis channels

Developer role alignment

How to do this without degrading the model’s default performance or alignment

Constraints / Intent

I already have training data prepared strictly in system / user / assistant format

I want to:

Preserve GPT-OSS’s native behavior as much as possible

Perform accurate, non-destructive fine-tuning

Avoid hacks that work short-term but break compatibility with vLLM / TensorRT-LLM later

What I’m looking for

Has anyone successfully:

Fine-tuned GPT-OSS

Converted it to GGUF

Deployed it with Ollama

While preserving the Harmony template behavior?

If yes:

Did you modify the chat template / Modelfile?

How did you handle EOS + reasoning channels?

Any pitfalls to avoid to keep it production-ready for Triton later?

Any concrete guidance, references, or proven setups would be extremely helpful.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment