
Chat basics

Chat models are conversational models you can send a message to and receive a response. Most language models from mid-2023 onwards are chat models and may be referred to as “instruct” or “instruction-tuned” models. Models that do not support chat are often referred to as “base” or “pretrained” models.

Larger and newer models are generally more capable, but models specialized in certain domains (medical or legal text, non-English languages, etc.) can often outperform larger general-purpose models. Try leaderboards like OpenLLM and LMSys Chatbot Arena to help you identify the best model for your use case.

This guide shows you how to quickly load chat models in Transformers from the command line, how to build and format a conversation, and how to chat using the TextGenerationPipeline.

chat CLI

After you’ve installed Transformers, you can chat with a model directly from the command line. The command below launches an interactive session with a model, with a few base commands listed at the start of the session.

transformers chat Qwen/Qwen2.5-0.5B-Instruct

You can launch the CLI with arbitrary generate flags, using the format arg_1=value_1 arg_2=value_2 ...

transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10

For a full list of options, run the command below.

transformers chat -h

The chat CLI is implemented on top of the AutoClass, using tooling from text generation and chat, and it uses the transformers serve CLI under the hood.

TextGenerationPipeline

TextGenerationPipeline is a high-level text generation class with a “chat mode”. Chat mode is enabled when a conversational model is detected and the chat prompt is properly formatted.

Chat models accept a list of messages (the chat history) as the input. Each message is a dictionary with role and content keys. To start the chat, add a single user message. You can also optionally include a system message to give the model directions on how to behave.

chat = [
    {"role": "system", "content": "You are a helpful science assistant."},
    {"role": "user", "content": "Hey, can you explain gravity to me?"}
]

Create the TextGenerationPipeline and pass chat to it. For large models, setting device_map="auto" helps load the model more quickly and automatically places it on the fastest device available.

from transformers import pipeline

# dtype="auto" loads the model in the dtype it was trained in (typically bfloat16),
# and device_map="auto" places it on the fastest available device.
pipeline = pipeline(task="text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", dtype="auto", device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])

If everything works, you should see a response from the model! To continue the conversation, you need to update the chat history with the model's response. You can do this either by appending the response text to chat (using the assistant role), or by reading response[0]["generated_text"], which contains the full chat history, including the most recent response.
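For example, a minimal sketch of the first approach, reusing the chat and response objects from above, could look like this.

# Append the assistant's reply to the existing chat list manually.
assistant_reply = response[0]["generated_text"][-1]["content"]
chat.append({"role": "assistant", "content": assistant_reply})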

Once you have the model’s response, you can continue the conversation by appending a new user message to the chat history.

chat = response[0]["generated_text"]  # the full chat history, including the model's latest reply
chat.append(
    {"role": "user", "content": "Woah! But can it be reconciled with quantum mechanics?"}
)
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])

By repeating this process, you can continue the conversation as long as you like, at least until the model runs out of context window or you run out of memory.
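As a rough illustration, a simple interactive loop built on the same pipeline could look like the sketch below (the quit/exit handling is an assumption for the example, not part of the pipeline API).

# Minimal interactive chat loop; assumes the `pipeline` object created above.
chat = [{"role": "system", "content": "You are a helpful science assistant."}]
while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"quit", "exit"}:
        break
    chat.append({"role": "user", "content": user_input})
    response = pipeline(chat, max_new_tokens=512)
    chat = response[0]["generated_text"]  # keep the full history for the next turn
    print("Assistant:", chat[-1]["content"])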

Performance and memory usage

Transformers loads models in full float32 precision by default, and for an 8B model, this requires ~32GB of memory! Use the dtype="auto" argument, which generally uses bfloat16 for models that were trained with it, to reduce your memory usage.
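To see the difference yourself, you can compare the footprint of the same model loaded in float32 and with dtype="auto". The quick check below uses get_memory_footprint and carries over the model name from the earlier example; it assumes a recent Transformers version where from_pretrained accepts the dtype argument.

from transformers import AutoModelForCausalLM

# Rough memory comparison; get_memory_footprint() reports the size of the loaded weights.
model_fp32 = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
print(f"float32: {model_fp32.get_memory_footprint() / 1e9:.1f} GB")

model_auto = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct", dtype="auto")
print(f"dtype='auto': {model_auto.get_memory_footprint() / 1e9:.1f} GB")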

Refer to the Quantization docs for more information about the different quantization backends available.

To lower memory usage even further, you can quantize the model to 8-bit or 4-bit with bitsandbytes. Create a BitsAndBytesConfig with your desired quantization settings and pass it to the pipeline's model_kwargs parameter. The example below quantizes a model to 8-bit.

from transformers import pipeline, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})

In general, model size and performance are directly correlated. Larger models are slower in addition to requiring more memory because each active parameter must be read from memory for every generated token. This is a bottleneck for LLM text generation and the main options for improving generation speed are to either quantize a model or use hardware with higher memory bandwidth. Adding more compute power doesn’t meaningfully help.
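As a back-of-the-envelope illustration of why bandwidth dominates, the sketch below estimates an upper bound on generation speed; the parameter count and bandwidth figures are illustrative assumptions, not measurements.

# Every generated token requires reading all active weights from memory, so
# bandwidth divided by model size gives an upper bound on tokens per second.
num_params = 8e9            # 8B parameters (illustrative)
bytes_per_param = 2         # bfloat16
memory_bandwidth = 900e9    # ~900 GB/s, e.g. a high-end GPU (illustrative)

bytes_per_token = num_params * bytes_per_param
print(f"Upper bound: ~{memory_bandwidth / bytes_per_token:.0f} tokens/sec")  # ~56 tokens/sec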

You can also try techniques like speculative decoding, where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token at a time. This significantly alleviates the bandwidth bottleneck and improves generation speed.
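A minimal sketch of speculative decoding with generate and an assistant_model is shown below; the specific model pairing (Llama 3 8B verified against a Llama 3.2 1B draft) is an illustrative assumption.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Speculative decoding: the small draft model proposes tokens, the large model verifies them.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto", device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", dtype="auto", device_map="auto")

chat = [{"role": "user", "content": "Hey, can you explain gravity to me?"}]
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, assistant_model=assistant, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

For standard assisted generation, the draft model should share the main model's tokenizer, which is why a same-family pairing is used here.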

Mixture-of-Experts (MoE) models such as Mixtral, Qwen2MoE, and GPT-OSS have lots of parameters, but only “activate” a small fraction of them to generate each token. As a result, MoE models generally have much lower memory bandwidth requirements and can be faster than a dense LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because more parameters become activated with each new speculated token.
