Error AutoAWQ tensor 4 vllm

by fersebas - opened Jun 24, 2024

Discussion

fersebas

Jun 24, 2024

I get an error when running on 4 GPUs after quantizing my fine-tuned model with AutoAWQ, but this model works. How can I quantize my model with AWQ so that it works on 4 GPUs? I noticed that intermediate_size is different. Could you please share the code to reproduce this AWQ model?

Thanks so much.

prudant

Jul 6, 2024

same here, x2

fersebas

Jul 11, 2024

Solved by running this code and then changing intermediate_size in the config to 29696.
https://github.com/QwenLM/Qwen2/issues/578

import json
import os
from collections import OrderedDict
from typing import Dict

import torch
from safetensors import safe_open
from safetensors.torch import save_file
from tqdm import tqdm
# shard_checkpoint is imported as in the original post; newer transformers
# releases may no longer provide it.
from transformers.modeling_utils import (
    SAFE_WEIGHTS_INDEX_NAME,
    SAFE_WEIGHTS_NAME,
    WEIGHTS_INDEX_NAME,
    WEIGHTS_NAME,
    shard_checkpoint,
)


def save_weight(input_dir: str, output_dir: str, shard_size: str, save_safetensors: bool) -> str:
    # Load every tensor from the .safetensors shards in input_dir onto the CPU.
    qwen_state_dict: Dict[str, torch.Tensor] = OrderedDict()
    for filepath in tqdm(os.listdir(input_dir), desc="Load weights"):
        if os.path.isfile(os.path.join(input_dir, filepath)) and filepath.endswith(".safetensors"):
            with safe_open(os.path.join(input_dir, filepath), framework="pt", device="cpu") as f:
                for key in f.keys():
                    qwen_state_dict[key] = f.get_tensor(key)

    # Zero-pad every 2-D weight whose dimension is 29568 up to 29696 (29568 + 128).
    qwen2_state_dict: Dict[str, torch.Tensor] = OrderedDict()
    torch_dtype = None
    for key, value in tqdm(qwen_state_dict.items(), desc="Convert format"):
        if torch_dtype is None:
            torch_dtype = value.dtype
        shape_list = [int(i) for i in value.shape]
        if len(shape_list) == 2:
            if shape_list[0] == 29568:
                # Pad 128 extra rows of zeros.
                value = torch.cat((value, torch.zeros([128, shape_list[1]], dtype=value.dtype)), dim=0)
            if shape_list[1] == 29568:
                # Pad 128 extra columns of zeros.
                value = torch.cat((value, torch.zeros([shape_list[0], 128], dtype=value.dtype)), dim=1)
        qwen2_state_dict[key] = value

    # Re-shard the padded weights and write them to output_dir.
    os.makedirs(output_dir, exist_ok=True)
    weights_name = SAFE_WEIGHTS_NAME if save_safetensors else WEIGHTS_NAME
    shards, index = shard_checkpoint(qwen2_state_dict, max_shard_size=shard_size, weights_name=weights_name)

    for shard_file, shard in tqdm(shards.items(), desc="Save weights"):
        if save_safetensors:
            save_file(shard, os.path.join(output_dir, shard_file), metadata={"format": "pt"})
        else:
            torch.save(shard, os.path.join(output_dir, shard_file))

    if index is None:
        print("Model weights saved in {}".format(os.path.join(output_dir, weights_name)))
    else:
        index_name = SAFE_WEIGHTS_INDEX_NAME if save_safetensors else WEIGHTS_INDEX_NAME
        with open(os.path.join(output_dir, index_name), "w", encoding="utf-8") as f:
            json.dump(index, f, indent=2, sort_keys=True)
        print("Model weights saved in {}".format(output_dir))

    # Return the dtype name without the "torch." prefix, e.g. "bfloat16".
    return str(torch_dtype).replace("torch.", "")
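
The config change mentioned above can be sketched as a small helper. This is only an illustration: the function name `set_intermediate_size` and the path in the usage note are hypothetical, and it assumes the padded model directory already contains a copy of the original config.json (the padding function above writes weights only).

```python
import json
from pathlib import Path


def set_intermediate_size(config_path: str, new_size: int = 29696) -> int:
    """Rewrite intermediate_size in a config.json and return the previous value.

    Hypothetical helper; 29696 is the padded size named in this thread.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_size = config.get("intermediate_size")
    config["intermediate_size"] = new_size
    # Write the file back with the same top-level keys, only the size changed.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_size
```

For example, `set_intermediate_size("padded-model/config.json")` would be run once on the padded copy before quantizing; the directory name here is a placeholder.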

fersebas

Jul 12, 2024 (edited Jul 12, 2024)

@prudant try it and tell me if you have any problems. We finally got it working on 4 GPUs. You have to pad the model with that code, then change intermediate_size in config.json to 29696, then quantize with AWQ using a group size of 128.
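
One plausible reading of why the padding helps (an assumption, not something stated in this thread): with tensor parallelism, intermediate_size is split evenly across GPUs, and each GPU's slice has to contain a whole number of 128-wide quantization groups. The sketch below checks that arithmetic; the function names are made up for illustration.

```python
def shards_align(intermediate_size: int, tp_size: int, group_size: int = 128) -> bool:
    """True if each tensor-parallel slice holds a whole number of quantization groups."""
    if intermediate_size % tp_size != 0:
        return False
    return (intermediate_size // tp_size) % group_size == 0


def padded_size(intermediate_size: int, tp_size: int, group_size: int = 128) -> int:
    """Smallest size >= intermediate_size that is a multiple of tp_size * group_size."""
    multiple = tp_size * group_size
    return -(-intermediate_size // multiple) * multiple  # ceiling division


print(shards_align(29568, 4))  # 29568 / 4 = 7392, and 7392 / 128 = 57.75 -> False
print(shards_align(29696, 4))  # 29696 / 4 = 7424, and 7424 / 128 = 58    -> True
print(padded_size(29568, 4))   # -> 29696
```

Under this reading, 29696 is simply the next multiple of 4 × 128 = 512 above 29568. With AutoAWQ, the group size is normally passed as `q_group_size` in the `quant_config` dict given to `model.quantize`.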

prudant

Jul 13, 2024

thanks! will try! regards
