## How to run it
There are two ways of running these models: using Hugging Face Transformers (with Accelerate) or using vLLM.
### Set up the environment
For HF:
```bash
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install fbgemm-gpu==0.8.0rc4
# Download and unzip the enablement fork: https://huggingface.co/sllhf/transformers_enablement_fork/tree/main
# Or clone transformers from GitHub
cd transformers
# Add the changes from this PR: https://github.com/huggingface/transformers/pull/32047
git fetch origin pull/32047/head:new-quant-method
git merge new-quant-method
pip install -e .
# Install accelerate from main
cd ..
git clone https://github.com/huggingface/accelerate.git
cd accelerate
pip install -e .
```
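Before moving on, it can help to confirm that the key packages import and that a CUDA device is visible. The snippet below is a minimal sanity-check sketch, not part of the original instructions; the note about needing a recent (Hopper-class) GPU for the FP8 kernels is an assumption.

```python
# Optional sanity check (a hedged sketch, not part of the original setup steps).
# Assumption: the fbgemm-gpu FP8 kernels expect a recent NVIDIA GPU (Hopper-class).
import torch
import fbgemm_gpu  # noqa: F401  -- fails here if fbgemm-gpu is not installed

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("compute capability:", torch.cuda.get_device_capability(0))
```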
For vLLM: install from main or use the nightly wheel; see https://docs.vllm.ai/en/latest/getting_started/installation.html
### Load back the HF model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sllhf/Meta-Llama-3.1-405B-Instruct-FP8"

# The FP8 checkpoint already ships with its FBGEMM FP8 quantization config, so
# an explicit FbgemmFp8Config is only needed when quantizing a model yourself.
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Make sure to set your own generation params (temperature, top_p, etc.).
output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
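Continuing from the snippet above, here is a minimal sketch of what setting those sampling parameters can look like with `generate`; the values are illustrative, not recommended settings.

```python
# Sampled decoding with explicit parameters (illustrative values only).
output = quantized_model.generate(
    **input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```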
### Run it with vLLM
Follow the entrypoints documented at https://docs.vllm.ai/
For example:
```python
from vllm import LLM

model = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8, max_model_len=8192)
print(model.generate(["Hi there!"]))
```
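As with the Transformers example, you will want to set your own sampling parameters. Below is a minimal sketch using vLLM's `SamplingParams`; the values are illustrative only.

```python
from vllm import LLM, SamplingParams

# Illustrative sampling settings; tune temperature/top_p/max_tokens yourself.
llm = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8, max_model_len=8192)
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128)

outputs = llm.generate(["What are we having for dinner?"], params)
for out in outputs:
    print(out.outputs[0].text)
```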