---
license: apache-2.0
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
---

Llama-3.1-8B-Instruct-KV-Cache-FP8

Model Overview

  • Model Architecture: Llama-3.1-8B-Instruct
    • Input: Text
    • Output: Text
  • Release Date:
  • Version: 1.0
  • Model Developers: Red Hat

FP8 KV Cache Quantization of meta-llama/Llama-3.1-8B-Instruct.

Model Optimizations

This model was obtained by quantizing the attention key-value (KV) cache of meta-llama/Llama-3.1-8B-Instruct to the FP8 data type, reducing the memory consumed by the KV cache during inference.
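To make the optimization concrete, the sketch below shows the per-tensor scale-and-cast arithmetic that FP8 (E4M3) KV-cache quantization relies on. This is an illustration only, not the exact recipe used to produce this checkpoint; it assumes a recent PyTorch build with torch.float8_e4m3fn support, and the tensor shapes and scale choice are made up for the example.

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_fp8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize a tensor to FP8 E4M3 with a single per-tensor scale."""
    scale = x.abs().amax().float().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate higher-precision tensor from FP8 values and scale."""
    return x_fp8.to(torch.float32) * scale


# Example: a synthetic key tensor of shape (batch, heads, seq_len, head_dim)
keys = torch.randn(1, 8, 128, 128, dtype=torch.float16)
keys_fp8, k_scale = quantize_fp8(keys)
keys_restored = dequantize_fp8(keys_fp8, k_scale)
print("max abs error:", (keys - keys_restored).abs().max().item())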

Deployment

Use with vLLM

  1. Initialize vLLM server:
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor_parallel_size 1
  2. Send requests to the server:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]


outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
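As an alternative to the OpenAI-compatible server, the model can also be run offline through vLLM's Python API. The following is a minimal sketch assuming a recent vLLM release (LLM.chat is available in newer versions); the prompt and sampling parameters are illustrative only.

# Offline inference sketch using vLLM's Python API (illustrative, not the only way
# to deploy this model).
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# LLM.chat applies the model's chat template before generating.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)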

Evaluation

The model was evaluated on the RULER and LongBench long-context benchmarks using lm-evaluation-harness, with vLLM as the inference engine for all evaluations.
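For reference, an evaluation of this kind could be launched through lm-evaluation-harness's Python API with the vLLM backend, as sketched below. The task names and arguments shown are assumptions for illustration, not the exact configuration used to produce the numbers reported in this card.

# Hypothetical evaluation launch via lm-evaluation-harness's Python API.
# Task names below are assumed examples of RULER-style subtasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8,tensor_parallel_size=1",
    tasks=["niah_single_1", "niah_single_2", "niah_single_3"],
)
print(results["results"])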

Accuracy

| Category     | Metric          | meta-llama/Llama-3.1-8B-Instruct | nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8 | Recovery (%) |
| ------------ | --------------- | -------------------------------- | --------------------------------------------- | ------------ |
| LongBench V1 | Task 1          | abc                              | ijk                                            | xyz          |
| NIAH         | niah_single_1   | abc                              | ijk                                            | xyz          |
|              | niah_single_2   | abc                              | ijk                                            | xyz          |
|              | niah_single_3   | abc                              | ijk                                            | xyz          |
|              | niah_multikey_1 | abc                              | ijk                                            | xyz          |
|              | niah_multikey_2 | abc                              | ijk                                            | xyz          |
|              | niah_multikey_3 | abc                              | ijk                                            | xyz          |
|              | Average Score   | abc                              | ijk                                            | xyz          |