Llama-3.1-8B-Instruct-KV-Cache-FP8

Model Overview

  • Model Architecture: LlamaForCausalLM
    • Input: Text
    • Output: Text
  • Release Date:
  • Version: 1.0
  • Model Developers: Red Hat

FP8 KV Cache Quantization of meta-llama/Llama-3.1-8B-Instruct.

Model Optimizations

This model was obtained by quantizing the KV cache of meta-llama/Llama-3.1-8B-Instruct to the FP8 data type; the model weights and activations remain in their original BF16 precision.
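
Quantizing only the KV cache roughly halves its memory footprint relative to BF16 (for Llama-3.1-8B, with 32 layers, 8 KV heads, and head dimension 128, per-token KV storage drops from 128 KiB to 64 KiB), allowing longer contexts and more concurrent sequences on the same GPU. The snippet below is a minimal sketch of how such a checkpoint can be produced with llm-compressor, modeled on its published FP8 KV-cache example; the calibration dataset, sample count, and sequence length are illustrative assumptions, not the recorded configuration for this model.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
NUM_CALIBRATION_SAMPLES = 512  # assumed, following llm-compressor's example
MAX_SEQUENCE_LENGTH = 2048     # assumed

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data is needed to compute static FP8 scales for the KV cache
# (dataset choice is illustrative).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Recipe that quantizes only the KV cache to FP8; weights and activations stay BF16.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Llama-3.1-8B-Instruct-KV-Cache-FP8", save_compressed=True)
tokenizer.save_pretrained("Llama-3.1-8B-Instruct-KV-Cache-FP8")

The kv_cache_scheme entry is what distinguishes this recipe from weight or activation quantization: only the attention key/value tensors are calibrated and stored in FP8.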

Deployment

Use with vLLM

  1. Initialize the vLLM server:
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor_parallel_size 1
  2. Send requests to the server:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]


outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
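
vLLM also supports offline (in-process) inference without the OpenAI-compatible server. Below is a minimal sketch, assuming a recent vLLM release; the sampling parameters are illustrative.

from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# chat() applies the model's chat template before generation.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)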

Evaluation

The model was evaluated on the RULER and LongBench long-context benchmarks using lm-evaluation-harness, with vLLM as the inference backend for all evaluations.
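
For reference, this kind of evaluation can be launched through lm-evaluation-harness's vLLM backend. The command below is a sketch only: the task names (taken from the RULER metrics in the table below) and arguments are assumptions, not the exact configuration used to produce the reported numbers.

lm_eval \
  --model vllm \
  --model_args pretrained=RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8,tensor_parallel_size=1 \
  --tasks niah_single_1,niah_single_2,niah_single_3,niah_multikey_1,niah_multikey_2,niah_multikey_3 \
  --batch_size auto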

Accuracy

| Category | Metric | meta-llama/Llama-3.1-8B-Instruct | nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8 | Recovery (%) |
|----------|--------|----------------------------------|-----------------------------------------------|--------------|
| LongBench V1 | Task 1 | abc | ijk | xyz |
| NIAH | niah_single_1 | abc | ijk | xyz |
| NIAH | niah_single_2 | abc | ijk | xyz |
| NIAH | niah_single_3 | abc | ijk | xyz |
| NIAH | niah_multikey_1 | abc | ijk | xyz |
| NIAH | niah_multikey_2 | abc | ijk | xyz |
| NIAH | niah_multikey_3 | abc | ijk | xyz |
|  | Average Score | abc | ijk | xyz |