---
license: apache-2.0
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
---

Llama-3.1-8B-Instruct-KV-Cache-FP8

Model Overview

  • Model Architecture: Llama-3.1-8B-Instruct
    • Input: Text
    • Output: Text
  • Release Date:
  • Version: 1.0
  • Model Developers: Red Hat

FP8 KV Cache Quantization of meta-llama/Llama-3.1-8B-Instruct.

Model Optimizations

This model was obtained by quantizing the attention key-value (KV) cache of meta-llama/Llama-3.1-8B-Instruct to the FP8 data type, reducing the memory consumed by the KV cache during inference.
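To make the optimization concrete, the sketch below shows the per-tensor scale-and-cast arithmetic that FP8 (E4M3) KV-cache quantization relies on. This is an illustration only, not the exact recipe used to produce this checkpoint; it assumes a recent PyTorch build with torch.float8_e4m3fn support, and the tensor shapes and scale choice are made up for the example.

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_fp8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize a tensor to FP8 E4M3 with a single per-tensor scale."""
    scale = x.abs().amax().float().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate higher-precision tensor from FP8 values and scale."""
    return x_fp8.to(torch.float32) * scale


# Example: a synthetic key tensor of shape (batch, heads, seq_len, head_dim)
keys = torch.randn(1, 8, 128, 128, dtype=torch.float16)
keys_fp8, k_scale = quantize_fp8(keys)
keys_restored = dequantize_fp8(keys_fp8, k_scale)
print("max abs error:", (keys - keys_restored).abs().max().item())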

Deployment

Use with vLLM

  1. Initialize vLLM server:
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor_parallel_size 1
  2. Send requests to the server:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]


outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
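As an alternative to the OpenAI-compatible server, the model can also be run offline through vLLM's Python API. The following is a minimal sketch assuming a recent vLLM release (LLM.chat is available in newer versions); the prompt and sampling parameters are illustrative only.

# Offline inference sketch using vLLM's Python API (illustrative, not the only way
# to deploy this model).
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# LLM.chat applies the model's chat template before generating.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)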

Evaluation

The model was evaluated on the RULER and LongBench long-context benchmarks using lm-evaluation-harness, with vLLM as the inference engine for all evaluations.
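For reference, an evaluation of this kind could be launched through lm-evaluation-harness's Python API with the vLLM backend, as sketched below. The task names and arguments shown are assumptions for illustration, not the exact configuration used to produce the numbers reported in this card.

# Hypothetical evaluation launch via lm-evaluation-harness's Python API.
# Task names below are assumed examples of RULER-style subtasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8,tensor_parallel_size=1",
    tasks=["niah_single_1", "niah_single_2", "niah_single_3"],
)
print(results["results"])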

Accuracy

| Category     | Metric          | meta-llama/Llama-3.1-8B-Instruct | nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8 | Recovery (%) |
| ------------ | --------------- | -------------------------------- | --------------------------------------------- | ------------ |
| LongBench V1 | Task 1          | abc                              | ijk                                            | xyz          |
| NIAH         | niah_single_1   | abc                              | ijk                                            | xyz          |
|              | niah_single_2   | abc                              | ijk                                            | xyz          |
|              | niah_single_3   | abc                              | ijk                                            | xyz          |
|              | niah_multikey_1 | abc                              | ijk                                            | xyz          |
|              | niah_multikey_2 | abc                              | ijk                                            | xyz          |
|              | niah_multikey_3 | abc                              | ijk                                            | xyz          |
|              | Average Score   | abc                              | ijk                                            | xyz          |