Llama-3.1-8B-Instruct-KV-Cache-FP8
Model Overview
- Model Architecture: Llama-3.1-8B-Instruct
- Input: Text
- Output: Text
- Release Date:
- Version: 1.0
- Model Developers: Red Hat
FP8 KV Cache Quantization of meta-llama/Llama-3.1-8B-Instruct.
Model Optimizations
This model was obtained by quantizing the KV cache of meta-llama/Llama-3.1-8B-Instruct to the FP8 data type.
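A checkpoint like this can be produced with the llm-compressor library. The sketch below is a minimal, illustrative example; the calibration dataset, sample count, sequence length, and recipe details are assumptions and not necessarily the exact settings used to create this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Small chat-style calibration set used to compute the FP8 KV-cache scales
# (dataset choice and sample count are illustrative assumptions).
NUM_SAMPLES, MAX_LEN = 512, 4096
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Recipe that quantizes only the KV cache to FP8; weights and activations are untouched.
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      kv_cache_scheme:
        num_bits: 8
        type: float
        strategy: tensor
        dynamic: false
        symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("Llama-3.1-8B-Instruct-KV-Cache-FP8", save_compressed=True)
tokenizer.save_pretrained("Llama-3.1-8B-Instruct-KV-Cache-FP8")
```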
Deployment
Use with vLLM
- Initialize vLLM server:
```bash
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor_parallel_size 1
```
- Send requests to the server:
```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"
messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
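The model can also be run offline through vLLM's Python API, without starting a server. The snippet below is a minimal sketch; the sampling parameters are illustrative, and depending on the vLLM version you may need to pass kv_cache_dtype="fp8" explicitly for the FP8 KV-cache scales to take effect (treat that flag as an assumption and check the vLLM documentation).

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint; on some vLLM versions, kv_cache_dtype="fp8"
# may be needed to activate the FP8 KV cache (assumption, verify for your setup).
llm = LLM(model="RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8", tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```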
Evaluation
The model was evaluated on the RULER and LongBench long-context benchmarks using lm-evaluation-harness, with vLLM as the inference engine for all evaluations.
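A run of this form can be reproduced with lm-evaluation-harness's Python API and its vLLM backend. The sketch below only shows the general shape; the task names, context length, and other arguments are assumptions and may differ from the configuration actually used.

```python
import lm_eval

# Evaluate the quantized checkpoint with the vLLM backend of lm-evaluation-harness.
# Task names and max_model_len below are illustrative assumptions.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8,"
        "tensor_parallel_size=1,max_model_len=32768"
    ),
    tasks=["niah_single_1", "niah_single_2", "longbench"],
)
print(results["results"])
```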
Accuracy
| Category | Metric | meta-llama/Llama-3.1-8B-Instruct | nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8 | Recovery (%) |
|---|---|---|---|---|
| LongBench V1 | Task 1 | abc | ijk | xyz |
| NIAH | niah_single_1 | abc | ijk | xyz |
| | niah_single_2 | abc | ijk | xyz |
| | niah_single_3 | abc | ijk | xyz |
| | niah_multikey_1 | abc | ijk | xyz |
| | niah_multikey_2 | abc | ijk | xyz |
| | niah_multikey_3 | abc | ijk | xyz |
| | Average Score | abc | ijk | xyz |
Model tree for nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8
- Base model: meta-llama/Llama-3.1-8B
- Finetuned from: meta-llama/Llama-3.1-8B-Instruct