---
license: llama3.1
tags:
- gguf
- llama3
pipeline_tag: text-generation
datasets:
- froggeric/imatrix
language:
- en
library_name: ggml
---
# Meta-Llama-3.1-405B-Instruct-GGUF
Low-bit quantizations of Meta's Llama 3.1 405B Instruct model, quantized from the Ollama q4_0 GGUF with llama.cpp b3449.
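For context, requantizations like these are produced with llama.cpp's `llama-quantize` tool. The sketch below is illustrative only: the input/output file names are placeholders, and `--allow-requantize` is included because the source file is already a q4_0 quant.

```
# Illustrative only: requantize an existing q4_0 GGUF to Q3_K_M with llama.cpp (b3449+).
# File names are placeholders; --allow-requantize permits quantizing an already-quantized model.
./llama-quantize --allow-requantize \
  Meta-Llama-3.1-405B-Instruct.Q4_0.gguf \
  Meta-Llama-3.1-405B-Instruct.Q3_K_M.gguf \
  Q3_K_M
```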
| Quant | Notes |
|---|---|
| BF16 | Brain floating point, very high quality, same size as F16 |
| Q8_0 | 8-bit quantization, high quality, larger size |
| Q6_K | 6-bit quantization, very good quality-to-size ratio |
| Q5_K | 5-bit quantization, good balance of quality and size |
| Q5_0 | Alternative 5-bit quantization, slightly different balance |
| Q4_K_M | 4-bit quantization, good for production use |
| Q4_K_S | 4-bit quantization, faster inference, efficient for scaling |
| Q4_0 | Basic 4-bit quantization, good for experimentation |
| Q3_K_L | 3-bit quantization, high-quality with more VRAM requirement |
| Q3_K_M | 3-bit quantization, good balance between speed and accuracy |
| Q3_K_S | 3-bit quantization, faster inference with minor quality loss |
| Q2_K | 2-bit quantization, suitable for general inference tasks |
| IQ2_S | 2-bit i-quant, optimized for small VRAM environments |
| IQ2_XXS | 2-bit i-quant, best for ultra-low memory footprint |
| IQ1_M | 1-bit i-quant, usable |
| IQ1_S | 1-bit i-quant, not recommended |
For higher quality quantizations (q4+), please refer to nisten/meta-405b-instruct-cpu-optimized-gguf.
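To try one of these files locally, a minimal `llama-cli` invocation looks like the sketch below. The .gguf file name is a placeholder (for split files, point at the first shard), and the context size and GPU offload should be adjusted to your hardware.

```
# Illustrative only: run a quant with llama.cpp's CLI (b3449+).
# -c sets the context size, -n the number of tokens to generate, -ngl the layers offloaded to GPU.
./llama-cli \
  -m Meta-Llama-3.1-405B-Instruct.Q2_K.gguf \
  -p "Explain Q4_K_M vs Q3_K_M quantization in one paragraph." \
  -n 256 -c 4096 -ngl 0
```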
Regarding the smaug-bpe pre-tokenizer: it makes no practical difference here, since the smaug-bpe and llama-bpe pre-tokenizers are identical. However, if you have concerns, you can use the following command to set the pre-tokenizer to llama-bpe:
```
./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" Llama-3.1-405B-Instruct-old.gguf Llama-3.1-405B-Instruct-fixed.gguf
```
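To check which pre-tokenizer a file actually carries (before or after the rewrite above), the `gguf_dump.py` script from the same gguf-py package can print the metadata. This is a sketch; the file name is a placeholder.

```
# Illustrative only: dump metadata (skipping tensor info) and look for the pre-tokenizer key.
./gguf-py/scripts/gguf_dump.py --no-tensors Llama-3.1-405B-Instruct-fixed.gguf | grep tokenizer.ggml.pre
```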
## imatrix
Generated from the Q2_K quant.
imatrix calibration data: groups_merged.txt
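For reference, an importance matrix like this is computed with llama.cpp's `llama-imatrix` tool; the sketch below assumes the Q2_K quant and calibration file sit in the working directory (file names are placeholders). The resulting `imatrix.dat` can then be passed to `llama-quantize` via `--imatrix` when producing the low-bit quants.

```
# Illustrative only: compute an importance matrix from the Q2_K quant using groups_merged.txt.
./llama-imatrix \
  -m Meta-Llama-3.1-405B-Instruct.Q2_K.gguf \
  -f groups_merged.txt \
  -o imatrix.dat
```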
