Update README.md
README.md CHANGED
@@ -33,7 +33,7 @@ base_model: meta-llama/Meta-Llama-3.1-405B-Instruct
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) with the updated 8 kv-heads.
-It achieves an average score of 86.
+It achieves an average score of 86.78 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.
 
 ### Model Optimizations
 
@@ -118,11 +118,11 @@ model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
 model_name = model_stub.split("/")[-1]
 
 device_map = calculate_offload_device_map(
-    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=
+    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype="auto"
 )
 
 model = SparseAutoModelForCausalLM.from_pretrained(
-    model_stub, torch_dtype=
+    model_stub, torch_dtype="auto", device_map=device_map
 )
 tokenizer = AutoTokenizer.from_pretrained(model_stub)
 
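For context, here is a minimal sketch of how the corrected loading step might be assembled end to end. The import paths are an assumption based on the llm-compressor package layout (the README's earlier lines are not shown in this diff), so treat them as illustrative rather than authoritative:

```python
# Sketch only: the import paths below are assumptions about the
# llm-compressor package layout and may differ across versions.
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
model_name = model_stub.split("/")[-1]

# Plan a device map that spreads the 405B weights across 8 GPUs and offloads
# the remainder; no Hessian scratch memory is reserved for this recipe.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype="auto"
)

# torch_dtype="auto" loads the checkpoint in its native precision.
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype="auto", device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)
```

Note that passing the planned `device_map` into `from_pretrained` (the second fix in this hunk) is what actually applies the offload plan; without it, the `calculate_offload_device_map` call has no effect on loading.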
@@ -193,9 +193,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
 </td>
 <td>87.41
 </td>
-<td>87.
+<td>87.41
 </td>
-<td>
+<td>100.0%
 </td>
 </tr>
 <tr>
@@ -203,9 +203,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
 </td>
 <td>88.11
 </td>
-<td>
+<td>88.02
 </td>
-<td>99.
+<td>99.90%
 </td>
 </tr>
 <tr>
@@ -213,9 +213,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
 </td>
 <td>94.97
 </td>
-<td>94.
+<td>94.88
 </td>
-<td>
+<td>99.91%
 </td>
 </tr>
 <tr>
@@ -223,9 +223,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
 </td>
 <td>95.98
 </td>
-<td>
+<td>96.29
 </td>
-<td>
+<td>100.3%
 </td>
 </tr>
 <tr>
@@ -233,9 +233,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
 </td>
 <td>88.54
 </td>
-<td>88.
+<td>88.54
 </td>
-<td>
+<td>100.0%
 </td>
 </tr>
 <tr>
@@ -243,9 +243,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
 </td>
 <td>87.21
 </td>
-<td>
+<td>86.98
 </td>
-<td>
+<td>99.74%
 </td>
 </tr>
 <tr>
@@ -253,9 +253,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
 </td>
 <td>65.31
 </td>
-<td>
+<td>65.33
 </td>
-<td>
+<td>100.0%
 </td>
 </tr>
 <tr>
@@ -263,9 +263,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
 </td>
 <td><strong>86.79</strong>
 </td>
-<td><strong>86.
+<td><strong>86.78</strong>
 </td>
-<td><strong>99.
+<td><strong>99.99%</strong>
 </td>
 </tr>
 </table>
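One observation on the filled-in table: the Recovery column is consistent with quantized score ÷ unquantized score, expressed as a percentage. A quick sketch to verify (values copied from the rows above; the rounding convention is my assumption):

```python
# Recovery = quantized score / unquantized (baseline) score, as a percentage.
# Scores are copied from the table above; rounding to two decimals is assumed.
def recovery(quantized: float, baseline: float) -> str:
    return f"{quantized / baseline * 100:.2f}%"

print(recovery(86.78, 86.79))  # 99.99% -- average score row
print(recovery(88.02, 88.11))  # 99.90%
print(recovery(96.29, 95.98))  # 100.32% (the table rounds this to 100.3%)
```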