Instructions for using nvidia/Llama-3_1-Nemotron-51B-Instruct with libraries and local apps.

Libraries

How to use nvidia/Llama-3_1-Nemotron-51B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/Llama-3_1-Nemotron-51B-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("nvidia/Llama-3_1-Nemotron-51B-Instruct", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use nvidia/Llama-3_1-Nemotron-51B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Llama-3_1-Nemotron-51B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Llama-3_1-Nemotron-51B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
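
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on localhost:8000 and that the response follows the usual OpenAI-style layout; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "nvidia/Llama-3_1-Nemotron-51B-Instruct"


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first reply.

    Assumption: the response body has the OpenAI-style shape
    choices[0].message.content.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("What is the capital of France?"))` should return the reply as a string. Passing `base_url="http://localhost:30000"` targets a server on port 30000 instead.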

Use Docker

docker model run hf.co/nvidia/Llama-3_1-Nemotron-51B-Instruct

SGLang

How to use nvidia/Llama-3_1-Nemotron-51B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Llama-3_1-Nemotron-51B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Llama-3_1-Nemotron-51B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Llama-3_1-Nemotron-51B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Llama-3_1-Nemotron-51B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/Llama-3_1-Nemotron-51B-Instruct with Docker Model Runner:
```
docker model run hf.co/nvidia/Llama-3_1-Nemotron-51B-Instruct
```

flash_attention_utils_backward_compat

by itlevy - opened Sep 24, 2024

base: refs/heads/main ← from: refs/pr/2

Files changed (5): +20 -123

NOTICE +0 -5
README.md +3 -5
modeling_decilm.py +6 -51
transformers_4_44_2__modeling_flash_attention_utils_backward_compat.py +2 -48
variable_cache.py +9 -14

NOTICE DELETED

@@ -1,5 +0,0 @@
-Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-NVIDIA CORPORATION, its affiliates and licensors retain all intellectual property and proprietary rights in and to this material, related documentation and any modifications thereto. Any use, reproduction, disclosure or distribution of this material and related documentation without an express license agreement from NVIDIA CORPORATION or its affiliates is strictly prohibited.
-Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.

README.md CHANGED

@@ -8,9 +8,9 @@ tags:
   - llama-3
   - pytorch
 license: other
-license_name: nvidia-open-model-license
 license_link: >-
-  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
 ---
 # Llama-3_1-Nemotron-51B-instruct
@@ -22,8 +22,7 @@ Llama-3_1-Nemotron-51B-instruct is a model which offers a great tradeoff between
 ## License
-Your use of this model is governed by the [NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
-Additional Information: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/). Built with Llama.
 ## How was the model developed
@@ -33,7 +32,6 @@ The KD step included 40 billion tokens consisting of a mixture of 3 datasets - F
 Links to [NIM](https://build.nvidia.com/nvidia/llama-3_1-nemotron-51b-instruct), [blog](https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/) and [huggingface](https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct)
 This results in a final model that is aligned for human chat preferences.
 **Model Developers:** NVIDIA

   - llama-3
   - pytorch
 license: other
+license_name: nvidia-ai-foundation-models-community-license
 license_link: >-
+  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/
 ---
 # Llama-3_1-Nemotron-51B-instruct
 ## License
+[NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/). Additional Information: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/). Built with Llama.
 ## How was the model developed
 Links to [NIM](https://build.nvidia.com/nvidia/llama-3_1-nemotron-51b-instruct), [blog](https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/) and [huggingface](https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct)
 This results in a final model that is aligned for human chat preferences.
 **Model Developers:** NVIDIA

modeling_decilm.py CHANGED

@@ -25,7 +25,7 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 from transformers import GenerationConfig
-from transformers.generation.utils import NEED_SETUP_CACHE_CLASSES_MAPPING, GenerationMixin, GenerateOutput
 from transformers.modeling_utils import PreTrainedModel
 from transformers.utils import (
     add_start_docstrings,
@@ -385,6 +385,7 @@ class DeciLMAttention(nn.Module):
             **kwargs,
     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
         bsz, q_len, _ = hidden_states.size()
         if self.config.pretraining_tp > 1:
             key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
             query_slices = self.q_proj.weight.split(
@@ -496,6 +497,7 @@ class DeciLMFlashAttention2(DeciLMAttention):
                 "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` "
                 "make sure to use `sdpa` in the mean time, and open an issue at https://github.com/huggingface/transformers"
             )
         output_attentions = False
         bsz, q_len, _ = hidden_states.size()
@@ -833,13 +835,10 @@ class DeciLMPreTrainedModel(PreTrainedModel):
                 module.weight.data[module.padding_idx].zero_()
     def _prepare_generation_config(
-            self,
-            generation_config: Optional[GenerationConfig],
-            *args,
-            **kwargs,
     ) -> tuple[GenerationConfig, dict]:
         # DeciLM-specific code
-        generation_config, model_kwargs = super()._prepare_generation_config(generation_config, *args, **kwargs)
         generation_config.cache_implementation = "variable"
         NEED_SETUP_CACHE_CLASSES_MAPPING["variable"] = VariableCache
         return generation_config, model_kwargs
@@ -1134,7 +1133,7 @@ class DeciLMModel(DeciLMPreTrainedModel):
         return causal_mask
-class DeciLMForCausalLM(DeciLMPreTrainedModel, GenerationMixin):
     _tied_weights_keys = ["lm_head.weight"]
     def __init__(self, config):
@@ -1314,50 +1313,6 @@ class DeciLMForCausalLM(DeciLMPreTrainedModel, GenerationMixin):
         )
         return model_inputs
-    def _maybe_initialize_input_ids_for_generation(
-            self,
-            inputs: Optional[torch.Tensor] = None,
-            bos_token_id: Optional[torch.Tensor] = None,
-            model_kwargs: Optional[dict[str, torch.Tensor]] = None,
-    ) -> torch.LongTensor:
-        """
-        Patching hf bug that creates wrong cache length if only inputs_embeds are passed to the model
-        """
-        input_ids = super()._maybe_initialize_input_ids_for_generation(
-            inputs=inputs, bos_token_id=bos_token_id, model_kwargs=model_kwargs)
-        if (
-                "inputs_embeds" in model_kwargs
-                and input_ids is not None
-                and input_ids.shape[1] == 0
-        ):
-            batch_size, input_sequence_length = model_kwargs["inputs_embeds"].shape[:2]
-            input_ids = torch.zeros((batch_size, input_sequence_length), dtype=torch.long, device=self.device)
-        return input_ids
-    def generate(
-            self,
-            inputs: Optional[torch.Tensor] = None,
-            *args,
-            **kwargs,
-    ) -> Union[GenerateOutput, torch.LongTensor]:
-        """
-        Patching hf bug that creates wrong cache length if only inputs_embeds are passed to the model
-        """
-        only_passed_inputs_embeds = (
-                "inputs_embeds" in kwargs and
-                "input_ids" not in kwargs and
-                inputs is None
-        )
-        if only_passed_inputs_embeds:
-            input_sequence_length = kwargs["inputs_embeds"].shape[1]
-        generation_output = super().generate(inputs=inputs, *args, **kwargs)
-        if only_passed_inputs_embeds and isinstance(generation_output, torch.Tensor):
-            generation_output = generation_output[:, input_sequence_length:]
-        return generation_output
 @add_start_docstrings(
     """

 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 from transformers import GenerationConfig
+from transformers.generation.utils import NEED_SETUP_CACHE_CLASSES_MAPPING
 from transformers.modeling_utils import PreTrainedModel
 from transformers.utils import (
     add_start_docstrings,
             **kwargs,
     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
         bsz, q_len, _ = hidden_states.size()
         if self.config.pretraining_tp > 1:
             key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
             query_slices = self.q_proj.weight.split(
                 "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` "
                 "make sure to use `sdpa` in the mean time, and open an issue at https://github.com/huggingface/transformers"
             )
         output_attentions = False
         bsz, q_len, _ = hidden_states.size()
                 module.weight.data[module.padding_idx].zero_()
     def _prepare_generation_config(
+            self, generation_config: Optional[GenerationConfig], **kwargs: dict
     ) -> tuple[GenerationConfig, dict]:
         # DeciLM-specific code
+        generation_config, model_kwargs = super()._prepare_generation_config(generation_config, **kwargs)
         generation_config.cache_implementation = "variable"
         NEED_SETUP_CACHE_CLASSES_MAPPING["variable"] = VariableCache
         return generation_config, model_kwargs
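
The `_prepare_generation_config` override in this hunk follows a delegate-then-adjust pattern: call the parent, force the cache implementation to "variable", and register the cache class under that key. A toy sketch of the pattern follows. It uses stand-in classes rather than the real transformers ones, so `ToyGenerationConfig`, `ToyBaseModel`, `ToyVariableCache`, and `TOY_CACHE_REGISTRY` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for NEED_SETUP_CACHE_CLASSES_MAPPING.
TOY_CACHE_REGISTRY: dict = {}


@dataclass
class ToyGenerationConfig:
    """Hypothetical stand-in for GenerationConfig."""
    cache_implementation: Optional[str] = None


class ToyVariableCache:
    """Hypothetical stand-in for VariableCache."""


class ToyBaseModel:
    """Hypothetical stand-in for the parent class being overridden."""

    def _prepare_generation_config(self, generation_config, **kwargs):
        config = generation_config or ToyGenerationConfig()
        return config, dict(kwargs)


class ToyDeciLM(ToyBaseModel):
    def _prepare_generation_config(self, generation_config, **kwargs):
        # Delegate first, then apply the model-specific adjustments.
        generation_config, model_kwargs = super()._prepare_generation_config(
            generation_config, **kwargs
        )
        generation_config.cache_implementation = "variable"
        TOY_CACHE_REGISTRY["variable"] = ToyVariableCache
        return generation_config, model_kwargs


config, model_kwargs = ToyDeciLM()._prepare_generation_config(None, max_new_tokens=8)
print(config.cache_implementation)  # variable
```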
         return causal_mask
+class DeciLMForCausalLM(DeciLMPreTrainedModel):
     _tied_weights_keys = ["lm_head.weight"]
     def __init__(self, config):
         )
         return model_inputs
 @add_start_docstrings(
     """

transformers_4_44_2__modeling_flash_attention_utils_backward_compat.py CHANGED

@@ -15,18 +15,12 @@
 import inspect
 import os
-from typing import Optional, Tuple, Union
 import torch
 import torch.nn.functional as F
-from functools import lru_cache
-import importlib.metadata
-import importlib.util
-from packaging import version
-from transformers.utils import is_flash_attn_2_available
 if is_flash_attn_2_available():
@@ -38,46 +32,6 @@ if is_flash_attn_2_available():
         raise "Unable to import flash_attn"
-def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[Tuple[bool, str], bool]:
-    # Check if the package spec exists and grab its version to avoid importing a local directory
-    package_exists = importlib.util.find_spec(pkg_name) is not None
-    package_version = "N/A"
-    if package_exists:
-        try:
-            # Primary method to get the package version
-            package_version = importlib.metadata.version(pkg_name)
-        except importlib.metadata.PackageNotFoundError:
-            # Fallback method: Only for "torch" and versions containing "dev"
-            if pkg_name == "torch":
-                try:
-                    package = importlib.import_module(pkg_name)
-                    temp_version = getattr(package, "__version__", "N/A")
-                    # Check if the version contains "dev"
-                    if "dev" in temp_version:
-                        package_version = temp_version
-                        package_exists = True
-                    else:
-                        package_exists = False
-                except ImportError:
-                    # If the package can't be imported, it's not available
-                    package_exists = False
-            else:
-                # For packages other than "torch", don't attempt the fallback and set as not available
-                package_exists = False
-    if return_version:
-        return package_exists, package_version
-    else:
-        return package_exists
-@lru_cache()
-def is_flash_attn_greater_or_equal(library_version: str):
-    if not _is_package_available("flash_attn"):
-        return False
-    return version.parse(importlib.metadata.version("flash_attn")) >= version.parse(library_version)
 def _get_unpad_data(attention_mask: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, int]:
     """
     Retrieves indexing data required to repad unpadded (ragged) tensors.

 import inspect
 import os
+from typing import Optional, Tuple
 import torch
 import torch.nn.functional as F
+from transformers.utils import is_flash_attn_2_available, is_flash_attn_greater_or_equal
 if is_flash_attn_2_available():
         raise "Unable to import flash_attn"
 def _get_unpad_data(attention_mask: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, int]:
     """
     Retrieves indexing data required to repad unpadded (ragged) tensors.

variable_cache.py CHANGED

@@ -32,21 +32,17 @@ class VariableCache(Cache_4_44_2, Cache):
     The cache of each layer is allocated to the same gpu as the layer itself.
     """
-    def __init__(
-            self,
-            *,  # key-word only, no positional args allowed to avoid mix-ups with newer transformers versions
-            config: DeciLMConfig,
-            batch_size: int = None,
-            max_cache_len: int = None,
-            dtype: torch.dtype = torch.float32,
-            max_batch_size: Optional[int] = None,
-            **kwargs: Any,
-    ) -> None:
         Cache_4_44_2.__init__(self)
-        self.config = deepcopy(config)
-        self.max_batch_size = batch_size or max_batch_size
-        self.batch_size = self.max_batch_size
         self.max_cache_len = config.max_position_embeddings if max_cache_len is None else max_cache_len
         self.dtype = dtype
@@ -83,7 +79,6 @@ class VariableCache(Cache_4_44_2, Cache):
         if attention_config.no_op or attention_config.replace_with_linear:
             return None
         config = deepcopy(self.config)
-        config.num_hidden_layers = 1
         config.num_key_value_heads = self.config.num_attention_heads // attention_config.n_heads_in_group
         return StaticCache(config, self.max_batch_size, self.max_cache_len, device, self.dtype)

     The cache of each layer is allocated to the same gpu as the layer itself.
     """
+    def __init__(self,
+                 config: DeciLMConfig,
+                 max_batch_size: int,
+                 max_cache_len: int | None,
+                 device: torch.device | str | None = None,
+                 dtype: torch.dtype | None = None,
+                 ):
         Cache_4_44_2.__init__(self)
+        self.config = config
+        self.max_batch_size = max_batch_size
         self.max_cache_len = config.max_position_embeddings if max_cache_len is None else max_cache_len
         self.dtype = dtype
         if attention_config.no_op or attention_config.replace_with_linear:
             return None
         config = deepcopy(self.config)
         config.num_key_value_heads = self.config.num_attention_heads // attention_config.n_heads_in_group
         return StaticCache(config, self.max_batch_size, self.max_cache_len, device, self.dtype)
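
The per-layer logic above can be illustrated without torch. Each layer either gets no cache (when its attention is a no-op or replaced by a linear layer) or a cache whose key/value head count is `num_attention_heads // n_heads_in_group`. The sketch below mirrors only that arithmetic; `ToyAttentionConfig` and `kv_heads_per_layer` are hypothetical names, and the head counts are made-up numbers, not taken from the model's config.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyAttentionConfig:
    """Hypothetical stand-in for a per-layer attention config."""
    no_op: bool = False
    replace_with_linear: bool = False
    n_heads_in_group: Optional[int] = None


def kv_heads_per_layer(
    num_attention_heads: int, layers: List[ToyAttentionConfig]
) -> List[Optional[int]]:
    """Return the KV head count per layer, or None where no cache is needed."""
    result = []
    for attention_config in layers:
        if attention_config.no_op or attention_config.replace_with_linear:
            result.append(None)  # this layer has no attention cache
        else:
            result.append(num_attention_heads // attention_config.n_heads_in_group)
    return result


# Illustrative values only.
layers = [
    ToyAttentionConfig(n_heads_in_group=8),
    ToyAttentionConfig(no_op=True),
    ToyAttentionConfig(n_heads_in_group=16),
    ToyAttentionConfig(replace_with_linear=True),
]
print(kv_heads_per_layer(64, layers))  # [8, None, 4, None]
```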