Instructions to use Efficient-Large-Model/NVILA-15B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Efficient-Large-Model/NVILA-15B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Efficient-Large-Model/NVILA-15B")
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Efficient-Large-Model/NVILA-15B", dtype="auto")
```
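As a quick sanity check, the pipeline can be called directly on a prompt. Note that NVILA is a vision-language model, so loading it through the generic classes may require `trust_remote_code=True`; the snippet below is a sketch under that assumption, not the model's official API:

```python
# Hedged sketch: generate text with the pipeline loaded above.
# Assumption: the checkpoint ships custom model code, so the generic
# text-generation pipeline may need trust_remote_code=True.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Efficient-Large-Model/NVILA-15B",
    trust_remote_code=True,  # assumption: custom code lives in the repo
)
out = pipe("Once upon a time,", max_new_tokens=64)
print(out[0]["generated_text"])
```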
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Efficient-Large-Model/NVILA-15B with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Efficient-Large-Model/NVILA-15B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Efficient-Large-Model/NVILA-15B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker

```sh
docker model run hf.co/Efficient-Large-Model/NVILA-15B
```
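Because the pip-served vLLM endpoint speaks the OpenAI-compatible API, the curl call above can also be made from Python. A minimal sketch with the `openai` client, assuming the default port shown above:

```python
# Query the vLLM server through its OpenAI-compatible API.
# Assumes the server started above is listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Efficient-Large-Model/NVILA-15B",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```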
- SGLang
How to use Efficient-Large-Model/NVILA-15B with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Efficient-Large-Model/NVILA-15B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Efficient-Large-Model/NVILA-15B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images

```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Efficient-Large-Model/NVILA-15B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Efficient-Large-Model/NVILA-15B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
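Both launch paths expose the same OpenAI-compatible endpoint, so the curl request can be replicated from Python. A minimal sketch with the `openai` client, assuming the default port shown above:

```python
# Query the SGLang server through its OpenAI-compatible API.
# Assumes the server started above is listening on localhost:30000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Efficient-Large-Model/NVILA-15B",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```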
- Docker Model Runner
How to use Efficient-Large-Model/NVILA-15B with Docker Model Runner:
```sh
docker model run hf.co/Efficient-Large-Model/NVILA-15B
```
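Docker Model Runner also exposes an OpenAI-compatible endpoint once host TCP access is enabled. The base URL below follows Docker's documented default and is an assumption; verify it against the Docker Model Runner docs for your setup:

```python
# Hedged sketch: call Docker Model Runner's OpenAI-compatible API.
# Assumption: TCP host access is enabled and the default base URL
# http://localhost:12434/engines/v1 applies; check Docker's docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="docker")

completion = client.chat.completions.create(
    model="hf.co/Efficient-Large-Model/NVILA-15B",
    messages=[{"role": "user", "content": "Once upon a time,"}],
)
print(completion.choices[0].message.content)
```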
Ask about demo
Is there any demo available to help me use this model? For example, how to load the model, process the input, and obtain the answers. Thank you!
Hi, you can refer to our official GitHub repo. It shows how to run inference with the model: https://github.com/NVlabs/VILA?tab=readme-ov-file#inference.
Thank you for your answer. Is there a Python code demo instead of a shell script, so it can be used more flexibly?
You may refer to https://github.com/NVlabs/VILA?tab=readme-ov-file#inference
Hi, you can refer to this file: https://github.com/NVlabs/Cosmos-Nemotron/blob/main/llava/cli/infer.py. This is the Python script that vila-infer calls under the hood.
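For convenience, here is a minimal Python sketch along those lines, adapted from the interface shown in the VILA repo's README. The `generate_content` method is an assumption about the checkpoint's remote code; check the repo if the interface differs:

```python
# Hedged sketch of Python inference, adapted from the VILA repo's README.
# Assumption: the checkpoint's remote code exposes generate_content();
# verify against https://github.com/NVlabs/VILA if the interface differs.
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "Efficient-Large-Model/NVILA-15B",
    trust_remote_code=True,  # model code ships with the checkpoint
    device_map="auto",
)

image = Image.open("demo.jpg")  # any local test image
response = model.generate_content(["Describe this image.", image])
print(response)
```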