---
language:
- en
- fr
- de
- es
- it
- pt
- nl
- hi
license: apache-2.0
library_name: vllm
inference: false
base_model:
- mistralai/Mistral-Small-24B-Base-2501
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our Privacy Policy.
pipeline_tag: audio-text-to-text
---

# Voxtral Small 24B - 2507 (Transformers Edition)

Voxtral Small is an enhancement of [Mistral Small 3](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral).

## Key Features

Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.

- **Dedicated transcription mode**: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
- **Long-form context**: With a 32k token context length, Voxtral handles audio of up to 30 minutes for transcription, or 40 minutes for understanding
- **Built-in Q&A and summarization**: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
- **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
- **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
- **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1

## Benchmark Results

### Audio

Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)

### Text

![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/uDg3hKDwJowsNuj-yyt2T.png)

## Usage

The model can be used with the following frameworks:
- [`Transformers` 🤗](https://github.com/huggingface/transformers): See [here](#transformers-🤗)

**Notes**:

- Use `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. Audio Understanding*) and `temperature=0.0` for transcription
- Multiple audio files per message and multiple user turns with audio are supported
- Function calling is supported
- System prompts are not yet supported

### Transformers 🤗

Voxtral is supported in Transformers natively!

Install Transformers from source:
```bash
pip install git+https://github.com/huggingface/transformers
```

#### Audio Instruct
➡️ multi-audio + text instruction

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-tranbsformers"

# Load the processor and the model in bfloat16 on the target device
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# One user turn with two audio clips followed by a text instruction
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
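The message layout used in the example above (a list of `audio` entries followed by an optional `text` entry) can be generated programmatically. The following is a minimal sketch; `build_user_turn` and `base` are hypothetical names introduced here for illustration and are not part of Transformers. It only builds plain Python dictionaries and does not load the model.

```python
from typing import Dict, List, Optional


def build_user_turn(audio_paths: List[str], text: Optional[str] = None) -> Dict:
    """Build one user message: audio entries first, then an optional text entry.

    Hypothetical helper; it mirrors the dictionary layout shown in the example above.
    """
    content = [{"type": "audio", "path": path} for path in audio_paths]
    if text:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# Base URL of the sample audio files used throughout this card
base = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/"

conversation = [
    build_user_turn(
        [base + "mary_had_lamb.mp3", base + "winning_call.mp3"],
        "What sport and what nursery rhyme are referenced?",
    )
]

# `conversation` can then be passed to processor.apply_chat_template(conversation)
# exactly as in the example above.
```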
➡️ multi-turn

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-tranbsformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# A previous user/assistant exchange followed by a new user turn with audio
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ text only

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-tranbsformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# A single user turn containing text only (no audio)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ audio only

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-tranbsformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# A single user turn containing audio only (no text instruction)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ batched inference

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-tranbsformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# A batch of two independent conversations, processed in one generate call
conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```
#### Transcription
➡️ transcribe

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Small-24B-2507-tranbsformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# Build a transcription request for an English audio file
inputs = processor.apply_transcrition_request(
    language="en",
    audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
    model_id=repo_id,
)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```
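The examples in this card call `model.generate` with only `max_new_tokens`, while the Notes under Usage recommend `temperature=0.2` and `top_p=0.95` for chat completion and `temperature=0.0` for transcription. One way to apply those settings is sketched below. `generation_kwargs` is a hypothetical helper, not part of Transformers, and mapping `temperature=0.0` to greedy decoding (`do_sample=False`) is an assumption about the intended behavior.

```python
from typing import Any, Dict


def generation_kwargs(task: str, max_new_tokens: int = 500) -> Dict[str, Any]:
    """Return keyword arguments for `model.generate` for the given task.

    Hypothetical helper based on the sampling settings listed in the Notes.
    """
    if task == "chat":
        # Sampling with the recommended temperature and nucleus threshold
        return {
            "do_sample": True,
            "temperature": 0.2,
            "top_p": 0.95,
            "max_new_tokens": max_new_tokens,
        }
    if task == "transcription":
        # Assumption: temperature=0.0 is expressed as greedy decoding
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    raise ValueError(f"Unknown task: {task!r}")


chat_kwargs = generation_kwargs("chat")
transcription_kwargs = generation_kwargs("transcription")
```

With the model and inputs prepared as in the examples above, the call would then read `outputs = model.generate(**inputs, **generation_kwargs("chat"))` for audio understanding, or `generation_kwargs("transcription")` for transcription.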