patrickvonplaten committed
Commit 96ca933 (verified)
0 Parent(s):

Super-squash branch 'main' using huggingface_hub

Files changed (5)
  1. .gitattributes +36 -0
  2. README.md +246 -0
  3. consolidated.safetensors +3 -0
  4. params.json +34 -0
  5. tekken.json +3 -0
.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tekken.json filter=lfs diff=lfs merge=lfs -text
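These patterns route matching files through Git LFS. For reference, an entry like the last one is what the standard `git lfs track` command appends:

```sh
git lfs track "tekken.json"   # appends: tekken.json filter=lfs diff=lfs merge=lfs -text
```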
README.md ADDED
@@ -0,0 +1,246 @@
+ ---
+ language:
+ - en
+ - fr
+ - de
+ - es
+ - it
+ - pt
+ - nl
+ - hi
+ license: apache-2.0
+ library_name: vllm
+ inference: false
+ extra_gated_description: >-
+   If you want to learn more about how we process your personal data, please read
+   our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
+ pipeline_tag: audio-text-to-text
+ ---
+ # Voxtral Mini 1.0 (3B) - 2507
+
+ Voxtral Mini is an enhancement of [Ministral 3B](https://mistral.ai/news/ministraux), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.
+
+ Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral-2507).
+
+ ## Key Features
+
+ Voxtral builds upon Ministral-3B with powerful audio understanding capabilities:
+ - **Dedicated transcription mode**: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
+ - **Long-form context**: With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding
+ - **Built-in Q&A and summarization**: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
+ - **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
+ - **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
+ - **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Ministral-3B
+
+ ## Benchmark Results
+
+ ### Audio
+
+ Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)
+
+ ## Usage
+
+ The model can be used with the following frameworks:
+ - [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
+
+ **Recommended settings**:
+
+ - Use `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. Audio Understanding*) and `temperature=0.0` for transcription
+ - Multiple audios per message and multiple user turns with audio are supported
+ - Function calling is supported (see the sketch after this list)
+ - System prompts are not yet supported
+
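+ Function calling goes through the standard OpenAI-compatible `tools` parameter. The following is a minimal client-side sketch, not an official example: the `get_weather` tool, the server URL, and any server-side tool-parsing configuration are assumptions.
+
+ ```py
+ from openai import OpenAI
+
+ client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
+ model = client.models.list().data[0].id
+
+ # Hypothetical tool definition; replace with your own backend function.
+ tools = [
+     {
+         "type": "function",
+         "function": {
+             "name": "get_weather",
+             "description": "Get the current weather in a given city.",
+             "parameters": {
+                 "type": "object",
+                 "properties": {"city": {"type": "string"}},
+                 "required": ["city"],
+             },
+         },
+     }
+ ]
+
+ # A spoken request (built as an AudioChunk, as in the Audio Instruct
+ # example below) can be used in place of this plain text message.
+ response = client.chat.completions.create(
+     model=model,
+     messages=[{"role": "user", "content": "What is the weather like in Barcelona?"}],
+     tools=tools,
+     temperature=0.2,
+     top_p=0.95,
+ )
+ print(response.choices[0].message.tool_calls)
+ ```
+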
+ ### vLLM (recommended)
+
+ We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).
+
+ #### Installation
+
+ Make sure to install vLLM from "main" (the nightly pre-release wheels):
+
+ ```sh
+ pip install -U vllm\[audio\] \
+ --pre \
+ --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+
+ Doing so should automatically install [`mistral_common >= 1.8.0`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.0).
+
+ To check:
+ ```sh
+ python -c "import mistral_common; print(mistral_common.__version__)"
+ ```
+
+ #### Offline
+
+ You can test that your vLLM setup works as expected by cloning the vLLM repo:
+
+ ```sh
+ git clone https://github.com/vllm-project/vllm && cd vllm
+ ```
+
+ and then running:
+
+ ```sh
+ python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
+ ```
+
+ #### Serve
+
+ We recommend that you use Voxtral-Mini-3B-2507 in a server/client setting.
+
+ 1. Spin up a server:
+
+ ```sh
+ vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral
+ ```
+
+ **Note:** Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.
+
+ 2. To query the server you can use a simple Python snippet. See the following examples.
+
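+ As a quick sanity check before running the examples, you can list the served models (assuming the default host and port used by `vllm serve`):
+
+ ```sh
+ curl http://localhost:8000/v1/models
+ ```
+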
+
+ #### Audio Instruct
+
+ Leverage the audio capabilities of Voxtral-Mini-3B-2507 to chat.
+
+ Make sure that your client has `mistral-common` with audio installed:
+
+ ```sh
+ pip install --upgrade mistral_common\[audio\]
+ ```
+
+ <details>
+ <summary>Python snippet</summary>
+
+ ```py
+ from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
+ from mistral_common.audio import Audio
+ from huggingface_hub import hf_hub_download
+
+ from openai import OpenAI
+
+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"
+
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
+
+ models = client.models.list()
+ model = models.data[0].id
+
+ obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
+ bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
+
+ def file_to_chunk(file: str) -> AudioChunk:
+     audio = Audio.from_file(file, strict=False)
+     return AudioChunk.from_audio(audio)
+
+ text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
+ user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()
+
+ print(30 * "=" + "USER 1" + 30 * "=")
+ print(text_chunk.text)
+ print("\n\n")
+
+ response = client.chat.completions.create(
+     model=model,
+     messages=[user_msg],
+     temperature=0.2,
+     top_p=0.95,
+ )
+ content = response.choices[0].message.content
+
+ print(30 * "=" + "BOT 1" + 30 * "=")
+ print(content)
+ print("\n\n")
+ # The model could give the following answer:
+ # ```L'orateur le plus inspirant est le président.
+ # Il est plus inspirant parce qu'il parle de ses expériences personnelles
+ # et de son optimisme pour l'avenir du pays.
+ # Il est différent de l'autre orateur car il ne parle pas de la météo,
+ # mais plutôt de ses interactions avec les gens et de son rôle en tant que président.```
+
+ messages = [
+     user_msg,
+     AssistantMessage(content=content).to_openai(),
+     UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
+ ]
+ print(30 * "=" + "USER 2" + 30 * "=")
+ print(messages[-1]["content"])
+ print("\n\n")
+
+ response = client.chat.completions.create(
+     model=model,
+     messages=messages,
+     temperature=0.2,
+     top_p=0.95,
+ )
+ content = response.choices[0].message.content
+ print(30 * "=" + "BOT 2" + 30 * "=")
+ print(content)
+ ```
+ </details>
+
+ #### Transcription
+
+ Voxtral-Mini-3B-2507 has powerful transcription capabilities!
+
+ Make sure that your client has `mistral-common` with audio installed:
+
+ ```sh
+ pip install --upgrade mistral_common\[audio\]
+ ```
+
+ <details>
+ <summary>Python snippet</summary>
+
+ ```python
+ from mistral_common.protocol.transcription.request import TranscriptionRequest
+ from mistral_common.protocol.instruct.messages import RawAudio
+ from mistral_common.audio import Audio
+ from huggingface_hub import hf_hub_download
+
+ from openai import OpenAI
+
+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"
+
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
+
+ models = client.models.list()
+ model = models.data[0].id
+
+ obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
+ audio = Audio.from_file(obama_file, strict=False)
+
+ audio = RawAudio.from_audio(audio)
+ req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
+
+ response = client.audio.transcriptions.create(**req)
+ print(response)
+ ```
+ </details>
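+
+ Because the server exposes the standard OpenAI-compatible transcription route, a plain file upload should also work without building the request via `mistral_common`. A minimal sketch, assuming a local audio file at a hypothetical path:
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
+
+ # Hypothetical local path; any short speech sample works.
+ with open("obama.mp3", "rb") as f:
+     response = client.audio.transcriptions.create(
+         model="mistralai/Voxtral-Mini-3B-2507",
+         file=f,
+         temperature=0.0,
+     )
+ print(response.text)
+ ```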
consolidated.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec4dda59bfd9956e71347530d62168cee564c2caf72986c8727355758691eaa7
+ size 9348806528
params.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "dim": 3072,
+   "n_layers": 30,
+   "head_dim": 128,
+   "hidden_dim": 8192,
+   "n_heads": 32,
+   "n_kv_heads": 8,
+   "rope_theta": 100000000.0,
+   "norm_eps": 1e-05,
+   "vocab_size": 131072,
+   "max_position_embeddings": 32768,
+   "multimodal": {
+     "whisper_model_args": {
+       "encoder_args": {
+         "dim": 1280,
+         "n_layers": 32,
+         "head_dim": 64,
+         "hidden_dim": 5120,
+         "n_heads": 20,
+         "vocab_size": 51866,
+         "max_source_positions": 1500,
+         "audio_encoding_args": {
+           "sampling_rate": 16000,
+           "num_mel_bins": 128,
+           "hop_length": 160,
+           "window_size": 400
+         }
+       },
+       "downsample_args": {
+         "downsample_factor": 4
+       }
+     }
+   }
+ }
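For intuition, the audio token budget implied by these parameters can be estimated by hand. This is a sketch, not an official spec: the stride-2 convolution in the Whisper-style encoder is an assumption inferred from `max_source_positions` of 1500 covering a 30-second chunk.

```py
# Back-of-the-envelope audio token budget from params.json.
sampling_rate = 16_000
hop_length = 160
downsample_factor = 4
context = 32_768  # max_position_embeddings

mel_frames_per_s = sampling_rate / hop_length            # 100.0 mel frames/s
encoder_embeddings_per_s = 1500 / 30                     # 50.0/s (assumed conv stride 2)
audio_tokens_per_s = encoder_embeddings_per_s / downsample_factor  # 12.5/s

max_audio_minutes = context / audio_tokens_per_s / 60
print(f"~{audio_tokens_per_s} audio tokens/s, ~{max_audio_minutes:.0f} min fit in context")
# ~44 min of pure audio, which leaves headroom for text tokens and is
# consistent with the README's "40 minutes for understanding".
```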
tekken.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4aaf3836c2a5332f029ce85a7a62255c966f47b6797ef81dedd0ade9c862e4a8
+ size 14894206