Surprisingly Good for 3B — But Not Perfect

#15
by Hussain2050 - opened

Really impressed with Voxtral Mini 3B; it's kind of wild how much it can handle with just 3B parameters. It does a solid job overall, but I did notice the transcription starts to slip a bit when the audio is noisy or mixes multiple languages.
Whisper Large v3 still feels a bit more robust in those tricky cases.

That said, for running locally or on smaller devices, this is a huge step forward. Super excited to see where this goes next.

I'm sure it can be finetuned for noisy audio, too.

But it takes up almost 40GB of VRAM for me, so I'm confused when you say "for running ... on smaller devices." How much does it take up for you?

I’m running the Hugging Face version on a 12GB GPU with no problem; VRAM sits around 10GB during normal use. When I said “smaller devices,” I meant stuff like that.
If you’re hitting 40GB, maybe it’s because of really long audio or large context windows? Those can blow up memory usage fast. On my setup with default settings, it runs solidly without anywhere near that much VRAM.
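For reference, this is roughly how I'm loading it (a minimal sketch; the model id and the Voxtral classes in a recent Transformers release are my assumptions, so adjust for your setup):

```python
import torch
from transformers import AutoProcessor, VoxtralForConditionalGeneration

# Assumed model id; bfloat16 keeps the 3B weights around 6-7GB,
# which leaves headroom for activations and KV cache on a 12GB card.
model_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```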

For some reason, it was just vLLM.

I loaded it under Transformers, and with quants I can get <5GB VRAM, which is great for consumer GPUs.
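For anyone curious, "with quants" here means something along these lines (a sketch using bitsandbytes 4-bit through Transformers; the model id is an assumption and you need `bitsandbytes` installed):

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, VoxtralForConditionalGeneration

model_id = "mistralai/Voxtral-Mini-3B-2507"  # assumed model id

# NF4 4-bit quantization stores roughly half a byte per weight,
# so the 3B parameters land well under 5GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```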

Thanks!

Glad that worked out! Yeah, vLLM is super fast but can be heavy on memory because of how it handles batching and KV cache.
Transformers + quantized models are much more GPU-friendly for local use — I’ve had great results that way too.
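If the vLLM memory spike bites again, the usual knobs are the pre-allocated cache fraction and the context length (an illustrative sketch only; the model id is an assumption and Voxtral may need extra vLLM flags I'm not showing here):

```python
from vllm import LLM

# gpu_memory_utilization caps how much VRAM vLLM pre-allocates (mostly KV cache),
# and max_model_len bounds the context, which shrinks that cache further.
llm = LLM(
    model="mistralai/Voxtral-Mini-3B-2507",  # assumed model id
    gpu_memory_utilization=0.6,
    max_model_len=8192,
)
```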

I'm hoping we can get a GGUF soon.

There's a text-only quant on the Hub, but obviously that defeats the purpose of the model. I think I'll try merging it with the Whisper audio encoders, alongside some monkey-patching for llama.cpp inference.

Yeah, the GGUF tooling is mostly built around decoder-only models like LLaMA, so we can’t convert the full Voxtral with its audio encoder yet. The text-only quant works, but it’s not useful for transcription. Your idea with Whisper encoders sounds interesting though (let me know if it works!)

Yes, the encoders will have to stay at full precision, but quantizing the decoder will free up a good amount of memory.
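One way to express that split with bitsandbytes is to exclude the encoder modules from quantization (sketch only; "audio_tower" is my guess at the encoder's module name, so print the model to check the real names first):

```python
import torch
from transformers import BitsAndBytesConfig, VoxtralForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Hypothetical module name: keep the audio encoder in full precision
    # while the text decoder weights get quantized to 4-bit.
    llm_int8_skip_modules=["audio_tower"],
)

model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507",  # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
)
```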

Even with a quantized model, I'm thinking that transcription accuracy can be maintained through prompting and constrained decoding. I'll put it into a Python library if everything works out.
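By constrained decoding I mean something like a logits processor that masks out tokens outside a whitelist during generation (a generic Transformers sketch, not Voxtral-specific; the allowed-token list is a placeholder you'd build from a lexicon or hotword list):

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class AllowedTokensProcessor(LogitsProcessor):
    """Keep generation inside a whitelist of token ids."""

    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(set(allowed_token_ids)))

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed.to(scores.device)] = 0.0  # block everything not whitelisted
        return scores + mask

# Placeholder usage:
# processors = LogitsProcessorList([AllowedTokensProcessor(allowed_token_ids)])
# output = model.generate(**inputs, logits_processor=processors, do_sample=False)
```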

You really know your stuff!
Keeping the encoders at full precision while quantizing the decoder is a smart balance. Lots of people will be grateful for your hard work on the library!
