FireRedVAD Stream-VAD (ONNX)

ONNX export of the Stream-VAD model from FireRedTeam/FireRedVAD for real-time streaming voice activity detection.

Model Details

| Property | Value |
|---|---|
| Architecture | DFSMN (Deep Feedforward Sequential Memory Network) |
| FSMN Blocks | 8 |
| Projection Dim | 128 |
| Initial Cache Length | 10 |
| Input Features | 80-dim log-mel filterbank (fbank) |
| Frame Length | 25 ms (400 samples @ 16 kHz) |
| Frame Shift | 10 ms (160 samples @ 16 kHz) |
| Output | Speech probability per frame (sigmoid, 0–1) |
| ONNX Opset | 17 |
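The frame parameters above determine how many frames a chunk of audio yields. As a quick sketch (a hypothetical helper, assuming no padding at the chunk boundary):

```python
def num_frames(n_samples, frame_len=400, frame_shift=160):
    """Frames obtainable from n_samples of 16 kHz audio, without padding."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // frame_shift

print(num_frames(1600))   # a 100 ms chunk at 16 kHz -> 8 frames
```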

Files

  • firered_vad.onnx — The DFSMN model (2.3 MB)
  • cmvn.json — CMVN normalization parameters (80-dim means + inverse stddevs)
  • model_meta.json — Architecture metadata for runtime initialization

Input/Output Specification

Inputs

| Name | Shape | Description |
|---|---|---|
| `feat` | `[1, num_frames, 80]` | CMVN-normalized fbank features |
| `cache_0`–`cache_7` | `[1, 128, cache_len]` | Per-block FSMN streaming caches |

Outputs

| Name | Shape | Description |
|---|---|---|
| `probs` | `[1, num_frames, 1]` | Speech probability per frame |
| `new_cache_0`–`new_cache_7` | `[1, 128, new_cache_len]` | Updated caches (carry to next call) |
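Before the first inference call it can help to sanity-check the feed dict against the tables above. A small illustrative helper (not part of this repo):

```python
import numpy as np

def build_inputs(feat, caches, num_blocks=8, proj_dim=128):
    """Assemble the feed dict for one streaming call, with shape checks.

    Input names follow the I/O tables of this model card.
    """
    assert feat.ndim == 3 and feat.shape[0] == 1 and feat.shape[2] == 80, feat.shape
    for i in range(num_blocks):
        cache = caches[f"cache_{i}"]
        assert cache.shape[:2] == (1, proj_dim), cache.shape
    return {"feat": feat.astype(np.float32), **caches}

caches = {f"cache_{i}": np.zeros((1, 128, 10), dtype=np.float32) for i in range(8)}
feat = np.zeros((1, 4, 80), dtype=np.float32)
inputs = build_inputs(feat, caches)
```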

Streaming Usage

Initialize caches as zeros with cache_len=10:

```python
import numpy as np

caches = {f"cache_{i}": np.zeros((1, 128, 10), dtype=np.float32) for i in range(8)}
```

For each audio chunk:

  1. Extract 80-dim fbank features (25ms window, 10ms shift, 16kHz)
  2. Apply CMVN normalization: (feature - mean) * inv_stddev
  3. Run inference with feat + current caches
  4. Carry new_cache_* outputs to the next call
  5. Speech probability > 0.5 indicates speech
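Step 2 can be sketched in NumPy. The means and inverse stddevs are the 80-dim vectors from cmvn.json; the function name and signature here are illustrative:

```python
import numpy as np

def apply_cmvn(feat, means, inv_stddevs):
    """Per-dimension CMVN: (feature - mean) * inv_stddev."""
    return ((feat - means) * inv_stddevs).astype(np.float32)

# Synthetic check with constant features:
feat = np.array([[[1.0] * 80, [3.0] * 80]], dtype=np.float32)  # [1, 2, 80]
means = np.full(80, 2.0)
inv_stddevs = np.full(80, 0.5)
out = apply_cmvn(feat, means, inv_stddevs)
```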

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("firered_vad.onnx")
caches = {f"cache_{i}": np.zeros((1, 128, 10), dtype=np.float32) for i in range(8)}

# For each chunk of audio:
feat = extract_fbank(audio_chunk)  # user-provided frontend -> [1, num_frames, 80]
feat = apply_cmvn(feat, cmvn)      # normalize with the stats from cmvn.json
inputs = {"feat": feat, **caches}
outputs = sess.run(None, inputs)

probs = outputs[0]                 # [1, num_frames, 1] speech probabilities
for i in range(8):
    caches[f"cache_{i}"] = outputs[i + 1]   # carry updated caches to the next call
```
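The `extract_fbank` frontend above is user-provided. A minimal pure-NumPy sketch follows; note it is simplified and not guaranteed to match the Kaldi-style fbank the model was trained with (no dithering, pre-emphasis, or Povey window), so a Kaldi-compatible frontend such as `torchaudio.compliance.kaldi.fbank(..., num_mel_bins=80)` is closer in practice:

```python
import numpy as np

def mel_filterbank(n_mels=80, n_fft=512, sr=16000, fmin=0.0, fmax=8000.0):
    # Triangular mel filters over the rfft bins.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def extract_fbank(audio, frame_len=400, frame_shift=160, n_fft=512, n_mels=80):
    """80-dim log-mel features, 25 ms window / 10 ms shift -> [1, num_frames, 80]."""
    if len(audio) < frame_len:
        return np.zeros((1, 0, n_mels), dtype=np.float32)
    n = 1 + (len(audio) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([audio[i * frame_shift : i * frame_shift + frame_len] * window
                       for i in range(n)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2    # [num_frames, 257]
    mel = power @ mel_filterbank(n_mels, n_fft).T        # [num_frames, 80]
    return np.log(mel + 1e-10)[None].astype(np.float32)  # [1, num_frames, 80]

# One second of noise at 16 kHz -> 98 frames:
feats = extract_fbank(np.random.default_rng(0).standard_normal(16000))
```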

Export Script

This model was exported using scripts/export_firered_vad.py from the second-brain project. The script:

  1. Downloads the official PyTorch weights from FireRedTeam/FireRedVAD
  2. Wraps the model with flattened cache I/O for ONNX compatibility
  3. Exports with dynamic axes for variable-length streaming input
  4. Converts Kaldi CMVN ark → JSON
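Step 4's conversion can be sketched as follows, assuming Kaldi's global CMVN stats layout: a 2 × (dim+1) matrix whose first row holds per-dimension sums with the frame count in the last column, and whose second row holds sums of squares:

```python
import numpy as np

def cmvn_stats_to_fields(stats, var_floor=1e-8):
    """Kaldi global CMVN stats (2 x (dim+1)) -> (means, inverse stddevs)."""
    count = stats[0, -1]
    sums = stats[0, :-1]
    sq_sums = stats[1, :-1]
    means = sums / count
    var = np.maximum(sq_sums / count - means ** 2, var_floor)
    return means, 1.0 / np.sqrt(var)

# Synthetic check: 4 frames of a 2-dim feature.
frames = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
stats = np.zeros((2, 3))
stats[0, :2] = frames.sum(axis=0)
stats[0, 2] = len(frames)
stats[1, :2] = (frames ** 2).sum(axis=0)
means, inv_stddevs = cmvn_stats_to_fields(stats)
```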

License

Apache 2.0, following the original FireRedTeam/FireRedVAD license.
