FireRedVAD Stream-VAD (ONNX)

ONNX export of the Stream-VAD model from FireRedTeam/FireRedVAD for real-time streaming voice activity detection.

Model Details

| Property | Value |
|---|---|
| Architecture | DFSMN (Deep Feedforward Sequential Memory Network) |
| FSMN Blocks | 8 |
| Projection Dim | 128 |
| Initial Cache Length | 10 |
| Input Features | 80-dim log-mel filterbank (fbank) |
| Frame Length | 25 ms (400 samples @ 16 kHz) |
| Frame Shift | 10 ms (160 samples @ 16 kHz) |
| Output | Speech probability per frame (sigmoid, 0–1) |
| ONNX Opset | 17 |
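The frame parameters above determine how many frames a chunk of audio yields. As a quick sketch (a hypothetical helper, assuming no padding at the chunk boundary):

```python
def num_frames(n_samples, frame_len=400, frame_shift=160):
    """Frames obtainable from n_samples of 16 kHz audio, without padding."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // frame_shift

print(num_frames(1600))   # a 100 ms chunk at 16 kHz -> 8 frames
```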

Files

  • firered_vad.onnx — The DFSMN model (2.3 MB)
  • cmvn.json — CMVN normalization parameters (80-dim means + inverse stddevs)
  • model_meta.json — Architecture metadata for runtime initialization

Input/Output Specification

Inputs

| Name | Shape | Description |
|---|---|---|
| `feat` | `[1, num_frames, 80]` | CMVN-normalized fbank features |
| `cache_0`–`cache_7` | `[1, 128, cache_len]` | Per-block FSMN streaming caches |

Outputs

| Name | Shape | Description |
|---|---|---|
| `probs` | `[1, num_frames, 1]` | Speech probability per frame |
| `new_cache_0`–`new_cache_7` | `[1, 128, new_cache_len]` | Updated caches (carry to next call) |
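Before the first inference call it can help to sanity-check the feed dict against the tables above. A small illustrative helper (not part of this repo):

```python
import numpy as np

def build_inputs(feat, caches, num_blocks=8, proj_dim=128):
    """Assemble the feed dict for one streaming call, with shape checks.

    Input names follow the I/O tables of this model card.
    """
    assert feat.ndim == 3 and feat.shape[0] == 1 and feat.shape[2] == 80, feat.shape
    for i in range(num_blocks):
        cache = caches[f"cache_{i}"]
        assert cache.shape[:2] == (1, proj_dim), cache.shape
    return {"feat": feat.astype(np.float32), **caches}

caches = {f"cache_{i}": np.zeros((1, 128, 10), dtype=np.float32) for i in range(8)}
feat = np.zeros((1, 4, 80), dtype=np.float32)
inputs = build_inputs(feat, caches)
```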

Streaming Usage

Initialize caches as zeros with cache_len=10:

```python
import numpy as np

caches = {f"cache_{i}": np.zeros((1, 128, 10), dtype=np.float32) for i in range(8)}
```

For each audio chunk:

  1. Extract 80-dim fbank features (25ms window, 10ms shift, 16kHz)
  2. Apply CMVN normalization: (feature - mean) * inv_stddev
  3. Run inference with feat + current caches
  4. Carry new_cache_* outputs to the next call
  5. Speech probability > 0.5 indicates speech
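Step 2 can be sketched in NumPy. The means and inverse stddevs are the 80-dim vectors from cmvn.json; the function name and signature here are illustrative:

```python
import numpy as np

def apply_cmvn(feat, means, inv_stddevs):
    """Per-dimension CMVN: (feature - mean) * inv_stddev."""
    return ((feat - means) * inv_stddevs).astype(np.float32)

# Synthetic check with constant features:
feat = np.array([[[1.0] * 80, [3.0] * 80]], dtype=np.float32)  # [1, 2, 80]
means = np.full(80, 2.0)
inv_stddevs = np.full(80, 0.5)
out = apply_cmvn(feat, means, inv_stddevs)
```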

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("firered_vad.onnx")
caches = {f"cache_{i}": np.zeros((1, 128, 10), dtype=np.float32) for i in range(8)}

# For each chunk of audio:
feat = extract_fbank(audio_chunk)  # user-provided frontend -> [1, num_frames, 80]
feat = apply_cmvn(feat, cmvn)      # normalize with the stats from cmvn.json
inputs = {"feat": feat, **caches}
outputs = sess.run(None, inputs)

probs = outputs[0]                 # [1, num_frames, 1] speech probabilities
for i in range(8):
    caches[f"cache_{i}"] = outputs[i + 1]   # carry updated caches to the next call
```
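The `extract_fbank` frontend above is user-provided. A minimal pure-NumPy sketch follows; note it is simplified and not guaranteed to match the Kaldi-style fbank the model was trained with (no dithering, pre-emphasis, or Povey window), so a Kaldi-compatible frontend such as `torchaudio.compliance.kaldi.fbank(..., num_mel_bins=80)` is closer in practice:

```python
import numpy as np

def mel_filterbank(n_mels=80, n_fft=512, sr=16000, fmin=0.0, fmax=8000.0):
    # Triangular mel filters over the rfft bins.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def extract_fbank(audio, frame_len=400, frame_shift=160, n_fft=512, n_mels=80):
    """80-dim log-mel features, 25 ms window / 10 ms shift -> [1, num_frames, 80]."""
    if len(audio) < frame_len:
        return np.zeros((1, 0, n_mels), dtype=np.float32)
    n = 1 + (len(audio) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([audio[i * frame_shift : i * frame_shift + frame_len] * window
                       for i in range(n)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2    # [num_frames, 257]
    mel = power @ mel_filterbank(n_mels, n_fft).T        # [num_frames, 80]
    return np.log(mel + 1e-10)[None].astype(np.float32)  # [1, num_frames, 80]

# One second of noise at 16 kHz -> 98 frames:
feats = extract_fbank(np.random.default_rng(0).standard_normal(16000))
```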

Export Script

This model was exported using scripts/export_firered_vad.py from the second-brain project. The script:

  1. Downloads the official PyTorch weights from FireRedTeam/FireRedVAD
  2. Wraps the model with flattened cache I/O for ONNX compatibility
  3. Exports with dynamic axes for variable-length streaming input
  4. Converts Kaldi CMVN ark → JSON
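Step 4's conversion can be sketched as follows, assuming Kaldi's global CMVN stats layout: a 2 × (dim+1) matrix whose first row holds per-dimension sums with the frame count in the last column, and whose second row holds sums of squares:

```python
import numpy as np

def cmvn_stats_to_fields(stats, var_floor=1e-8):
    """Kaldi global CMVN stats (2 x (dim+1)) -> (means, inverse stddevs)."""
    count = stats[0, -1]
    sums = stats[0, :-1]
    sq_sums = stats[1, :-1]
    means = sums / count
    var = np.maximum(sq_sums / count - means ** 2, var_floor)
    return means, 1.0 / np.sqrt(var)

# Synthetic check: 4 frames of a 2-dim feature.
frames = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
stats = np.zeros((2, 3))
stats[0, :2] = frames.sum(axis=0)
stats[0, 2] = len(frames)
stats[1, :2] = (frames ** 2).sum(axis=0)
means, inv_stddevs = cmvn_stats_to_fields(stats)
```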

License

Apache 2.0, following the original FireRedTeam/FireRedVAD license.
