TitaNet-Large - GGUF

GGUF conversion of nvidia/speakerverification_en_titanet_large (CC-BY-4.0) for native C/C++ inference via CrispASR.

Model

| Detail | Value |
| --- | --- |
| Architecture | TitaNet-Large: depthwise separable Conv1D encoder + ASP decoder |
| Parameters | 23M |
| Embedding dim | 192 (L2-normalized) |
| EER | 0.66% on VoxCeleb1-O cleaned |
| Input | 16 kHz mono PCM |
| GGUF size | ~45 MB (F16 weights, F32 batch-norm) |
| License | CC-BY-4.0 |

Architecture

Preprocessor: 16 kHz → 80-bin mel spectrogram (Hann window, n_fft=512, hop=160, win=400)

Encoder (Jasper-style):
  Block 0 (prolog):  DW-Conv(80, k=3) + PW-Conv(80→1024) + BN + SE + ReLU
  Block 1:           3× DW-Sep-Conv(1024, k=7)  + SE + residual + ReLU
  Block 2:           3× DW-Sep-Conv(1024, k=11) + SE + residual + ReLU
  Block 3:           3× DW-Sep-Conv(1024, k=15) + SE + residual + ReLU
  Block 4 (epilog):  DW-Conv(1024, k=1) + PW-Conv(1024→3072) + BN + SE + ReLU

Decoder:
  ASP (Attentive Statistics Pooling): 3072 → 6144
  BN + Linear: 6144 → 192
  L2-normalize
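
The channel doubling in the ASP step (3072 → 6144) comes from concatenating an attention-weighted mean and an attention-weighted standard deviation per channel. The following is a minimal pure-Python sketch of that idea, not the CrispASR or NeMo implementation: the function names are hypothetical, the attention logits are passed in directly (in the real model they come from a small learned network), and the BN + Linear projection between pooling and normalization is omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_stats_pooling(frames, scores, eps=1e-10):
    """Pool T frames of C channels into a single vector of length 2*C.

    frames: list of T frames, each a list of C channel values.
    scores: attention logits with the same shape (assumed given here).
    """
    num_channels = len(frames[0])
    means, stds = [], []
    for c in range(num_channels):
        # Attention weights are normalized over the time axis, per channel.
        weights = softmax([row[c] for row in scores])
        values = [row[c] for row in frames]
        mean = sum(w * v for w, v in zip(weights, values))
        var = sum(w * v * v for w, v in zip(weights, values)) - mean * mean
        means.append(mean)
        stds.append(math.sqrt(max(var, eps)))  # clamp to avoid sqrt of <= 0
    return means + stds  # concatenation doubles the channel count

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

# Toy input: T=2 frames, C=3 channels, uniform attention (all logits equal).
frames = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pooled = attentive_stats_pooling(frames, scores)
# In the real decoder, BN + Linear (6144 -> 192) would run before this step.
unit = l2_normalize(pooled)
print(len(frames[0]), "->", len(pooled))
```

With 3072 input channels the same concatenation yields the 6144-dimensional vector listed above.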

Usage with CrispASR

Speaker enrollment

# Enroll a speaker from a reference audio clip
crispasr --enroll-speaker alice \
         --speaker-db ./speakers \
         -f alice_reference.wav

Speaker identification during transcription

# Transcribe with speaker identification
crispasr --backend parakeet \
         --speaker-db ./speakers \
         -f meeting.wav
# Output: (alice) Hello everyone...
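
Conceptually, identification means comparing the embedding of an utterance against each enrolled speaker and accepting the best match only if it is similar enough. The sketch below illustrates that idea under stated assumptions; it is not CrispASR's actual matching logic. The function `best_match` and the in-memory `enrolled` dictionary are hypothetical, and the embeddings are assumed to be L2-normalized already, so a plain dot product equals the cosine similarity.

```python
def best_match(query, enrolled, threshold=0.7):
    """Return (name, score) for the closest enrolled speaker.

    The name is None when no enrolled speaker reaches the threshold.
    All vectors are assumed to have unit length.
    """
    best_name, best_score = None, -1.0
    for name, reference in enrolled.items():
        # Dot product of unit vectors is their cosine similarity.
        score = sum(q * r for q, r in zip(query, reference))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Toy 2-D embeddings stand in for the 192-D TitaNet vectors.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
accepted = best_match([0.8, 0.6], enrolled)
rejected = best_match([0.6, 0.8], enrolled, threshold=0.9)
print(accepted, rejected)
```

A stricter threshold trades missed identifications for fewer false matches; the right value depends on the recording conditions.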

Standalone embedding extraction

# Extract speaker embedding (test binary)
test-titanet titanet-large.gguf audio1.wav audio2.wav
# Prints cosine similarity matrix
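
For readers who want to reproduce the comparison outside the test binary, a cosine similarity matrix over a set of embeddings can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the exact output format of `test-titanet` is not specified here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(embeddings):
    """Pairwise cosine similarities; entry [i][j] compares clip i with clip j."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]

# Toy 2-D vectors stand in for per-clip 192-D speaker embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(embeddings)
for row in matrix:
    print(" ".join(f"{value:.3f}" for value in row))
```

The diagonal is always 1 (each clip compared with itself), and the matrix is symmetric, so only the upper triangle carries information.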

Python

from crispasr import TitaNet, SpeakerDB

with TitaNet("titanet-large.gguf") as model:
    emb = model.embed(pcm_16k_float32)

db = SpeakerDB("./speakers")
db.enroll("alice", emb)
name, score = db.match(emb, threshold=0.7)
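
The snippet above leaves `pcm_16k_float32` undefined. One way to obtain such a buffer from a 16-bit mono WAV file, using only the standard library, is sketched below. The helper `load_pcm16_mono_16k` is hypothetical, and both the [-1, 1) scaling and the container type accepted by `model.embed` are assumptions; check the CrispASR Python bindings for the expected input.

```python
import array
import sys
import wave

def load_pcm16_mono_16k(path):
    """Read a 16 kHz, mono, 16-bit PCM WAV file as float32 samples in [-1, 1)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16 kHz sample rate")
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Assumed scaling: divide by 2**15 to map int16 onto [-1, 1).
    return array.array("f", (s / 32768.0 for s in samples))

# Hypothetical usage with the snippet above:
# pcm_16k_float32 = load_pcm16_mono_16k("alice_reference.wav")
```

The loader rejects other sample rates or channel layouts instead of resampling, so audio in a different format would need converting first.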

Conversion

python models/convert-titanet-to-gguf.py \
    --input nvidia/speakerverification_en_titanet_large \
    --output titanet-large.gguf

Verification

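Encoder + decoder parity with the NeMo reference: cos = 0.999997 when the reference mel features are injected directly, bypassing the native front-end. End-to-end parity from raw audio: cos = 0.917; the gap is attributed to float32 STFT precision differences in the mel front-end.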

Citation

@inproceedings{koluguri2022titanet,
  title={TitaNet: Neural Model for Speaker Representation with 1D Depth-wise Separable Convolutions and Global Context},
  author={Koluguri, Nithin Rao and Park, Taejin and Ginsburg, Boris},
  booktitle={ICASSP 2022},
  year={2022}
}