Voice identification

#102
by Michar64 - opened

Hi everyone,

I'm currently working on a project involving Google Vertex AI and could use your expertise, or a referral to someone with experience in speaker recognition:

I'm processing a 2-minute audio file featuring two speakers who alternate in short bursts of 2–3 seconds. Using Hugging Face's pyannote library, I perform speaker diarization and extract an embedding vector for each speech segment. The typical result is about 20 segments, roughly 10 per speaker. To construct a voiceprint for each speaker, I average the embedding vectors associated with that speaker.
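To make the setup concrete, here is a stripped-down sketch of what I'm doing. The model names, token handling, and file name are placeholders from my own setup, so please treat the exact calls as an approximation rather than a reference implementation:

```python
import numpy as np
from pyannote.audio import Pipeline, Model, Inference

AUDIO_FILE = "call.wav"  # placeholder: the 2-minute two-speaker recording

# 1. Diarization: who speaks when (gated models may need use_auth_token=...)
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = diarization_pipeline(AUDIO_FILE)

# 2. One embedding per speech segment
embedding_model = Inference(Model.from_pretrained("pyannote/embedding"), window="whole")

per_speaker = {}  # speaker label -> list of segment embeddings
for segment, _, speaker in diarization.itertracks(yield_label=True):
    emb = embedding_model.crop(AUDIO_FILE, segment)  # 1-D numpy vector
    per_speaker.setdefault(speaker, []).append(emb)

# 3. Voiceprint = mean of that speaker's segment embeddings
voiceprints = {
    speaker: np.mean(np.stack(embs), axis=0)
    for speaker, embs in per_speaker.items()
}
```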

I have two main questions:

  1. Is this a sound approach for generating speaker embeddings?
    In practice, the results are inconsistent. For instance, comparing the same speaker across different files sometimes yields cosine similarity scores around 0.7, below the expected 0.8+ range. On the other hand, embeddings for different speakers occasionally score as high as 0.68, which seems surprisingly close. (The comparison I'm running is sketched after these questions.)

  2. Is there a recommended duration for voiceprint generation?
    We've read that voiceprints should ideally be based on no more than 10 seconds of audio, and that longer segments may reduce embedding quality. Does this hold true in practice?
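For reference, this is roughly how I compare two voiceprints across files; the variable names are just placeholders for the averaged vectors produced by the step above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# voiceprint_a and voiceprint_b would be the averaged embeddings of the
# (supposedly) same speaker taken from two different recordings.
# score = cosine_similarity(voiceprint_a, voiceprint_b)
# Same speaker across files: sometimes only ~0.7
# Different speakers: occasionally as high as ~0.68
```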

Thank you. 
