Voice identification

#102
by Michar64 - opened

Hi everyone,

I'm currently working on a project involving Google Vertex AI and could use your expertise, or a referral to someone with experience in speaker recognition:

I'm processing a 2-minute audio file featuring two speakers who alternate in short bursts of 2–3 seconds. Using Hugging Face's pyannote library, I perform speaker diarization and extract an embedding vector for each speech segment. The typical result is about 20 segments, roughly 10 per speaker. To construct a voiceprint for each speaker, I average the embedding vectors associated with that speaker.
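To make the setup concrete, here is a stripped-down sketch of what I'm doing. The model names, token handling, and file name are placeholders from my own setup, so please treat the exact calls as an approximation rather than a reference implementation:

```python
import numpy as np
from pyannote.audio import Pipeline, Model, Inference

AUDIO_FILE = "call.wav"  # placeholder: the 2-minute two-speaker recording

# 1. Diarization: who speaks when (gated models may need use_auth_token=...)
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = diarization_pipeline(AUDIO_FILE)

# 2. One embedding per speech segment
embedding_model = Inference(Model.from_pretrained("pyannote/embedding"), window="whole")

per_speaker = {}  # speaker label -> list of segment embeddings
for segment, _, speaker in diarization.itertracks(yield_label=True):
    emb = embedding_model.crop(AUDIO_FILE, segment)  # 1-D numpy vector
    per_speaker.setdefault(speaker, []).append(emb)

# 3. Voiceprint = mean of that speaker's segment embeddings
voiceprints = {
    speaker: np.mean(np.stack(embs), axis=0)
    for speaker, embs in per_speaker.items()
}
```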

I have two main questions:

  1. Is this a sound approach for generating speaker embeddings?
    In practice, the results are inconsistent. For instance, comparing the same speaker across different files sometimes yields cosine similarity scores around 0.7, below the expected 0.8+ range. On the other hand, embeddings for different speakers occasionally score as high as 0.68, which seems surprisingly close. (The comparison I'm running is sketched after these questions.)

  2. Is there a recommended duration for voiceprint generation?
    We've read that voiceprints should ideally be based on no more than 10 seconds of audio, and that longer segments may reduce embedding quality. Does this hold true in practice?
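For reference, this is roughly how I compare two voiceprints across files; the variable names are just placeholders for the averaged vectors produced by the step above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# voiceprint_a and voiceprint_b would be the averaged embeddings of the
# (supposedly) same speaker taken from two different recordings.
# score = cosine_similarity(voiceprint_a, voiceprint_b)
# Same speaker across files: sometimes only ~0.7
# Different speakers: occasionally as high as ~0.68
```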

Thank you. 
