Difference between Speaker-wavLM-pro and Speaker-wavLM-tbr
Hello, I want to know the difference between Speaker-wavLM-pro and Speaker-wavLM-tbr. I downloaded both models and found that Speaker-wavLM-tbr is more discriminative in scoring. Is it because of the different fine-tuning datasets or different fine-tuning methods? Please help me answer this question, thank you very much.
Hi labi,
The two models focus on different aspects of the voice. Speaker-wavLM-pro compares only the prosodic aspects (melody, rhythm, ...), while Speaker-wavLM-tbr focuses on timbral characteristics (frequency content). As you noticed, timbral cues are somewhat more discriminative than prosodic cues. For a basic speaker verification (ASV) task, however, the Speaker-wavLM-id model should be preferred: it takes all aspects of the voice into account and gives better speaker discrimination and ASV performance.
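To make the comparison concrete, here is a minimal sketch of scoring the same pair of utterances with each model's embeddings. It assumes the embeddings have already been extracted and that cosine similarity is the scoring function. The helper names (`cosine_similarity`, `score_pair`), the dictionary keys, and the toy vectors are all hypothetical placeholders, not real model outputs or part of any model's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def score_pair(embeddings_a, embeddings_b):
    """Score two utterances once per model.

    Each argument maps a (hypothetical) model key to that model's
    embedding of the utterance.
    """
    return {
        name: cosine_similarity(embeddings_a[name], embeddings_b[name])
        for name in embeddings_a
    }


# Toy placeholder vectors standing in for real embeddings (assumption).
utterance_a = {"pro": [1.0, 0.0], "tbr": [1.0, 0.0], "id": [1.0, 0.0]}
utterance_b = {"pro": [1.0, 1.0], "tbr": [0.0, 1.0], "id": [1.0, 2.0]}

for model_name, score in score_pair(utterance_a, utterance_b).items():
    print(f"{model_name}: {score:.3f}")
```

With real embeddings, a wider gap between same-speaker and different-speaker scores for one model is what "more discriminative" means in practice.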
The training procedure is described in the paper: https://www.isca-archive.org/interspeech_2024/gengembre24_interspeech.pdf. In short, training Speaker-wavLM-pro relies on data manipulation (voice conversion) to hide timbral aspects. Speaker-wavLM-tbr is then trained to capture cues complementary to those captured by Speaker-wavLM-pro.
I hope this helps.
Regards