Orange/Speaker-wavLM-pro · Difference between Speaker-wavLM-pro and Speaker-wavLM-tbr

Hi labi,

The two models focus on different aspects of the voice. With Speaker-wavLM-pro you compare only the prosodic aspects of the voice (~ melody, rhythm...), while Speaker-wavLM-tbr focuses on timbral characteristics (~ frequency content). As you noticed, timbral cues are a bit more discriminative than prosodic cues. For a basic speaker verification (ASV) task, however, the Speaker-wavLM-id model should be preferred: it takes all aspects of the voice into account, which leads to better speaker discrimination and ASV performance.
How the models were trained is explained in the paper: https://www.isca-archive.org/interspeech_2024/gengembre24_interspeech.pdf. In short, training Speaker-wavLM-pro relies on data manipulation (voice conversion) to hide the timbral aspects of the voice. Speaker-wavLM-tbr is then trained to capture cues complementary to those captured by Speaker-wavLM-pro.
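
To make the comparison step concrete, here is a minimal sketch of how two embeddings could be scored. It assumes the embeddings are compared with cosine similarity, which is a common choice in speaker verification. The toy vectors, the `same_speaker` helper and the `threshold` value are hypothetical placeholders, not values taken from the models or the paper. Extracting real embeddings from the models is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Hypothetical decision rule: accept when the score reaches the threshold.

    The default threshold is an arbitrary placeholder and would have to be
    tuned on your own data for each model.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy vectors standing in for embeddings (real ones have many more dimensions).
emb_1 = [0.2, 0.9, -0.1]
emb_2 = [0.25, 0.8, 0.0]
emb_3 = [-0.7, 0.1, 0.6]

print(cosine_similarity(emb_1, emb_2))  # close vectors, high score
print(cosine_similarity(emb_1, emb_3))  # dissimilar vectors, low score
```

The same scoring can be applied to embeddings from any of the three models. Only what the score reflects changes: prosody for Speaker-wavLM-pro, timbre for Speaker-wavLM-tbr, and the voice as a whole for Speaker-wavLM-id.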

I hope this helps.
Regards