Does this model identify speakers?

#16
by SouravAhmed - opened

Let's say I have a call recording of three people. I want to extract a transcription like this:

[speaker a]: hello, which alcohol should we buy tonight?
[speaker b]: whisky
[speaker c]: rum or vodka!!! whisky is costly.

I am also curious to know about this.

NVIDIA org

You can use sortformer on top of ASR to get speaker labels: https://huggingface.co/nvidia/diar_sortformer_4spk-v1

@nithinraok, could you please share a sample code snippet showing how to feed the word timestamps I got from asr_model (output[0].timestamp['word']) to the diar_model to get properly aligned speaker diarization?

Does Sortformer have any direct connection to Parakeet, or is it a standalone model?

NVIDIA org
  1. Run Parakeet ASR to get word-level timestamps.
  2. Run Sortformer to get speaker-level timestamps.
  3. Merge them (see sample code here: https://github.com/NVIDIA/NeMo/blob/77a1697265c7ae48acb4d14e0898b2742f325239/nemo/collections/asr/parts/utils/diarization_utils.py#L819).
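
The merge in step 3 can be sketched in plain Python as follows. This is only a minimal illustration, not the NeMo implementation linked above. It assumes the word timestamps are already available as dicts with `word`, `start`, and `end` keys (in seconds) and that the diarization output has been parsed into `(start, end, speaker)` tuples. The helper names `overlap`, `distance`, and `assign_speakers` are hypothetical. Each word is given to the speaker segment it overlaps most, with a fallback to the nearest segment when nothing overlaps.

```python
from typing import Dict, List, Tuple

Word = Dict[str, object]            # assumed shape: {"word": str, "start": float, "end": float}
Segment = Tuple[float, float, str]  # assumed shape: (start_sec, end_sec, speaker_label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def distance(point: float, seg: Segment) -> float:
    """Distance in seconds from a time point to a segment (0 if inside it)."""
    if seg[0] <= point <= seg[1]:
        return 0.0
    return min(abs(point - seg[0]), abs(point - seg[1]))


def assign_speakers(words: List[Word], segments: List[Segment]) -> List[Word]:
    """Return copies of the words with an added "speaker" key."""
    labeled = []
    for w in words:
        ws, we = float(w["start"]), float(w["end"])
        speaker = "unknown"
        if segments:
            # Prefer the segment that covers the largest part of the word.
            best = max(segments, key=lambda s: overlap(ws, we, s[0], s[1]))
            if overlap(ws, we, best[0], best[1]) == 0.0:
                # No segment overlaps: fall back to the one nearest the word midpoint.
                mid = (ws + we) / 2.0
                best = min(segments, key=lambda s: distance(mid, s))
            speaker = best[2]
        labeled.append({**w, "speaker": speaker})
    return labeled


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    words = [
        {"word": "hello", "start": 0.10, "end": 0.40},
        {"word": "whisky", "start": 1.20, "end": 1.60},
        {"word": "yeah", "start": 1.80, "end": 1.95},   # falls in a gap between segments
        {"word": "rum", "start": 2.55, "end": 2.80},
    ]
    segments = [
        (0.0, 1.0, "speaker_0"),
        (1.1, 1.7, "speaker_1"),
        (2.6, 3.5, "speaker_2"),
    ]
    for w in assign_speakers(words, segments):
        print(f"[{w['speaker']}]: {w['word']}")
```

Maximum overlap is a simple rule. It does not handle overlapped speech, where two speakers are active during the same word.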

The word (or segment) timestamps from parakeet-tdt-0.6b-v2 differ from the speaker timestamps from diar_sortformer_4spk-v1, especially on real recordings where people interrupt each other.
speaker.start can be less than, equal to, or greater than word.start, and the same holds for speaker.end and word.end.
I can see that this issue is acknowledged and partly addressed in realign_words_with_lm.
I tried the same approach but was unhappy with the results on real recordings: words go missing (especially at the start and end), a single long stretch of speech is not split into sentences (so it is hard to rebuild it to look like an ASR segment), and short interruptions ("yeah", "a-ha", "oh really?", "hm-mmm", etc.) from the second speaker sometimes end up in the first speaker's sentence.
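
For the sentence-splitting part specifically, one possible sketch is to regroup speaker-labeled words into sentence-like segments. It assumes each word is a dict with `word`, `start`, `end`, and `speaker` keys, and that the ASR text carries punctuation. The function name `group_into_segments` and the 0.8 s pause threshold are hypothetical choices for illustration. It starts a new segment on a speaker change, after sentence-ending punctuation, or after a long pause. It does not address missing words or overlapped speech.

```python
from typing import Dict, List

SENTENCE_END = (".", "?", "!")


def group_into_segments(labeled_words: List[Dict], max_gap: float = 0.8) -> List[Dict]:
    """Group speaker-labeled words into sentence-like segments.

    A new segment starts when the speaker changes, when the previous word
    ended a sentence, or when the pause before the word exceeds max_gap seconds.
    """
    segments: List[Dict] = []
    current = None
    for w in labeled_words:
        start_new = (
            current is None
            or w["speaker"] != current["speaker"]
            or current["text"].endswith(SENTENCE_END)
            or w["start"] - current["end"] > max_gap
        )
        if start_new:
            current = {
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            }
            segments.append(current)
        else:
            # Extend the running segment with this word.
            current["text"] += " " + w["word"]
            current["end"] = w["end"]
    return segments


if __name__ == "__main__":
    # Made-up example: a short interruption between two sentences of one speaker.
    labeled = [
        {"word": "So", "start": 0.00, "end": 0.20, "speaker": "speaker_0"},
        {"word": "we", "start": 0.25, "end": 0.40, "speaker": "speaker_0"},
        {"word": "agreed.", "start": 0.45, "end": 0.90, "speaker": "speaker_0"},
        {"word": "yeah", "start": 0.95, "end": 1.10, "speaker": "speaker_1"},
        {"word": "Next", "start": 1.20, "end": 1.50, "speaker": "speaker_0"},
        {"word": "topic.", "start": 1.55, "end": 2.00, "speaker": "speaker_0"},
    ]
    for seg in group_into_segments(labeled):
        print(f"[{seg['speaker']}] {seg['start']:.2f}-{seg['end']:.2f}: {seg['text']}")
```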

NVIDIA org

This is not a solution, just a workaround, as both are standalone models.

@nithinraok the license on nvidia/diar_sortformer_4spk-v1 is for non-commercial use, whereas nvidia/parakeet-tdt-0.6b-v2 allows commercial use. Can the diarization model's license be changed to allow commercial use as well?
