Does this model identify speakers?
Let's say we have a call recording of 3 people. I want to extract the transcription like:
[speaker a] : hello, which alcohol should we buy tonight?
[speaker b]: whisky
[speaker c]: rum or vodka!!! whisky is costly.
I am also curious about this.
You can use sortformer on top of ASR to get speaker labels: https://huggingface.co/nvidia/diar_sortformer_4spk-v1
@nithinraok, can you please share a sample code snippet on how to feed the word timestamps I got from asr_model (output[0].timestamp['word']) to the diar_model to get a properly aligned speaker diarization?
Does Sortformer have any direct connection to Parakeet, or is it a standalone model?
- Run Parakeet ASR -> get word-level timestamps
- Run Sortformer -> get speaker-level timestamps
- Merge them (see sample code here: https://github.com/NVIDIA/NeMo/blob/77a1697265c7ae48acb4d14e0898b2742f325239/nemo/collections/asr/parts/utils/diarization_utils.py#L819); a rough end-to-end sketch is below
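A rough sketch of those three steps. The ASR part follows the parakeet-tdt-0.6b-v2 model card; the parsing of the diarize() output is an assumption based on the Sortformer model card, and the merge is a naive midpoint match rather than the NeMo diarization_utils logic:

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.models import SortformerEncLabelModel

AUDIO = "call.wav"  # hypothetical mono 16 kHz recording

# 1) ASR with word-level timestamps (parakeet-tdt-0.6b-v2)
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
asr_out = asr_model.transcribe([AUDIO], timestamps=True)
words = asr_out[0].timestamp["word"]  # list of dicts with "word", "start", "end"

# 2) Speaker diarization (diar_sortformer_4spk-v1)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
diar_out = diar_model.diarize(audio=AUDIO, batch_size=1)
# Assumed output format: diar_out[0] is a list of "start end speaker_N" strings
segments = []
for seg in diar_out[0]:
    start, end, spk = seg.split()
    segments.append((float(start), float(end), spk))

# 3) Naive merge: give each word the speaker whose segment contains its midpoint
def speaker_at(t):
    for seg_start, seg_end, spk in segments:
        if seg_start <= t <= seg_end:
            return spk
    return "unknown"

lines, cur_spk, cur_words = [], None, []
for w in words:
    spk = speaker_at((w["start"] + w["end"]) / 2)
    if spk != cur_spk and cur_words:
        lines.append(f"[{cur_spk}]: {' '.join(cur_words)}")
        cur_words = []
    cur_spk = spk
    cur_words.append(w["word"])
if cur_words:
    lines.append(f"[{cur_spk}]: {' '.join(cur_words)}")

print("\n".join(lines))
```

The midpoint rule is the simplest possible heuristic; it starts to break down when the ASR and diarizer boundaries disagree, which is exactly what the following posts discuss.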
parakeet-tdt-0.6b-v2 timestamps for word (or segment) are different from the timestamps produced by diar_sortformer_4spk-v1, especially on real recordings where people interrupt each other. The speaker.start can be less than, equal to, or greater than word.start, and the same goes for speaker.end and word.end.
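One way to make the merge less sensitive to those boundary shifts is to give each word to the speaker segment that overlaps it the most, rather than requiring the word to fall strictly inside a segment. A minimal sketch, assuming the word dicts from output[0].timestamp['word'] and (start, end, speaker) tuples parsed from the diarizer output as in the snippet above:

```python
def assign_speaker(word, segments):
    """Pick the speaker whose segment overlaps this word the most.

    word: dict with "start" / "end" times in seconds
    segments: list of (start, end, speaker) tuples from the diarizer
    """
    best_spk, best_overlap = "unknown", 0.0
    for seg_start, seg_end, spk in segments:
        # The overlap stays positive whether speaker.start is before, equal to,
        # or after word.start (and likewise for the end boundary).
        overlap = min(word["end"], seg_end) - max(word["start"], seg_start)
        if overlap > best_overlap:
            best_spk, best_overlap = spk, overlap
    return best_spk
```

This avoids dropping words whose boundaries fall just outside every speaker segment, though it does not by itself solve the short-interjection problem described below.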
I can see that this issue is acknowledged and somewhat addressed in realign_words_with_lm. I tried the same approach, but I am unhappy with the result on real recordings: words go missing (especially at the start/end), a single long stretch of speech is not split into sentences (so it is hard to rebuild something that looks like an ASR segment), and short interruptions ("yeah", "a-ha", "oh really?", "hm-mmm", etc.) from the 2nd speaker sometimes end up inside the 1st speaker's sentence.
This is not a solution, just a workaround, as both are standalone models.
@nithinraok, the license on nvidia/diar_sortformer_4spk-v1 is non-commercial use, whereas nvidia/parakeet-tdt-0.6b-v2 allows commercial use. Can the diar license be changed to commercial as well?