Does this model identify speakers?
Let's say we have a call recording of 3 people. I want to extract the transcription like:
[speaker a] : hello, which alcohol should we buy tonight?
[speaker b]: whisky
[speaker c]: rum or vodka!!! whisky is costly.
I am also curious about this.
You can use sortformer on top of ASR to get speaker labels: https://huggingface.co/nvidia/diar_sortformer_4spk-v1
@nithinraok, can you please share a sample code snippet on how to feed the word timestamps I got from asr_model (output[0].timestamp['word']) to the diar_model to get a properly aligned speaker diarization?
Does Sortformer have any direct connection to Parakeet, or is it a standalone model?
- Run Parakeet ASR -> get word-level timestamps
- Run Sortformer -> get speaker-level timestamps
- Merge them (see sample code here: https://github.com/NVIDIA/NeMo/blob/77a1697265c7ae48acb4d14e0898b2742f325239/nemo/collections/asr/parts/utils/diarization_utils.py#L819); a rough end-to-end sketch is below
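A rough sketch of those three steps. The ASR part follows the parakeet-tdt-0.6b-v2 model card; the parsing of the diarize() output is an assumption based on the Sortformer model card, and the merge is a naive midpoint match rather than the NeMo diarization_utils logic:

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.models import SortformerEncLabelModel

AUDIO = "call.wav"  # hypothetical mono 16 kHz recording

# 1) ASR with word-level timestamps (parakeet-tdt-0.6b-v2)
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
asr_out = asr_model.transcribe([AUDIO], timestamps=True)
words = asr_out[0].timestamp["word"]  # list of dicts with "word", "start", "end"

# 2) Speaker diarization (diar_sortformer_4spk-v1)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
diar_out = diar_model.diarize(audio=AUDIO, batch_size=1)
# Assumed output format: diar_out[0] is a list of "start end speaker_N" strings
segments = []
for seg in diar_out[0]:
    start, end, spk = seg.split()
    segments.append((float(start), float(end), spk))

# 3) Naive merge: give each word the speaker whose segment contains its midpoint
def speaker_at(t):
    for seg_start, seg_end, spk in segments:
        if seg_start <= t <= seg_end:
            return spk
    return "unknown"

lines, cur_spk, cur_words = [], None, []
for w in words:
    spk = speaker_at((w["start"] + w["end"]) / 2)
    if spk != cur_spk and cur_words:
        lines.append(f"[{cur_spk}]: {' '.join(cur_words)}")
        cur_words = []
    cur_spk = spk
    cur_words.append(w["word"])
if cur_words:
    lines.append(f"[{cur_spk}]: {' '.join(cur_words)}")

print("\n".join(lines))
```

The midpoint rule is the simplest possible heuristic; it starts to break down when the ASR and diarizer boundaries disagree, which is exactly what the following posts discuss.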
parakeet-tdt-0.6b-v2 timestamps for word (or segment) are different from the timestamps produced by diar_sortformer_4spk-v1, especially on real recordings where people interrupt each other. The speaker.start can be less than, equal to, or greater than word.start, and the same goes for speaker.end and word.end.
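One way to make the merge less sensitive to those boundary shifts is to give each word to the speaker segment that overlaps it the most, rather than requiring the word to fall strictly inside a segment. A minimal sketch, assuming the word dicts from output[0].timestamp['word'] and (start, end, speaker) tuples parsed from the diarizer output as in the snippet above:

```python
def assign_speaker(word, segments):
    """Pick the speaker whose segment overlaps this word the most.

    word: dict with "start" / "end" times in seconds
    segments: list of (start, end, speaker) tuples from the diarizer
    """
    best_spk, best_overlap = "unknown", 0.0
    for seg_start, seg_end, spk in segments:
        # The overlap stays positive whether speaker.start is before, equal to,
        # or after word.start (and likewise for the end boundary).
        overlap = min(word["end"], seg_end) - max(word["start"], seg_start)
        if overlap > best_overlap:
            best_spk, best_overlap = spk, overlap
    return best_spk
```

This avoids dropping words whose boundaries fall just outside every speaker segment, though it does not by itself solve the short-interjection problem described below.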
I can see that this issue is acknowledged and somewhat addressed in realign_words_with_lm. I tried the same approach, but I am unhappy with the result on real recordings: words go missing (especially at the start/end), a single long stretch of speech is not split into sentences (so it is hard to rebuild something that looks like an ASR segment), and short interruptions ("yeah", "a-ha", "oh really?", "hm-mmm", etc.) from the 2nd speaker sometimes end up inside the 1st speaker's sentence.
This is not a solution, just a workaround, as both are standalone models.
@nithinraok, the license on nvidia/diar_sortformer_4spk-v1 is non-commercial use, whereas nvidia/parakeet-tdt-0.6b-v2 allows commercial use. Can the diar license be changed to commercial as well?