Only English transcriptions on Dutch transcribe task?

#13

by RikRaes - opened May 9, 2023

May 9, 2023

When I run the transcribe task on the Dutch Common Voice data (downloaded locally), I only get English transcriptions from the tiny, base, and small models, which are the ones I have tested so far. I assume there is a mistake in the code or in the way I use the pipeline. Could anyone help me? My code is below.
pipe_whisper = pipeline(model="openai/whisper-tiny", device=device, tokenizer=WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Dutch", task="transcribe"))
df["transcription_whisper"] = df["path"].progress_apply(lambda path: pipe_whisper(DATA_COMMON_VOICE_PATH/path))

ArthurZ

Oct 10, 2023

Hey! This means one of three things:

the model is translating instead of transcribing
the model is bad at transcribing Dutch
the task is not being passed to the model properly

You should try forwarding the task to Whisper using pipe = pipeline(....., generate_kwargs={"task": "transcribe", "language": "Dutch"})
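For reference, here is a minimal sketch of that suggestion, not a confirmed fix. The helper names (whisper_pipeline_kwargs, make_pipeline, transcribe_file) are made up for illustration, and the model name is just the one from the original post. It assumes that the "Dutch" spelling from the suggestion above is accepted by the installed transformers version; accepted spellings may vary between versions. The ASR pipeline returns a dict, so the sketch also pulls out the "text" field.

```python
from pathlib import Path


def whisper_pipeline_kwargs(model_name="openai/whisper-tiny",
                            language="Dutch", task="transcribe"):
    """Build keyword arguments for transformers.pipeline (hypothetical helper)."""
    return {
        "task": "automatic-speech-recognition",
        "model": model_name,
        # Forwarded to the model's generate() call, so the language and task
        # are set at decoding time rather than only on the tokenizer.
        "generate_kwargs": {"task": task, "language": language},
    }


def make_pipeline(**overrides):
    """Create the ASR pipeline; this downloads the model on first use."""
    # Imported lazily so the helpers in this sketch work without transformers.
    from transformers import pipeline
    return pipeline(**whisper_pipeline_kwargs(**overrides))


def transcribe_file(pipe, audio_path):
    """Run the pipeline on one audio file and return only the transcription."""
    # The pipeline is given a plain string path and returns a dict with a "text" key.
    return pipe(str(Path(audio_path)))["text"]


# Usage, following the original post (df and DATA_COMMON_VOICE_PATH come from there):
# pipe_whisper = make_pipeline()
# df["transcription_whisper"] = df["path"].apply(
#     lambda p: transcribe_file(pipe_whisper, DATA_COMMON_VOICE_PATH / p))
```

If the output is still English with this setup, that points more toward the first two causes listed above than toward how the task is passed.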
