Bug report

#20
by JustinRocks - opened

I am using Colab to transcribe audio and hit the error below. Here is the stack trace:

/usr/local/lib/python3.11/dist-packages/nemo/core/classes/common.py in __attach_neural_type(self, obj, metadata, depth, name)
462
463 if type_shape is not None and len(value_shape) != len(type_shape):
--> 464 raise TypeError(
465 f"Output shape mismatch occured for {name} in module {self.__class__.__name__} : \n"
466 f"Output shape expected = {type_shape} | \n"

TypeError: Output shape mismatch occured for audio_signal in module AudioToBPEDataset :
Output shape expected = (batch, time) |
Output shape found : torch.Size([1, 1168057, 2])

As mentioned on the model card, the input should be single-channel 16 kHz audio. The trailing dimension of 2 in the shape that was found suggests my file was stereo.
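To confirm whether a file meets that requirement before transcribing, a quick check along these lines may help. This is only a minimal sketch using Python's standard `wave` module, so it assumes an uncompressed PCM WAV file; `describe_wav` and `is_mono_16k` are hypothetical helper names, not part of NeMo.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate) read from a PCM WAV file header."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate()


def is_mono_16k(path):
    """True if the file is single-channel and sampled at 16000 Hz."""
    channels, sample_rate = describe_wav(path)
    return channels == 1 and sample_rate == 16000
```

If `is_mono_16k(audio_path)` returns False, the file likely needs to be converted before it is passed to the model.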

I was able to work around it by re-encoding the file with ffmpeg:

import os
import subprocess
import soundfile as sf

# Construct the cleaned file path (splitext handles any input extension, not only .wav)
cleaned_path = os.path.splitext(audio_path)[0] + "_cleaned.wav"

# Run ffmpeg to re-encode the file to single-channel, 16kHz
subprocess.run([
    "ffmpeg", "-y",          # Overwrite without prompting
    "-i", audio_path,        # Input file
    "-ac", "1",              # Force mono audio
    "-ar", "16000",          # Force 16kHz sample rate
    cleaned_path
], check=True)

# Verify that the cleaned file has audio data using soundfile
data, sr = sf.read(cleaned_path)
if len(data) == 0:
    raise ValueError(f"Converted audio file {cleaned_path} has no data!")
else:
    print(f"Converted file has {len(data)} samples at {sr} Hz.")

# Use the cleaned file for transcription
audio_path = cleaned_path
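
If ffmpeg is not available, the channel downmix on its own can be sketched in NumPy. This assumes the `(frames, channels)` array layout that `soundfile.read` returns for multi-channel files; `downmix_to_mono` is a hypothetical helper name. It does not resample, so the audio would still need to be at 16 kHz already.

```python
import numpy as np


def downmix_to_mono(samples):
    """Average the channels of a (frames, channels) array into one channel.

    A 1-D array is assumed to be mono already and is returned unchanged.
    """
    samples = np.asarray(samples)
    if samples.ndim == 1:
        return samples
    # Average across the channel axis to get a (frames,) array
    return samples.mean(axis=1)
```

The result could then be written back out with `sf.write(cleaned_path, mono, sr)` before transcription.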
