Bug report
I am using Colab to transcribe audio. The stack trace is shown below:
/usr/local/lib/python3.11/dist-packages/nemo/core/classes/common.py in __attach_neural_type(self, obj, metadata, depth, name)
462
463 if type_shape is not None and len(value_shape) != len(type_shape):
--> 464 raise TypeError(
465 f"Output shape mismatch occured for {name} in module {self.__class__.__name__} : \n"
466 f"Output shape expected = {type_shape} | \n"
TypeError: Output shape mismatch occured for audio_signal in module AudioToBPEDataset :
Output shape expected = (batch, time) |
Output shape found : torch.Size([1, 1168057, 2])
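As mentioned on the model card, the input should be single-channel 16 kHz audio. The trailing dimension of 2 in the shape found above indicates that my file is stereo.
I was able to work around this by re-encoding the file with ffmpeg before transcription: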
import subprocess
import soundfile as sf
# Construct the cleaned file path
cleaned_path = audio_path.replace(".wav", "_cleaned.wav")
# Run ffmpeg to re-encode the file to single-channel, 16kHz
subprocess.run([
    "ffmpeg", "-y",    # Overwrite without prompting
    "-i", audio_path,  # Input file
    "-ac", "1",        # Force mono audio
    "-ar", "16000",    # Force 16kHz sample rate
    cleaned_path
], check=True)
# Verify that the cleaned file has audio data using soundfile
data, sr = sf.read(cleaned_path)
if len(data) == 0:
    raise ValueError(f"Converted audio file {cleaned_path} has no data!")
else:
    print(f"Converted file has {len(data)} samples at {sr} Hz.")
# Use the cleaned file for transcription
audio_path = cleaned_path
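To skip re-encoding for files that are already in the right format, a pre-check along these lines might help. This is only a minimal sketch using the standard-library wave module, so it assumes plain PCM WAV input (other encodings may fail to open). The helper names wav_format and needs_conversion are hypothetical and not part of NeMo.

```python
import wave


def wav_format(path):
    """Return (channels, sample_rate) read from a PCM WAV file header."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate()


def needs_conversion(path, channels=1, sample_rate=16000):
    """Return True if the file is not already mono 16 kHz.

    The defaults mirror the model card's stated input requirement;
    they are assumptions of this sketch, not values read from NeMo.
    """
    return wav_format(path) != (channels, sample_rate)
```

With this, the ffmpeg step above would only need to run when needs_conversion(audio_path) returns True.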