Pichilti-base for Automatic Speech Recognition in Azerbaijani

The Baku Higher Oil School Research and Development Center on AI presents its research on building a better monolingual Whisper model. The model takes audio as input and converts it to text. It was trained in a self-supervised way on more than 500,000 unlabelled audio recordings, which allowed us to bypass the costly labelling process.

The model was originally pre-trained by OpenAI on multilingual data. The original model is called Whisper and comes in several sizes: tiny, base, small, medium, large-v2, and large-v3. Computational cost grows with model size. Smaller versions are therefore preferred for large-scale operations, while the large models are used in production where accuracy is critical.
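
To make the size versus cost trade-off concrete, the sketch below estimates how much memory the weights alone occupy in float32 for each Whisper size. The parameter counts are the approximate figures listed in OpenAI's Whisper README; the helper name `weight_memory_mb` is ours, and the estimate ignores activations and any runtime overhead.

```python
# Approximate parameter counts, in millions, as listed in OpenAI's Whisper README.
# large-v2 and large-v3 share the size of "large".
WHISPER_PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

BYTES_PER_FLOAT32 = 4


def weight_memory_mb(params_millions: float, bytes_per_param: int = BYTES_PER_FLOAT32) -> float:
    """Memory taken by the weights alone, in megabytes (10^6 bytes)."""
    # (millions of parameters * 10^6) * bytes per parameter / 10^6 bytes per MB
    return params_millions * bytes_per_param


for name, millions in WHISPER_PARAMS_MILLIONS.items():
    print(f"{name:>6}: ~{millions}M parameters, ~{weight_memory_mb(millions):.0f} MB in float32")
```

The same arithmetic applied to the 72.6M parameters reported for this model gives roughly 290 MB of float32 weights.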

Given the strength of Whisper's pre-training, we decided to keep the model's encoder as it is: in our tests the encoder was very robust to noise when generating the necessary audio features. The main reason is that Whisper was trained on 680,000 hours of data and performs well in zero-shot settings, which contributes greatly to the model's stability. The decoder, however, is multitask: it handles both translation and transcription. We therefore froze the encoder and fine-tuned the rest of the model with self-supervised learning. This gave us a better CER (character error rate), while the computational cost decreased significantly.
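
The encoder-freezing step can be sketched as follows. This is a minimal illustration using the Hugging Face `transformers` Whisper classes, not the training code used for this model, which has not been published yet. The helper names `freeze_encoder` and `count_parameters` are ours, and the tiny randomly initialised configuration is only an assumption that lets the sketch run offline; in practice a real checkpoint would be loaded with `from_pretrained`.

```python
from transformers import WhisperConfig, WhisperForConditionalGeneration


def freeze_encoder(model: WhisperForConditionalGeneration) -> None:
    """Disable gradients for every encoder parameter so only the decoder is updated."""
    for param in model.model.encoder.parameters():
        param.requires_grad = False


def count_parameters(model) -> tuple:
    """Return (trainable, total) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


# Tiny, randomly initialised stand-in (hypothetical sizes) so the sketch runs
# without downloading weights.
config = WhisperConfig(
    d_model=16,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=32,
    decoder_ffn_dim=32,
)
model = WhisperForConditionalGeneration(config)

freeze_encoder(model)
trainable, total = count_parameters(model)
print(f"trainable parameters: {trainable} of {total}")
```

When building an optimizer after this step, passing only the parameters with `requires_grad=True` keeps the frozen encoder weights out of the update entirely.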

Because the research is ongoing, the training details will be published once the paper is accepted.

Try it out

To try this code on your own server or PC, first install two packages:

pip install openai-whisper
pip install transformers

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from whisper import load_audio

# Load the audio as a 16 kHz mono waveform (load_audio needs the ffmpeg command-line tool)
waveform = load_audio("test.mp3")

processor = WhisperProcessor.from_pretrained("BHOSAI/Pichilti-base-v1")
model = WhisperForConditionalGeneration.from_pretrained("BHOSAI/Pichilti-base-v1")

# Convert the waveform to log-Mel input features; load_audio resamples to 16 kHz
input_features = processor(
    waveform, sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token ids
predicted_ids = model.generate(input_features)

# Decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)
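
To check a transcription against a known reference text, the CER (character error rate) mentioned earlier can be computed as the character-level edit distance divided by the reference length. The sketch below is a plain-Python illustration under our own assumptions: the function names are ours, the strings are made-up examples, and the only normalisation applied is Unicode NFC. The evaluation protocol used for this model (casing, punctuation handling) has not been published, so results may not be directly comparable.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + deletions + insertions) / reference length."""
    # NFC keeps letters such as "ü" or "ş" as single characters in both strings.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example strings, for illustration only.
reference_text = "salam dünya"
model_output = "salam dunya"
print(f"CER: {character_error_rate(reference_text, model_output):.3f}")
```

Lower is better: a CER of 0 means the output matches the reference character for character, and the value can exceed 1 when the output contains many extra characters.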

About Us

The Baku Higher Oil School Research and Development Center on AI is a team of students with a passion for contributing to the open-source community of Azerbaijani NLP products. The center is based in Baku, Azerbaijan.

Model size: 72.6M params (Safetensors)
Tensor type: F32