|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
datasets: |
|
- ivrit-ai/crowd-transcribe-v5 |
|
- ivrit-ai/crowd-recital-whisper-training |
|
- ivrit-ai/knesset-plenums-whisper-training |
|
language: |
|
- he |
|
metrics: |
|
- wer |
|
base_model: |
|
- openai/whisper-large-v3-turbo |
|
pipeline_tag: automatic-speech-recognition |
|
--- |
|
|
|
# Model Card for ivrit-ai/whisper-large-v3-turbo
|
|
|
This model is a Hebrew finetune (continued training) of the OpenAI Whisper Large v3 Turbo model. |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
- **Developed by:** ivrit-ai |
|
- **Language(s) (NLP):** Hebrew |
|
- **License:** Apache-2.0 |
|
- **Finetuned from model:** openai/whisper-large-v3-turbo
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The language-detection capability of this model was degraded during training; it is intended for transcription of mostly-Hebrew audio.

The language token should therefore be set explicitly to Hebrew, as in the sketch below.
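A minimal sketch of pinning the language token via `model.generate` (the repo id and the silent stand-in audio are assumptions for illustration):

```python
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "ivrit-ai/whisper-large-v3-turbo"  # assumed repo id
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# 16 kHz mono waveform; a silent 5-second clip stands in for real audio.
waveform = np.zeros(16_000 * 5, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

# Pin the decoder to Hebrew transcription instead of relying on the
# (degraded) automatic language detection.
ids = model.generate(inputs.input_features, language="he", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```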
|
|
|
Additionally, the translation task was not trained on and has also degraded; this model is not able to translate in any reasonable capacity.
|
|
|
## How to Get Started with the Model |
|
|
|
Please follow the usage instructions in the original [model card](https://huggingface.co/openai/whisper-large-v3-turbo#usage), substituting this model's name.

You can also find other weight formats and quantizations on the [ivrit-ai](https://huggingface.co/ivrit-ai) HF page.
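For example, a sketch adapted from the upstream usage snippet (the repo id here is an assumption; substitute the id of this repository):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "ivrit-ai/whisper-large-v3-turbo"  # assumed repo id

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Remember to force Hebrew (see "Bias, Risks, and Limitations" above).
result = pipe("hebrew_audio.mp3", generate_kwargs={"language": "he"})
print(result["text"])
```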
|
|
|
We provide simple example scripts for using this model and its weights with other inference runtimes.
|
Find those in the ["examples"](https://github.com/ivrit-ai/asr-training/tree/master/examples) folder within the training GitHub repo. |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
This model was trained on the following datasets: |
|
|
|
- [ivrit-ai/crowd-transcribe-v5](https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5) - Publicly accessible audio sources, crowd-transcribed segment by segment - ~300h

- [ivrit-ai/crowd-recital-whisper-training](https://huggingface.co/datasets/ivrit-ai/crowd-recital-whisper-training) - Crowd-sourced recordings of Wikipedia article snippets - ~50h

- [ivrit-ai/knesset-plenums-whisper-training](https://huggingface.co/datasets/ivrit-ai/knesset-plenums-whisper-training) - A subset of Knesset (the Israeli parliament) plenum protocols - ~325h
|
|
|
### Training Procedure |
|
|
|
This model is a weighted average of the lowest-eval-loss checkpoints (from around the end of epoch 2) of two separate runs with the same setup.

Training code can be found in the ivrit-ai GitHub [repository](https://github.com/ivrit-ai/asr-training).
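A minimal sketch of that kind of checkpoint averaging (the checkpoint paths and the equal 0.5/0.5 weighting are illustrative assumptions):

```python
import torch
from transformers import WhisperForConditionalGeneration

# Hypothetical best checkpoints from the two runs.
model_a = WhisperForConditionalGeneration.from_pretrained("run-a/checkpoint-best")
model_b = WhisperForConditionalGeneration.from_pretrained("run-b/checkpoint-best")

# Average the state dicts parameter by parameter (equal weights shown;
# any convex combination works the same way).
merged = model_a.state_dict()
with torch.no_grad():
    for name, param_b in model_b.state_dict().items():
        if merged[name].dtype.is_floating_point:
            merged[name] = 0.5 * merged[name] + 0.5 * param_b

model_a.load_state_dict(merged)
model_a.save_pretrained("whisper-large-v3-turbo-merged")
```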
|
|
|
#### Preprocessing |
|
|
|
The "Crowd Recital" and "Knesset" datasets contain timestamps and previous text following the Whisper expected inputs. |
|
Timestamps were used from 40% of samples from those datasets, and 50% of the previous text was used. |
|
|
|
The "Crowd Transcribe" datasets has no timestamps or previous text and this preprocessing only included melspec feature extraction and text encoding. |
|
|
|
Preprocessing code can be found within the training code [repository](https://github.com/ivrit-ai/asr-training). |
|
|
|
Datasets were interleaved with a 0.15:0.8:0.05 ratio (knesset:crowd-transcribe:crowd-recital).
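With Hugging Face `datasets`, that interleaving might look like the following (the `train` split names are assumptions):

```python
from datasets import interleave_datasets, load_dataset

knesset = load_dataset("ivrit-ai/knesset-plenums-whisper-training", split="train")
crowd_transcribe = load_dataset("ivrit-ai/crowd-transcribe-v5", split="train")
crowd_recital = load_dataset("ivrit-ai/crowd-recital-whisper-training", split="train")

# Sample from the three sources at the ratio described above.
mixed = interleave_datasets(
    [knesset, crowd_transcribe, crowd_recital],
    probabilities=[0.15, 0.8, 0.05],
    seed=42,
)
```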
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** bf16 mixed precision with SDPA attention

- **Learning Rate:** 1e-5 peak, linear decay with 800 warmup steps, over 3 epochs

- **Batch Size:** 32
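A sketch of how these hyperparameters might map onto `Seq2SeqTrainingArguments` (the per-device/global batch split across the 8 GPUs and the output path are assumptions):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-turbo-he",  # assumed output path
    num_train_epochs=3,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=800,
    per_device_train_batch_size=4,  # 4 x 8 GPUs = 32 global (assumed split)
    bf16=True,  # bf16 mixed precision
)
```

SDPA attention is selected when the model is loaded, e.g. `from_pretrained(..., attn_implementation="sdpa")`.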
|
|
|
#### Training Hardware / Duration
|
|
|
- **GPU Type:** 8 x Nvidia A40 (single machine)
|
- **Duration:** ~9h run, stopped at 3 epochs |
|
|
|
## Evaluation |
|
|
|
Please refer to the [ivrit-ai/hebrew-transcription-leaderboard](https://huggingface.co/spaces/ivrit-ai/hebrew-transcription-leaderboard) for evaluation results.