---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- generated_from_trainer
model-index:
- name: Phi-4-mm-inst-asr-turkish-unf
  results: []
datasets:
- ysdede/khanacademy-turkish
- ysdede/khanacademy-turkish-math
- ysdede/commonvoice_17_tr_fixed
language:
- tr
---
# Phi-4-mm-inst-asr-turkish-unf

This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) for Turkish automatic speech recognition (ASR).
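A minimal inference sketch, not from the original card: it follows the usage pattern of the base microsoft/Phi-4-multimodal-instruct card, and the repo id below is assumed from the model name, so verify both before use.

```python
import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "ysdede/Phi-4-mm-inst-asr-turkish-unf"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Prompt format follows the base model's card: an audio placeholder plus an
# instruction inside the chat template.
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

audio, sr = sf.read("turkish_sample.wav")  # 16 kHz mono recommended
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
text = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(text)
```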

**Model Background**:
This benchmark evaluates a fine-tuned version of Microsoft's **Phi-4-multimodal-instruct**, a multimodal model not originally designed for Turkish ASR. Key points:

1. **Initial Limitations**:
   - No Turkish ASR support in the base model
   - Initial WER above 100%

2. **Fine-Tuning Process**:
   - Unfroze encoder layers for Turkish adaptation (see the sketch after this list)
   - Trained for 1 epoch on Turkish audio-text pairs

3. **Current Status**:
   - Achieved a significant WER reduction (100+% → 9.7% on CommonVoice)*
   - Still under active development for better generalization
   - Results shared as incremental progress documentation
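A hedged sketch of the unfreezing step; the `"audio"` name filter is an assumption about Phi-4-multimodal's parameter naming, so inspect `model.named_parameters()` before relying on it:

```python
# Illustrative only: freeze the whole model, then re-enable gradients for
# parameters that appear to belong to the audio encoder/projector.
# The "audio" substring match is an assumption about the checkpoint's naming.
for param in model.parameters():
    param.requires_grad = False

for name, param in model.named_parameters():
    if "audio" in name.lower():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```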

**Why This Matters**:
- Demonstrates the adaptability of multimodal architectures
- Provides a baseline for Turkish ASR in resource-constrained scenarios
- Encourages exploration of under-supported languages

\* **Note on CommonVoice Results**: CommonVoice's relatively low WER (9.7%) may benefit from:
- Potential speaker leakage between splits (same speakers in train/test)
- Clean audio conditions despite non-professional recordings
- Short utterance structure (average 4-5 seconds)

See the "Dataset Notes" section below for full context on CommonVoice characteristics.

### Benchmark Results

**Testing Environment**: Google Colab with L4 GPU (24 GB VRAM). Inference speed is reported as a multiple of real time (xRT); x26 means roughly 26 minutes of audio transcribed per minute of wall-clock time.

| Dataset                            | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |
| :--------------------------------- | ------: | ------: | --------------------: | ---------: | -------------------: | ----------------: |
| ysdede/commonvoice_17_tr_fixed     | 9.70    | 2.72    | x26                   | 32         | 7.1                  | 8,576             |
| erenfazlioglu/turkishvoicedataset  | 11.52   | 3.93    | x20                   | 16         | 7.8                  | 2,496             |
| ysdede/khanacademy-turkish         | 12.04   | 7.78    | x16                   | 16         | 3.8                  | 1,344             |
| ysdede/yeni-split-0                | 20.58   | 13.20   | x16                   | 16         | 18.0                 | 5,936             |
| ymoslem/MediaSpeech                | 25.48   | 15.16   | x35                   | 32         | 10.0                 | 2,496             |
| dssnt1                             | 27.23   | 9.60    | x12                   | 16         | 2.5                  | 1,200             |
| ysdede/yeni-split-lq-noisy         | 39.40   | 27.00   | x19                   | 16         | 12.0                 | 3,440             |

**Dataset Notes**:
- **Fine-tuning Datasets**:
  - `commonvoice_17_tr_fixed`: Crowd-sourced clean speech (not professional studio recordings) with shuffled splits, so there is potential **speaker leakage** (the same speakers appear in train and test with different utterances)
  - `khanacademy-turkish`: Educational lectures with STEM vocabulary
  - `yeni-split-0`: Noisy real-world recordings
- **Benchmark-only Datasets**:
  - `turkishvoicedataset`: Synthetic TTS news (clean but artificial prosody)
  - `yeni-split-lq-noisy`: Challenging noisy samples with alignment errors
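For reference, a small snippet for pulling one of these sets; the split and column names are assumptions, so check each dataset card:

```python
from datasets import load_dataset, Audio

# Split/column names are assumptions; verify on the dataset card.
ds = load_dataset("ysdede/commonvoice_17_tr_fixed", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # resample for the model
sample = ds[0]
print(sample["audio"]["array"].shape, sample.get("sentence"))
```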

**Text Normalization Challenges**:
⚠️ Current WER/CER scores may be inflated due to:
1. Lack of a standardized Turkish ASR text normalization pipeline
2. Case/punctuation inconsistencies in references
3. Agglutinative morphology affecting word boundaries

**Evaluation Note**:
For Turkish ASR benchmarking, I developed a [text normalizer](https://github.com/ysdede/trnorm) to address language-specific scoring challenges. While imperfect, it helps:
- Convert numbers/dates to words
- Standardize compound word formatting
- Reduce punctuation-related mismatches

This preprocessing makes WER/CER calculations slightly fairer than raw scoring, though manual verification remains recommended. The tool is actively being refined based on validation set findings.
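As an illustration, scoring might look like the sketch below, using `jiwer` for WER/CER; the `trnorm` import path and `normalize` entry point are assumptions, so check that repo's README for the actual API:

```python
import jiwer
from trnorm import normalize  # hypothetical entry point; see the trnorm README

def score(references, hypotheses):
    # Normalize both sides so that "25" vs "yirmi beş" style mismatches
    # do not count as word errors.
    refs = [normalize(r) for r in references]
    hyps = [normalize(h) for h in hypotheses]
    return {"wer": jiwer.wer(refs, hyps), "cer": jiwer.cer(refs, hyps)}

print(score(["Bugün hava 25 derece."], ["bugün hava yirmi beş derece"]))
```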

## Training procedure

See the [fine-tuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing) for the full setup.

## Model description

A Turkish ASR adaptation of Phi-4-multimodal-instruct: encoder layers were unfrozen and the model was fine-tuned for one epoch on Turkish audio-text pairs, since the base model ships with no Turkish ASR support.

## Intended uses & limitations

Intended for Turkish speech transcription and as a baseline for Turkish ASR in resource-constrained scenarios. The model is still under active development: WER rises sharply on noisy real-world audio (see the benchmark table above), the CommonVoice score may be flattered by speaker leakage, and reported metrics depend on an evolving Turkish text normalization pipeline.

## Training and evaluation data

Fine-tuned on the datasets listed in the metadata (`ysdede/khanacademy-turkish`, `ysdede/khanacademy-turkish-math`, `ysdede/commonvoice_17_tr_fixed`) and evaluated on the seven sets in the benchmark table, two of which (`erenfazlioglu/turkishvoicedataset`, `ysdede/yeni-split-lq-noisy`) were used for benchmarking only. See the "Dataset Notes" section for per-dataset characteristics.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.99) and epsilon=1e-07; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
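Expressed as `transformers.TrainingArguments`, the run configuration would look roughly like this; a reconstruction from the list above, where `output_dir` and anything not listed are placeholders rather than values from the actual run:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi4-mm-asr-turkish",  # placeholder, not from the run
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```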

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0