---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- generated_from_trainer
model-index:
- name: Phi-4-mm-inst-asr-turkish-unf
results: []
datasets:
- ysdede/khanacademy-turkish
- ysdede/khanacademy-turkish-math
- ysdede/commonvoice_17_tr_fixed
language:
- tr
---
# Phi-4-mm-inst-asr-turkish-unf
This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
**Model Background**:
This card documents and benchmarks a fine-tuned version of Microsoft's **Phi-4-mm-instruct**, a multimodal model not originally designed for Turkish ASR. Key points:
1. **Initial Limitations**:
   - No Turkish ASR support in the base model
   - Initial WER above 100%
2. **Fine-Tuning Process**:
   - Unfroze the audio encoder layers for Turkish adaptation (see the sketch after this list)
   - Trained for 1 epoch on Turkish audio-text pairs
3. **Current Status**:
   - Achieved a significant WER reduction (>100% → 9.7% on CommonVoice)*
   - Still under active development for better generalization
   - Results shared as incremental progress documentation
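
A minimal sketch of the unfreezing step, assuming the audio-encoder parameters can be located by name; the exact module paths in Phi-4-multimodal-instruct should be verified with `model.named_parameters()`:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    torch_dtype="auto",
)

# Freeze everything first, then re-enable gradients for audio-encoder weights.
for param in model.parameters():
    param.requires_grad = False

unfrozen = 0
for name, param in model.named_parameters():
    # Heuristic name match (an assumption); inspect named_parameters() for real paths.
    if "audio" in name.lower() and "encoder" in name.lower():
        param.requires_grad = True
        unfrozen += param.numel()

print(f"Unfrozen parameters: {unfrozen:,}")
```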
**Why This Matters**:
- Demonstrates adaptability of multimodal architectures
- Provides baseline for Turkish ASR in resource-constrained scenarios
- Encourages exploration of under-supported languages
\* **Note on CommonVoice Results**:
- CommonVoice's relatively low WER (9.7%) may benefit from:
  - Potential speaker leakage between splits (same speakers in train/test)
  - Clean audio conditions despite non-professional recordings
  - Short utterance structure (average 4-5 seconds)
- See the **Dataset Notes** section below for full context on CommonVoice characteristics.
### Benchmark Results
**Testing Environment**: Google Colab with L4 GPU (24 GB VRAM)
| Dataset | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |
| :--------------------------------- | -------:| -------:| --------------------: | ----------:| --------------------:| -----------------:|
| ysdede/commonvoice_17_tr_fixed | 9.7 | 2.72 | x26 | 32 | 7.1 | 8,576 |
| erenfazlioglu/turkishvoicedataset | 11.52 | 3.93 | x20 | 16 | 7.8 | 2,496 |
| ysdede/khanacademy-turkish | 12.04 | 7.78 | x16 | 16 | 3.8 | 1,344 |
| ysdede/yeni-split-0 | 20.58 | 13.2 | x16 | 16 | 18 | 5,936 |
| ymoslem/MediaSpeech | 25.48 | 15.16 | x35 | 32 | 10 | 2,496 |
| dssnt1 | 27.23 | 9.6 | x12 | 16 | 2.5 | 1,200 |
| ysdede/yeni-split-lq-noisy | 39.4 | 27 | x19 | 16 | 12 | 3,440 |
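
The WER/CER columns depend on the scoring implementation. A minimal sketch with the Hugging Face `evaluate` package (an assumed tool; the card does not state which scorer produced the table):

```python
# pip install evaluate jiwer
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["merhaba dünya", "bugün hava çok güzel"]   # ground-truth transcripts
predictions = ["merhaba dünya", "bugün hava cok güzel"]  # model outputs

print("WER:", wer_metric.compute(references=references, predictions=predictions))
print("CER:", cer_metric.compute(references=references, predictions=predictions))
```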
**Dataset Notes**:
- **Finetuning Datasets**:
  - `commonvoice_17_tr_fixed`: Crowd-sourced clean speech (not professional studio recordings) with shuffled splits, so there is potential **speaker leakage** (same speakers in train/test with different utterances)
  - `khanacademy-turkish`: Educational lectures with STEM vocabulary
  - `yeni-split-0`: Noisy real-world recordings
- **Benchmark-only Datasets**:
  - `turkishvoicedataset`: Synthetic TTS news (clean but artificial prosody)
  - `yeni-split-lq-noisy`: Challenging noisy samples with alignment errors
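
All of these sets are hosted on the Hub, so they can be pulled directly with `datasets`; the split name below is an assumption, so check each dataset card for the actual configurations:

```python
from datasets import load_dataset

# Split name is an assumption; see the dataset card for available splits.
ds = load_dataset("ysdede/commonvoice_17_tr_fixed", split="test")
print(ds)  # column names (audio, transcript, ...) vary per dataset
```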
**Text Normalization Challenges**:
⚠️ Current WER/CER scores may be inflated due to:
1. Lack of standardized Turkish ASR text normalization pipeline
2. Case/punctuation inconsistencies in references
3. Agglutinative language morphology affecting word boundaries
**Evaluation Note**:
For Turkish ASR benchmarking, I developed a [text normalizer](https://github.com/ysdede/trnorm) to address language-specific scoring challenges. While imperfect, it helps:
- Convert numbers/dates to words
- Standardize compound word formatting
- Reduce punctuation-related mismatches
This preprocessing makes WER/CER calculations slightly fairer compared to raw scoring, though manual verification remains recommended. The tool is actively being refined based on validation set findings.
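
To keep this card self-contained without guessing at the `trnorm` API, the sketch below illustrates the same ideas in plain Python, Turkish-aware lowercasing plus punctuation stripping; it is a stand-in, not the actual `trnorm` interface:

```python
import re

def normalize_tr(text: str) -> str:
    # Turkish dotted/dotless I must be handled explicitly: str.lower() maps
    # "I" -> "i", but Turkish expects "I" -> "ı" and "İ" -> "i".
    text = text.replace("I", "ı").replace("İ", "i").lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tr("Merhaba, İstanbul!"))  # -> "merhaba istanbul"
```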
## Training procedure
[finetuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing)
## Model description
A Turkish ASR adaptation of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct): the audio encoder layers were unfrozen and the model was fine-tuned for one epoch on Turkish audio-text pairs.
## Intended uses & limitations
Intended for Turkish speech-to-text transcription and as a baseline for Turkish ASR in resource-constrained scenarios. The model is still under active development: generalization to noisy or out-of-domain audio is limited (see the benchmark table above), and reported WER/CER depend on text normalization choices.
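
Transcription roughly follows the base model's published usage pattern; in this sketch the repo id, prompt template, and processor arguments are assumptions to verify against the base model card:

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "ysdede/Phi-4-mm-inst-asr-turkish-unf"  # repo id assumed from the card title
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)

audio, sr = sf.read("sample_tr.wav")
prompt = "<|user|><|audio_1|>Transcribe the audio to text.<|end|><|assistant|>"
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to("cuda")

generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]  # strip prompt tokens
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```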
## Training and evaluation data
Fine-tuned on the Turkish datasets listed in the metadata and described under "Dataset Notes" above; evaluated on the seven benchmark sets in the results table.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.99) and epsilon=1e-07; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
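
For reference, these settings map onto `transformers.TrainingArguments` roughly as follows; this is a sketch, and `output_dir` plus any arguments not listed above are placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-4-mm-inst-asr-turkish-unf",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```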
### Framework versions
- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0