---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- generated_from_trainer
model-index:
- name: Phi-4-mm-inst-asr-turkish-unf
  results: []
datasets:
- ysdede/khanacademy-turkish
- ysdede/khanacademy-turkish-math
- ysdede/commonvoice_17_tr_fixed
language:
- tr
---


# Phi-4-mm-inst-asr-turkish-unf

This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

**Model Background**:  
This model card reports benchmarks for a fine-tuned version of Microsoft's **Phi-4-multimodal-instruct**, a multimodal model that was not originally trained for Turkish ASR. Key points:  

1. **Initial Limitations**:  
   - No Turkish ASR support in the base model  
   - Initial WER above 100% (possible because inserted words also count as errors)  

2. **Fine-Tuning Process**:  
   - Unfroze encoder layers for Turkish adaptation  
   - Trained for 1 epoch on Turkish audio-text pairs  

3. **Current Status**:  
   - Reduced WER from above 100% to 9.7% on CommonVoice*  
   - Still under active development to improve generalization  
   - Results are shared as a record of incremental progress  

**Why This Matters**:  
- Demonstrates adaptability of multimodal architectures  
- Provides baseline for Turkish ASR in resource-constrained scenarios  
- Encourages exploration of under-supported languages  

* **Note on CommonVoice Results**:  
   - The relatively low WER on CommonVoice (9.7%) may be helped by:  
     - Potential speaker leakage between splits (same speakers in train/test)  
     - Clean audio conditions despite non-professional recordings  
     - Short utterances (4-5 seconds on average)  
   - See the "Dataset Notes" section below for more context on CommonVoice characteristics.

### Benchmark Results

**Testing Environment**: Google Colab with L4 GPU (24 GB VRAM)  

| Dataset                            | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |  
| :--------------------------------- | -------:| -------:| --------------------: | ----------:| --------------------:| -----------------:|  
| ysdede/commonvoice_17_tr_fixed     | 9.7     | 2.72    | x26                  | 32         | 7.1                  | 8,576             |  
| erenfazlioglu/turkishvoicedataset  | 11.52   | 3.93    | x20                  | 16         | 7.8                  | 2,496             |  
| ysdede/khanacademy-turkish         | 12.04   | 7.78    | x16                  | 16         | 3.8                  | 1,344             |  
| ysdede/yeni-split-0                | 20.58   | 13.2    | x16                  | 16         | 18                   | 5,936             |  
| ymoslem/MediaSpeech                | 25.48   | 15.16   | x35                  | 32         | 10                   | 2,496             |  
| dssnt1                             | 27.23   | 9.6     | x12                  | 16         | 2.5                  | 1,200             |  
| ysdede/yeni-split-lq-noisy         | 39.4    | 27      | x19                  | 16         | 12                   | 3,440             |  
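
To put the speed column in perspective, here is a small sketch that converts it into wall-clock time. It assumes that xRT means "audio duration divided by processing time", which is the usual reading but is not spelled out in this card. The helper name `processing_minutes` is made up for illustration.

```python
def processing_minutes(audio_hours: float, xrt: float) -> float:
    """Approximate wall-clock minutes to transcribe `audio_hours` of audio.

    Assumption: xRT = audio duration / processing time.
    """
    return audio_hours * 60.0 / xrt


# (audio hours, xRT) pairs copied from the table above
runs = {
    "ysdede/commonvoice_17_tr_fixed": (7.1, 26),
    "ymoslem/MediaSpeech": (10, 35),
    "dssnt1": (2.5, 12),
}

for name, (hours, xrt) in runs.items():
    print(f"{name}: ~{processing_minutes(hours, xrt):.1f} min")
```

Under that assumption, the 7.1 hours of CommonVoice audio take roughly 16 minutes on the L4 GPU.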

**Dataset Notes**:  
- **Finetuning Datasets**:  
  - `commonvoice_17_tr_fixed`: Crowd-sourced clean speech (not professional studio recordings) with shuffled splits, so there is potential **speaker leakage** (the same speakers appear in train and test with different utterances)  
  - `khanacademy-turkish`: Educational lectures with STEM vocabulary  
  - `yeni-split-0`: Noisy real-world recordings  

- **Benchmark-only Datasets**:  
  - `turkishvoicedataset`: Synthetic TTS news (clean but artificial prosody)  
  - `yeni-split-lq-noisy`: Challenging noisy samples with alignment errors  
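
The speaker leakage mentioned for CommonVoice can be checked by comparing speaker IDs across splits. The sketch below is a minimal illustration on toy rows. It assumes each row carries a `client_id` field, as in the upstream Common Voice metadata; whether `commonvoice_17_tr_fixed` keeps that column has not been verified here, and `speaker_overlap` is a hypothetical helper.

```python
def speaker_overlap(train_rows, test_rows, key="client_id"):
    """Return the speakers shared by both splits and their share of test speakers.

    Assumption: every row is a dict-like record with a speaker ID under `key`.
    """
    train_speakers = {row[key] for row in train_rows}
    test_speakers = {row[key] for row in test_rows}
    shared = train_speakers & test_speakers
    share = len(shared) / len(test_speakers) if test_speakers else 0.0
    return shared, share


# Toy rows standing in for real dataset records
train = [{"client_id": "spk_a"}, {"client_id": "spk_b"}, {"client_id": "spk_a"}]
test = [{"client_id": "spk_b"}, {"client_id": "spk_c"}]

shared, share = speaker_overlap(train, test)
print(f"{len(shared)} shared speaker(s), {share:.0%} of test speakers")
```

A non-zero share means the test WER partly measures how well the model knows specific voices rather than how well it generalizes to new speakers.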

**Text Normalization Challenges**:  
⚠️ Current WER/CER scores may be inflated due to:  
1. Lack of a standardized Turkish ASR text normalization pipeline  
2. Case/punctuation inconsistencies in references  
3. Agglutinative language morphology affecting word boundaries  

**Evaluation Note**:  
For Turkish ASR benchmarking, I developed a [text normalizer](https://github.com/ysdede/trnorm) to address language-specific scoring challenges. While imperfect, it helps:
- Convert numbers/dates to words  
- Standardize compound word formatting  
- Reduce punctuation-related mismatches  

This preprocessing makes WER/CER calculations somewhat fairer than raw scoring, though manual verification is still recommended. The tool is being refined based on validation set findings.
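
To show why normalization changes the score, here is a self-contained sketch of WER/CER with a deliberately simple normalizer. It is not the `trnorm` API and does not reproduce its rules (no number or date expansion); the functions `tr_lower`, `normalize`, `wer`, and `cer` are illustrative names only. It handles Turkish casing and strips ASCII punctuation.

```python
import string


def tr_lower(text: str) -> str:
    # Turkish casing: "I" lowercases to dotless "ı", and "İ" to "i".
    # Replace them first, because str.lower() is not Turkish-aware.
    return text.replace("I", "ı").replace("İ", "i").lower()


def normalize(text: str) -> str:
    # Lowercase, drop ASCII punctuation, collapse whitespace
    text = tr_lower(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def edit_distance(ref, hyp) -> int:
    # Levenshtein distance over two sequences (lists of words or strings)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    ref_words = normalize(ref).split()
    return edit_distance(ref_words, normalize(hyp).split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    ref_chars = normalize(ref)  # spaces count as characters in this sketch
    return edit_distance(ref_chars, normalize(hyp)) / len(ref_chars)


ref = "Işık hızı, saniyede 300 bin kilometredir."
hyp = "ışık hızı saniyede 300 bin kilometredir"

raw_wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())
print(f"raw WER: {raw_wer:.2f}, normalized WER: {wer(ref, hyp):.2f}")
```

In this example the hypothesis is correct apart from casing and punctuation, yet raw scoring counts three of six words as errors. The same functions also show how WER can exceed 100%: a one-word reference scored against a four-word hypothesis yields one substitution plus three insertions.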

**Performance Factors**:  
- The relatively low WER on CommonVoice (9.7%) likely benefits from:  
  - High audio quality despite non-professional speakers  
  - Possible speaker overlap (the same speakers appear in both splits)  
  - Short utterance structure (average 4-5 seconds)  


## Training procedure
The training code is available in the [fine-tuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing).

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH`) with betas=(0.9, 0.99) and epsilon=1e-07, no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
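
For intuition about `lr_scheduler_type: cosine` with `lr_scheduler_warmup_ratio: 0.1`, the sketch below computes the learning rate at a given step in plain Python. It follows the usual linear-warmup-then-cosine-decay shape rather than calling Transformers itself, and the total of 1,000 steps is a made-up figure since the real step count is not listed in this card.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 1e-4, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; not taken from the actual training run

for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {lr_at(step, TOTAL_STEPS):.2e}")
```

With these values the rate peaks at 1e-4 once warmup ends (step 100 of 1,000) and falls to half of that midway through the decay phase.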


### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0