---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- generated_from_trainer
model-index:
- name: Phi-4-mm-inst-asr-turkish-unf
  results: []
datasets:
- ysdede/khanacademy-turkish
- ysdede/khanacademy-turkish-math
- ysdede/commonvoice_17_tr_fixed
language:
- tr
---

# Phi-4-mm-inst-asr-turkish-unf

This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

**Model Background**:  
This benchmark evaluates a fine-tuned version of Microsoft's **Phi-4-mm-instruct**, a multimodal model not originally designed for Turkish ASR. Key points:  

1. **Initial Limitations**:  
   - The base model has no Turkish ASR support  
   - Initial WER was above 100%  

2. **Fine-Tuning Process**:  
   - Unfroze encoder layers for Turkish adaptation  
   - Trained for 1 epoch on Turkish audio-text pairs  

3. **Current Status**:  
   - Achieved a significant WER reduction (above 100% → 9.7% on CommonVoice)*  
   - Still under active development for better generalization  
   - Results are shared as incremental progress documentation  

**Why This Matters**:  
- Demonstrates adaptability of multimodal architectures  
- Provides baseline for Turkish ASR in resource-constrained scenarios  
- Encourages exploration of under-supported languages  

* **Note on CommonVoice Results**:  
   - CommonVoice's relatively low WER (9.7%) may benefit from:  
     - Potential speaker leakage between splits (same speakers in train/test)  
     - Clean audio conditions despite non-professional recordings  
     - Short utterance structure (average 4-5 seconds)  
   - See the **Dataset Notes** section below for full context on CommonVoice characteristics.

### Benchmark Results

**Testing Environment**: Google Colab with L4 GPU (24 GB VRAM)  

| Model                              | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |  
| :--------------------------------- | -------:| -------:| --------------------: | ----------:| --------------------:| -----------------:|  
| ysdede/commonvoice_17_tr_fixed     | 9.7     | 2.72    | x26                  | 32         | 7.1                  | 8,576             |  
| erenfazlioglu/turkishvoicedataset  | 11.52   | 3.93    | x20                  | 16         | 7.8                  | 2,496             |  
| ysdede/khanacademy-turkish         | 12.04   | 7.78    | x16                  | 16         | 3.8                  | 1,344             |  
| ysdede/yeni-split-0                | 20.58   | 13.2    | x16                  | 16         | 18                   | 5,936             |  
| ymoslem/MediaSpeech                | 25.48   | 15.16   | x35                  | 32         | 10                   | 2,496             |  
| dssnt1                             | 27.23   | 9.6     | x12                  | 16         | 2.5                  | 1,200             |  
| ysdede/yeni-split-lq-noisy         | 39.4    | 27      | x19                  | 16         | 12                   | 3,440             |  
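
Reading the xRT column as a real-time speed multiple (hours of audio transcribed per hour of wall-clock time, which is my interpretation rather than something stated in the table), the wall-clock cost of each run can be estimated with a small sketch:

```python
def wall_clock_hours(audio_hours: float, xrt: float) -> float:
    """Estimate wall-clock transcription time, assuming
    xRT = audio duration / processing time."""
    return audio_hours / xrt

# CommonVoice row: 7.1 hours of audio at x26
# comes out to roughly 16.4 minutes of wall-clock time
print(round(wall_clock_hours(7.1, 26) * 60, 1))
```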

**Dataset Notes**:  
- **Finetuning Datasets**:  
  - `commonvoice_17_tr_fixed`: Crowd-sourced clean speech (not professional studio recordings) with shuffled splits, so there is potential **speaker leakage** (the same speakers appear in train and test with different utterances)  
  - `khanacademy-turkish`: Educational lectures with STEM vocabulary  
  - `yeni-split-0`: Noisy real-world recordings  

- **Benchmark-only Datasets**:  
  - `turkishvoicedataset`: Synthetic TTS news (clean but artificial prosody)  
  - `yeni-split-lq-noisy`: Challenging noisy samples with alignment errors  

**Text Normalization Challenges**:  
⚠️ Current WER/CER scores may be inflated due to:  
1. Lack of standardized Turkish ASR text normalization pipeline  
2. Case/punctuation inconsistencies in references  
3. Agglutinative language morphology affecting word boundaries  

**Evaluation Note**:  
For Turkish ASR benchmarking, I developed a [text normalizer](https://github.com/ysdede/trnorm) to address language-specific scoring challenges. While imperfect, it helps:
- Convert numbers/dates to words  
- Standardize compound word formatting  
- Reduce punctuation-related mismatches  

This preprocessing makes WER/CER calculations slightly fairer compared to raw scoring, though manual verification remains recommended. The tool is actively being refined based on validation set findings.
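
As a rough illustration of why normalization matters for scoring (a self-contained sketch, not the actual `trnorm` implementation), lowercasing and stripping punctuation before computing WER removes mismatches that are orthographic rather than acoustic:

```python
import re

def normalize(text: str) -> str:
    """Minimal normalization: lowercase and strip punctuation.
    Note: Python's lower() mishandles Turkish dotless I ('I' -> 'i'
    instead of 'ı'); a real pipeline needs locale-aware casing."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "Bugün hava çok güzel."
hyp = "bugün hava çok güzel"
print(wer(ref, hyp))                        # 0.5: case/punctuation count as errors
print(wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization
```

The same transcription scores 50% WER raw and 0% after normalization, which is the kind of inflation the normalizer is meant to correct.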



## Training procedure
[finetuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing)

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.99), epsilon=1e-07, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
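
The hyperparameters above map roughly onto `transformers.TrainingArguments` as follows (a sketch only: `output_dir` is a placeholder, and the audio collator/Trainer wiring from the actual notebook is omitted):

```python
from transformers import TrainingArguments

# Sketch of the reported run configuration; output_dir is a placeholder.
args = TrainingArguments(
    output_dir="phi-4-mm-inst-asr-turkish-unf",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```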


### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0