---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- generated_from_trainer
model-index:
- name: Phi-4-mm-inst-asr-turkish-unf
  results: []
datasets:
- ysdede/khanacademy-turkish
- ysdede/khanacademy-turkish-math
- ysdede/commonvoice_17_tr_fixed
language:
- tr
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->
# Phi-4-mm-inst-asr-turkish-unf
This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
**Model Background**:
This benchmark evaluates a fine-tuned version of Microsoft's **Phi-4-multimodal-instruct**, a multimodal model not originally designed for Turkish ASR. Key points:
1. **Initial Limitations**:
   - No Turkish ASR support in the base model
   - Initial WER above 100%
2. **Fine-Tuning Process**:
   - Unfroze encoder layers for Turkish adaptation
   - Trained for 1 epoch on Turkish audio-text pairs
3. **Current Status**:
   - Reduced WER substantially (above 100% → 9.7% on CommonVoice)*
   - Still under active development for better generalization
   - Results shared as incremental progress documentation
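For readers wondering how a WER can start above 100%: assuming the standard definition is used here, with $S$ substitutions, $D$ deletions, and $I$ insertions measured against a reference of $N$ words,

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

The insertion count $I$ is not bounded by $N$, so a model that produces long off-target output for a short reference can score well above 100%.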
**Why This Matters**:
- Demonstrates adaptability of multimodal architectures
- Provides baseline for Turkish ASR in resource-constrained scenarios
- Encourages exploration of under-supported languages
* **Note on CommonVoice Results**:
  - CommonVoice's relatively low WER (9.7%) may benefit from:
    - Potential speaker leakage between splits (same speakers in train/test)
    - Clean audio conditions despite non-professional recordings
    - Short utterance structure (average 4-5 seconds)
  - See the "Dataset Notes" section below for full context on CommonVoice characteristics.
### Benchmark Results
**Testing Environment**: Google Colab with L4 GPU (24 GB VRAM)
| Model | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |
| :--------------------------------- | -------:| -------:| --------------------: | ----------:| --------------------:| -----------------:|
| ysdede/commonvoice_17_tr_fixed | 9.7 | 2.72 | x26 | 32 | 7.1 | 8,576 |
| erenfazlioglu/turkishvoicedataset | 11.52 | 3.93 | x20 | 16 | 7.8 | 2,496 |
| ysdede/khanacademy-turkish | 12.04 | 7.78 | x16 | 16 | 3.8 | 1,344 |
| ysdede/yeni-split-0 | 20.58 | 13.2 | x16 | 16 | 18 | 5,936 |
| ymoslem/MediaSpeech | 25.48 | 15.16 | x35 | 32 | 10 | 2,496 |
| dssnt1 | 27.23 | 9.6 | x12 | 16 | 2.5 | 1,200 |
| ysdede/yeni-split-lq-noisy | 39.4 | 27 | x19 | 16 | 12 | 3,440 |
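As a reading aid for the speed column, the sketch below converts an xRT figure into wall-clock time. It assumes xRT means audio duration divided by processing time (so x26 would be 26 times faster than real time); the card does not define the term, and `processing_minutes` is a hypothetical helper, not part of any library.

```python
def processing_minutes(audio_hours: float, speed_xrt: float) -> float:
    """Wall-clock minutes needed to transcribe `audio_hours` of audio.

    Assumes speed_xrt = audio duration / processing time.
    """
    if speed_xrt <= 0:
        raise ValueError("speed must be positive")
    return audio_hours * 60.0 / speed_xrt


# Values copied from the CommonVoice row of the table: 7.1 hours at x26.
print(f"{processing_minutes(7.1, 26):.1f} minutes")
```

Under that assumption, the CommonVoice run would correspond to roughly 16 minutes of processing on the L4 GPU.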
**Dataset Notes**:
- **Finetuning Datasets**:
  - `commonvoice_17_tr_fixed`: Crowd-sourced clean speech (not professional studio recordings) with shuffled splits, so there is potential **speaker leakage** (same speakers in train/test with different utterances)
  - `khanacademy-turkish`: Educational lectures with STEM vocabulary
  - `yeni-split-0`: Noisy real-world recordings
- **Benchmark-only Datasets**:
  - `turkishvoicedataset`: Synthetic TTS news (clean but artificial prosody)
  - `yeni-split-lq-noisy`: Challenging noisy samples with alignment errors
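The speaker-leakage concern can be checked with a simple set intersection. The sketch below is illustrative only: it assumes each row carries a speaker identifier under `client_id` (the column name used in Common Voice releases; whether `commonvoice_17_tr_fixed` keeps it is an assumption), and `speaker_overlap` is a hypothetical helper run here on toy rows.

```python
def speaker_overlap(train_rows, test_rows, key="client_id"):
    """Return (shared speaker ids, fraction of test rows spoken by them)."""
    train_speakers = {row[key] for row in train_rows}
    test_speakers = {row[key] for row in test_rows}
    shared = train_speakers & test_speakers
    if not test_rows:
        return shared, 0.0
    # Count test utterances whose speaker also appears in the training split.
    leaked = sum(1 for row in test_rows if row[key] in shared)
    return shared, leaked / len(test_rows)


# Toy data standing in for the real splits.
train = [{"client_id": "a"}, {"client_id": "b"}]
test = [{"client_id": "b"}, {"client_id": "c"}, {"client_id": "b"}, {"client_id": "d"}]
shared, fraction = speaker_overlap(train, test)
print(sorted(shared), fraction)
```

A non-zero fraction would suggest that the test WER partly reflects familiarity with specific voices; it would not by itself quantify the size of that effect.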
**Text Normalization Challenges**:
⚠️ Current WER/CER scores may be inflated due to:
1. Lack of standardized Turkish ASR text normalization pipeline
2. Case/punctuation inconsistencies in references
3. Agglutinative language morphology affecting word boundaries
**Evaluation Note**:
For Turkish ASR benchmarking, I developed a [text normalizer](https://github.com/ysdede/trnorm) to address language-specific scoring challenges. While imperfect, it helps:
- Convert numbers/dates to words
- Standardize compound word formatting
- Reduce punctuation-related mismatches
This preprocessing makes WER/CER calculations slightly fairer than raw scoring, though manual verification remains recommended. The tool is actively being refined based on validation set findings.
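To illustrate how normalization changes the score, here is a minimal, standard-library-only sketch. It is not the `trnorm` API and does not convert numbers or dates; `normalize_tr`, `edit_distance`, `wer`, and `cer` are hypothetical helpers written for this example. The one Turkish-specific step is casing: Python's `str.lower()` maps "I" to "i" and "İ" to "i" plus a combining dot, neither of which matches Turkish orthography.

```python
import re


def normalize_tr(text: str) -> str:
    """Lowercase with Turkish casing rules and strip punctuation (illustrative)."""
    # Handle dotless/dotted I before calling lower(): I -> ı, İ -> i.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # Drop apostrophes so suffixes stay attached ("istanbul'da" -> "istanbulda").
    text = re.sub(r"['’]", "", text)
    # Replace remaining punctuation with spaces, then collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


ref, hyp = "Bugün hava güzel.", "bugün hava güzel"
print("raw WER:", wer(ref, hyp))
print("normalized WER:", wer(normalize_tr(ref), normalize_tr(hyp)))
```

In this toy pair the hypothesis is a correct transcription, yet raw scoring counts two of three words as errors because of case and the trailing period; after normalization the WER is zero.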
**Performance Factors**:
- CommonVoice's relatively low WER (9.7%) likely benefits from:
  - High audio quality despite non-professional speakers
  - Potential speaker familiarity patterns (same speakers in both splits)
  - Short utterance structure (average 4-5 seconds)
## Training procedure
[Fine-tuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing)
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.99) and epsilon=1e-07 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
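To make the scheduler settings concrete, the following sketch computes the learning rate at a given step for linear warmup followed by cosine decay, using the `learning_rate` and `lr_scheduler_warmup_ratio` values listed above. The total step count is hypothetical (the card does not report it), `lr_at` is an illustrative helper, and the exact rounding of warmup steps inside the Trainer is assumed rather than confirmed.

```python
import math

BASE_LR = 1e-4       # learning_rate from the list above
WARMUP_RATIO = 0.1   # lr_scheduler_warmup_ratio from the list above


def lr_at(step: int, total_steps: int,
          base_lr: float = BASE_LR, warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at `step` for linear warmup then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; depends on dataset size and batch size
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

With a single epoch, about the first tenth of training is spent ramping up and the rate decays toward zero by the final step.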
### Framework versions
- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0