---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- generated_from_trainer
model-index:
- name: Phi-4-mm-inst-asr-turkish-unf
results: []
datasets:
- ysdede/khanacademy-turkish
- ysdede/khanacademy-turkish-math
- ysdede/commonvoice_17_tr_fixed
language:
- tr
---
# Phi-4-mm-inst-asr-turkish-unf
This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
**Model Background**:
This card documents and benchmarks a fine-tuned version of Microsoft's **Phi-4-mm-instruct**, a multimodal model not originally designed for Turkish ASR. Key points:
1. **Initial Limitations**:
   - No Turkish ASR support in the base model
   - Initial WER above 100%
2. **Fine-Tuning Process**:
   - Unfroze the audio encoder layers for Turkish adaptation (see the sketch after this list)
   - Trained for 1 epoch on Turkish audio-text pairs
3. **Current Status**:
   - Achieved a significant WER reduction (>100% → 9.7% on CommonVoice)*
   - Still under active development for better generalization
   - Results shared as incremental progress documentation
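
A minimal sketch of the unfreezing step, assuming the audio-encoder parameters can be located by name; the exact module paths in Phi-4-multimodal-instruct should be verified with `model.named_parameters()`:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    torch_dtype="auto",
)

# Freeze everything first, then re-enable gradients for audio-encoder weights.
for param in model.parameters():
    param.requires_grad = False

unfrozen = 0
for name, param in model.named_parameters():
    # Heuristic name match (an assumption); inspect named_parameters() for real paths.
    if "audio" in name.lower() and "encoder" in name.lower():
        param.requires_grad = True
        unfrozen += param.numel()

print(f"Unfrozen parameters: {unfrozen:,}")
```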
**Why This Matters**:
- Demonstrates adaptability of multimodal architectures
- Provides baseline for Turkish ASR in resource-constrained scenarios
- Encourages exploration of under-supported languages
\* **Note on CommonVoice Results**:
- CommonVoice's relatively low WER (9.7%) may benefit from:
  - Potential speaker leakage between splits (same speakers in train/test)
  - Clean audio conditions despite non-professional recordings
  - Short utterance structure (average 4-5 seconds)
- See the **Dataset Notes** section below for full context on CommonVoice characteristics.
### Benchmark Results
**Testing Environment**: Google Colab with L4 GPU (24 GB VRAM)
| Dataset | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |
| :--------------------------------- | -------:| -------:| --------------------: | ----------:| --------------------:| -----------------:|
| ysdede/commonvoice_17_tr_fixed | 9.7 | 2.72 | x26 | 32 | 7.1 | 8,576 |
| erenfazlioglu/turkishvoicedataset | 11.52 | 3.93 | x20 | 16 | 7.8 | 2,496 |
| ysdede/khanacademy-turkish | 12.04 | 7.78 | x16 | 16 | 3.8 | 1,344 |
| ysdede/yeni-split-0 | 20.58 | 13.2 | x16 | 16 | 18 | 5,936 |
| ymoslem/MediaSpeech | 25.48 | 15.16 | x35 | 32 | 10 | 2,496 |
| dssnt1 | 27.23 | 9.6 | x12 | 16 | 2.5 | 1,200 |
| ysdede/yeni-split-lq-noisy | 39.4 | 27 | x19 | 16 | 12 | 3,440 |
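
The WER/CER columns depend on the scoring implementation. A minimal sketch with the Hugging Face `evaluate` package (an assumed tool; the card does not state which scorer produced the table):

```python
# pip install evaluate jiwer
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["merhaba dünya", "bugün hava çok güzel"]   # ground-truth transcripts
predictions = ["merhaba dünya", "bugün hava cok güzel"]  # model outputs

print("WER:", wer_metric.compute(references=references, predictions=predictions))
print("CER:", cer_metric.compute(references=references, predictions=predictions))
```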
**Dataset Notes**:
- **Finetuning Datasets**:
  - `commonvoice_17_tr_fixed`: Crowd-sourced clean speech (not professional studio recordings) with shuffled splits, so there is potential **speaker leakage** (same speakers in train/test with different utterances)
  - `khanacademy-turkish`: Educational lectures with STEM vocabulary
  - `yeni-split-0`: Noisy real-world recordings
- **Benchmark-only Datasets**:
  - `turkishvoicedataset`: Synthetic TTS news (clean but artificial prosody)
  - `yeni-split-lq-noisy`: Challenging noisy samples with alignment errors
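
All of these sets are hosted on the Hub, so they can be pulled directly with `datasets`; the split name below is an assumption, so check each dataset card for the actual configurations:

```python
from datasets import load_dataset

# Split name is an assumption; see the dataset card for available splits.
ds = load_dataset("ysdede/commonvoice_17_tr_fixed", split="test")
print(ds)  # column names (audio, transcript, ...) vary per dataset
```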
**Text Normalization Challenges**:
⚠️ Current WER/CER scores may be inflated due to:
1. Lack of standardized Turkish ASR text normalization pipeline
2. Case/punctuation inconsistencies in references
3. Agglutinative language morphology affecting word boundaries
**Evaluation Note**:
For Turkish ASR benchmarking, I developed a [text normalizer](https://github.com/ysdede/trnorm) to address language-specific scoring challenges. While imperfect, it helps:
- Convert numbers/dates to words
- Standardize compound word formatting
- Reduce punctuation-related mismatches
This preprocessing makes WER/CER calculations slightly fairer compared to raw scoring, though manual verification remains recommended. The tool is actively being refined based on validation set findings.
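
To keep this card self-contained without guessing at the `trnorm` API, the sketch below illustrates the same ideas in plain Python, Turkish-aware lowercasing plus punctuation stripping; it is a stand-in, not the actual `trnorm` interface:

```python
import re

def normalize_tr(text: str) -> str:
    # Turkish dotted/dotless I must be handled explicitly: str.lower() maps
    # "I" -> "i", but Turkish expects "I" -> "ı" and "İ" -> "i".
    text = text.replace("I", "ı").replace("İ", "i").lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tr("Merhaba, İstanbul!"))  # -> "merhaba istanbul"
```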
## Training procedure
[finetuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing)
## Model description
A Turkish ASR adaptation of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct): the audio encoder layers were unfrozen and the model was fine-tuned for one epoch on Turkish audio-text pairs.
## Intended uses & limitations
Intended for Turkish speech-to-text transcription and as a baseline for Turkish ASR in resource-constrained scenarios. The model is still under active development: generalization to noisy or out-of-domain audio is limited (see the benchmark table above), and reported WER/CER depend on text normalization choices.
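
Transcription roughly follows the base model's published usage pattern; in this sketch the repo id, prompt template, and processor arguments are assumptions to verify against the base model card:

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "ysdede/Phi-4-mm-inst-asr-turkish-unf"  # repo id assumed from the card title
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)

audio, sr = sf.read("sample_tr.wav")
prompt = "<|user|><|audio_1|>Transcribe the audio to text.<|end|><|assistant|>"
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to("cuda")

generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]  # strip prompt tokens
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```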
## Training and evaluation data
Fine-tuned on the Turkish datasets listed in the metadata and described under "Dataset Notes" above; evaluated on the seven benchmark sets in the results table.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.99) and epsilon=1e-07; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
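
For reference, these settings map onto `transformers.TrainingArguments` roughly as follows; this is a sketch, and `output_dir` plus any arguments not listed above are placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-4-mm-inst-asr-turkish-unf",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```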
### Framework versions
- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0