	---
	license: mit
	language:
	- ur
	---


	# Urdu Whisper model: a from-scratch PyTorch implementation

	A small Whisper-style speech recognition model trained on Urdu.

	Based on the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf).

	## ModelArgs Hyperparameters

	| Parameter | Value | Description |
	|---|---|---|
	| `batch_size` | 128 | Number of samples processed before the model is updated. |
	| `max_lr` | 1.5e-3 | Maximum learning rate. |
	| `dropout` | 0.1 | Dropout rate for regularization. |
	| `epochs` | 2 | Number of training epochs. |
	| `block_size` | 64 | Sequence length (number of tokens or time steps). |
	| `tgt_vocab_size` | 200024 | Size of the target vocabulary. |
	| `embeddings_dims` | 512 | Dimensionality of token embeddings. |
	| `attn_dropout` | 0.1 | Dropout rate for attention layers. |
	| `no_of_heads` | 4 | Number of attention heads in multi-head attention. |
	| `no_of_decoder_layers` | 6 | Number of decoder layers in the model. |
	| `weight_decay_optim` | 0.1 | Weight decay for the optimizer. |
	| `log_mel_features` | 80 | Number of Mel spectrogram features. |
	| `kernel_size` | 3 | Kernel size for convolutional layers. |
	| `stride` | 2 | Stride for convolutional layers. |
	| `sr` | 16000 | Sampling rate of the audio (Hz). |
	| `device` | `'cuda:0'` | Device to run the model on (e.g., GPU). |
	| `SAMPLING_RATE` | 16000 | Sampling rate of the audio (Hz). |
	| `N_MELS` | 80 | Number of Mel bins in the spectrogram. |
	| `WINDOW_DURATION` | 0.025 | Duration of the analysis window in seconds (25 ms). |
	| `STRIDE_DURATION` | 0.010 | Stride between consecutive windows in seconds (10 ms). |
	| `max_t` | 500 | Maximum time steps in the spectrogram. |
	| `n_channels` | 80 | Number of channels in the input spectrogram. |
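
As a rough illustration, the table can be written as a Python dataclass. The field names and values are taken from the table; the helper properties (`win_length`, `hop_length`, `head_dim`) and the method `conv_out_len` are hypothetical additions for this sketch, and the default `padding=1` is an assumption, since the padding used by the model's convolutions is not stated here.

```python
from dataclasses import dataclass


@dataclass
class ModelArgs:
    # Values copied from the hyperparameter table.
    batch_size: int = 128
    max_lr: float = 1.5e-3
    dropout: float = 0.1
    epochs: int = 2
    block_size: int = 64
    tgt_vocab_size: int = 200024
    embeddings_dims: int = 512
    attn_dropout: float = 0.1
    no_of_heads: int = 4
    no_of_decoder_layers: int = 6
    weight_decay_optim: float = 0.1
    log_mel_features: int = 80
    kernel_size: int = 3
    stride: int = 2
    sr: int = 16000
    device: str = "cuda:0"
    SAMPLING_RATE: int = 16000
    N_MELS: int = 80
    WINDOW_DURATION: float = 0.025
    STRIDE_DURATION: float = 0.010
    max_t: int = 500
    n_channels: int = 80

    # Everything below is an illustrative helper, not part of the table.
    @property
    def win_length(self) -> int:
        """Analysis window length in samples."""
        return round(self.WINDOW_DURATION * self.SAMPLING_RATE)

    @property
    def hop_length(self) -> int:
        """Hop between consecutive windows in samples."""
        return round(self.STRIDE_DURATION * self.SAMPLING_RATE)

    @property
    def head_dim(self) -> int:
        """Per-head width when the embedding is split evenly across heads."""
        return self.embeddings_dims // self.no_of_heads

    def conv_out_len(self, n_frames: int, padding: int = 1) -> int:
        """Output length of one 1-D convolution (dilation 1).

        The padding value is an assumption made for this sketch.
        """
        return (n_frames + 2 * padding - self.kernel_size) // self.stride + 1


args = ModelArgs()
print(args.win_length, args.hop_length, args.head_dim, args.conv_out_len(args.max_t))
# prints: 400 160 128 250
```

With these values, a 25 ms window at 16 kHz is 400 samples, a 10 ms stride is 160 samples, and each of the 4 attention heads gets 128 of the 512 embedding dimensions. Under the assumed padding, a single stride-2 convolution would halve 500 spectrogram frames to 250.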

	### Dataset

	[Common Voice Corpus 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)

	The 'xs' snapshot was used.

	### Frameworks
	PyTorch


	### Epochs/Steps
	Training epochs: 2

	Validation: run once per epoch
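
The schedule above (two training epochs, validation after each) can be sketched as a plain loop. This is a minimal outline, not the training script used for this model: `run_training`, `train_one_epoch`, and `evaluate` are hypothetical names, and the lambdas at the bottom are stand-ins that return made-up numbers so the example runs without a model or data.

```python
from typing import Callable, List, Tuple


def run_training(
    epochs: int,
    train_one_epoch: Callable[[int], float],
    evaluate: Callable[[int], float],
) -> List[Tuple[int, float, float]]:
    """Train for `epochs` epochs and validate once at the end of each."""
    history = []
    for epoch in range(1, epochs + 1):
        train_loss = train_one_epoch(epoch)  # one full pass over the training set
        val_loss = evaluate(epoch)           # one validation pass per epoch
        history.append((epoch, train_loss, val_loss))
    return history


# Stand-in callables with placeholder losses, only to show the call shape.
history = run_training(2, lambda e: 1.0 / e, lambda e: 1.5 / e)
for epoch, train_loss, val_loss in history:
    print(f"epoch {epoch}: train={train_loss:.2f} val={val_loss:.2f}")
```

In a real run, the two callables would wrap the PyTorch forward/backward pass and the evaluation pass, and `history` would hold the values plotted in the loss curves.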


	### Loss Curves

	![Train and Val loss curves](image/loss.png)
