# OpenAI Whisper-Base Fine-Tuned Model for AI-transcriptionist
This repository hosts a fine-tuned version of the OpenAI Whisper-Base model, optimized for AI-transcriptionist tasks using the [Mozilla Common Voice 11.0](https://commonvoice.mozilla.org/) dataset. The model is designed to transcribe speech to text efficiently while maintaining high accuracy.
## Model Details
- **Model Architecture**: OpenAI Whisper-Base
- **Task**: AI-transcriptionist
- **Dataset**: [Mozilla Common Voice 11.0](https://commonvoice.mozilla.org/)
- **Fine-tuning Framework**: Hugging Face Transformers
## πŸš€ Usage
### Installation
```bash
pip install transformers torch torchaudio
```
### Loading the Model
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "AventIQ-AI/whisper-AI-transcriptionist"
model = WhisperForConditionalGeneration.from_pretrained(model_name).to(device)
processor = WhisperProcessor.from_pretrained(model_name)
```
### Speech-to-Text Inference
```python
import torchaudio
# Load and process audio file
def load_audio(file_path, target_sampling_rate=16000):
    # Load the audio file
    waveform, sample_rate = torchaudio.load(file_path)
    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Resample if needed
    if sample_rate != target_sampling_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sampling_rate)(waveform)
    return waveform.squeeze(0).numpy()
input_audio_path = "/kaggle/input/test-data-2/Friday 4h04m pm.m4a" # Change this to your audio file
audio_array = load_audio(input_audio_path)
input_features = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
with torch.no_grad():
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
# Decode output
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcribed Text: {transcription}")
```
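Whisper processes audio in 30-second windows, so a single `generate` call like the one above only covers the first 30 seconds of a longer recording. Below is a minimal, hedged sketch (not part of the original card) of chunked transcription for longer files using the Transformers `pipeline` API; the file path is a placeholder, and ffmpeg must be available for decoding.

```python
import torch
from transformers import pipeline

# Hypothetical sketch: chunked transcription for recordings longer than 30 s.
# The model id is the one quoted earlier in this card.
asr = pipeline(
    "automatic-speech-recognition",
    model="AventIQ-AI/whisper-AI-transcriptionist",
    chunk_length_s=30,  # Whisper's native window size
    device=0 if torch.cuda.is_available() else -1,
)

# The pipeline accepts a file path or a NumPy array; ffmpeg handles decoding.
result = asr(
    "path/to/long_recording.wav",  # placeholder path
    generate_kwargs={"language": "en", "task": "transcribe"},
)
print(result["text"])
```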
## πŸ“Š Evaluation Results
After fine-tuning the Whisper-Base model for speech-to-text, we evaluated it on the validation split of the Common Voice 11.0 dataset. The following results were obtained:
| Metric | Score | Meaning |
|------------|--------|------------------------------------------------|
| **WER** | 9.2% | Word Error Rate: Measures transcription accuracy |
| **CER** | 5.5% | Character Error Rate: Measures character-level accuracy |
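If you want to reproduce these metrics on your own validation data, a minimal sketch using the Hugging Face `evaluate` library is shown below (requires `pip install evaluate jiwer`); the reference and prediction strings are placeholders, not the card's actual evaluation data.

```python
import evaluate

# Hypothetical example strings; replace with your validation references and model outputs.
references = ["the quick brown fox jumps over the lazy dog"]
predictions = ["the quick brown fox jumped over the lazy dog"]

wer = evaluate.load("wer")  # word error rate
cer = evaluate.load("cer")  # character error rate

print(f"WER: {wer.compute(predictions=predictions, references=references):.3f}")
print(f"CER: {cer.compute(predictions=predictions, references=references):.3f}")
```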
## Fine-Tuning Details
### Dataset
The Mozilla Common Voice 11.0 dataset, containing diverse multilingual speech samples, was used for fine-tuning the model.
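As a reference, a minimal sketch of loading Common Voice 11.0 with the `datasets` library is shown below; the English split and streaming flag are assumptions, and the dataset is gated on the Hugging Face Hub, so you must accept its terms and be logged in.

```python
from datasets import load_dataset, Audio

# Hypothetical example: English split, streamed to avoid a full download.
common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True
)
# Resample to the 16 kHz rate Whisper expects.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

sample = next(iter(common_voice))
print(sample["sentence"], sample["audio"]["sampling_rate"])
```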
### Training
- **Number of epochs**: 6
- **Batch size**: 16
- **Evaluation strategy**: epoch
- **Learning Rate**: 5e-6
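
The hyperparameters above map onto `Seq2SeqTrainingArguments` roughly as in the sketch below; this is a hedged reconstruction, and any value not listed above (output path, mixed precision) is an assumption.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-finetuned",  # hypothetical output path
    num_train_epochs=6,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    learning_rate=5e-6,
    predict_with_generate=True,  # generate text during eval so WER/CER can be computed
    fp16=True,                   # assumption: mixed precision on GPU
)
```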
## πŸ“‚ Repository Structure
```bash
.
β”œβ”€β”€ model/               # Contains the quantized model files
β”œβ”€β”€ tokenizer_config/    # Tokenizer configuration and vocabulary files
β”œβ”€β”€ model.safetensors    # Quantized model weights
└── README.md            # Model documentation
```
## ⚠️ Limitations
- The model may struggle with highly noisy or overlapping speech.
- Performance may vary across different accents and dialects.
## 🀝 Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.