developerPushkal committed on
Commit 104b157 · verified · 1 Parent(s): 0fa0eaa

Create README.md

Files changed (1)
  1. README.md +100 -0
README.md ADDED
@@ -0,0 +1,100 @@
+ # OpenAI Whisper-Base Fine-Tuned Model for AI-transcriptionist
+
+ This repository hosts a fine-tuned version of the OpenAI Whisper-Base model for AI-transcriptionist (speech-to-text) tasks, trained on the [Mozilla Common Voice 11.0](https://commonvoice.mozilla.org/) dataset. The model is designed to transcribe speech into text efficiently while maintaining high accuracy.
+
+ ## Model Details
+ - **Model Architecture**: OpenAI Whisper-Base
+ - **Task**: AI-transcriptionist
+ - **Dataset**: [Mozilla Common Voice 11.0](https://commonvoice.mozilla.org/)
+ - **Fine-tuning Framework**: Hugging Face Transformers
+
+ ## 🚀 Usage
+
+ ### Installation
+ ```bash
+ pip install transformers torch torchaudio
+ ```
+
+ ### Loading the Model
+ ```python
+ from transformers import WhisperProcessor, WhisperForConditionalGeneration
+ import torch
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model_name = "AventIQ-AI/whisper-AI-transcriptionist"
+ model = WhisperForConditionalGeneration.from_pretrained(model_name).to(device)
+ processor = WhisperProcessor.from_pretrained(model_name)
+ ```
+
+ ### Speech-to-Text Inference
+ ```python
+ import torchaudio
+
+ # Assumes `torch`, `device`, `model`, and `processor` from the loading snippet above
+
+ # Load an audio file and convert it to the 16 kHz mono array Whisper expects
+ def load_audio(file_path, target_sampling_rate=16000):
+     # Load audio file (compressed formats such as .m4a may need an FFmpeg-capable torchaudio backend)
+     waveform, sample_rate = torchaudio.load(file_path)
+
+     # Convert to mono if stereo
+     if waveform.shape[0] > 1:
+         waveform = waveform.mean(dim=0, keepdim=True)
+
+     # Resample if needed
+     if sample_rate != target_sampling_rate:
+         waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sampling_rate)(waveform)
+
+     return waveform.squeeze(0).numpy()
+
+ input_audio_path = "/kaggle/input/test-data-2/Friday 4h04m pm.m4a"  # Change this to your audio file
+ audio_array = load_audio(input_audio_path)
+
+ # Convert the waveform to log-Mel input features
+ input_features = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features
+ input_features = input_features.to(device)
+
+ # Force English transcription (rather than translation or language detection)
+ forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
+
+ with torch.no_grad():
+     predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
+
+ # Decode output
+ transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
+
+ print(f"Transcribed Text: {transcription}")
+ ```
+
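By default, Whisper's feature extractor pads or truncates each input to a 30-second window, so a longer recording passed through the inference snippet above would likely be transcribed only for its first 30 seconds. Below is a minimal sketch of one workaround, assuming simple non-overlapping chunks are acceptable. The helper name `split_into_chunks` is hypothetical and not part of this repository.

```python
def split_into_chunks(samples, sampling_rate=16000, chunk_seconds=30):
    """Split a 1-D sequence of audio samples into consecutive chunks.

    Works on anything that supports len() and slicing (list, NumPy array).
    The final chunk may be shorter than chunk_seconds.
    """
    chunk_size = int(sampling_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("chunk size must be positive")
    return [samples[start:start + chunk_size]
            for start in range(0, len(samples), chunk_size)]


# Toy example: 70 "seconds" of audio at 10 samples per second.
toy_audio = list(range(700))
chunks = split_into_chunks(toy_audio, sampling_rate=10, chunk_seconds=30)
print([len(c) for c in chunks])  # [300, 300, 100]
```

Each chunk could then be passed through `processor(...)` and `model.generate(...)` in turn and the decoded strings joined. Words that straddle a chunk boundary may be cut, so overlapping chunks, or the `chunk_length_s` option of the Transformers automatic-speech-recognition pipeline, may give better results.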
+ ## 📊 Evaluation Results
+ After fine-tuning, we evaluated the model on the validation split of the Common Voice 11.0 dataset and obtained the following results:
+
+ | Metric | Score | Meaning |
+ |------------|--------|------------------------------------------------|
+ | **WER** | 9.2% | Word Error Rate: word-level errors (substitutions, deletions, insertions) divided by the number of reference words; lower is better |
+ | **CER** | 5.5% | Character Error Rate: the same ratio computed over characters; lower is better |
+
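For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are conventionally defined: an edit distance divided by the length of the reference. It is an illustration only. The README does not say which library or text normalization produced the scores above, and the function names here are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"
print(f"WER: {wer(reference, hypothesis):.3f}")  # WER: 0.167
print(f"CER: {cer(reference, hypothesis):.3f}")  # CER: 0.182
```

In this toy case one of six reference words is missing (WER 1/6), and four of 22 reference characters are missing (CER 4/22).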
+ ## Fine-Tuning Details
+
+ ### Dataset
+ The Mozilla Common Voice 11.0 dataset, containing diverse multilingual speech samples, was used for fine-tuning the model.
+
+ ### Training
+ - **Number of epochs**: 6
+ - **Batch size**: 16
+ - **Evaluation strategy**: epoch (evaluated once per epoch)
+ - **Learning Rate**: 5e-6
+
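To make the training hyperparameters concrete, here is a rough sketch of how they translate into optimizer steps. The dataset size below is a made-up placeholder (the README does not state how many training clips were used), and the sketch assumes a single device with no gradient accumulation.

```python
import math

# Hyperparameters as listed in the Training section.
hyperparams = {
    "num_epochs": 6,
    "batch_size": 16,
    "learning_rate": 5e-6,
}


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps, assuming one step per batch and a partial last batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical dataset size, for illustration only.
example_num_clips = 10_000
steps = total_optimizer_steps(
    example_num_clips, hyperparams["batch_size"], hyperparams["num_epochs"]
)
print(steps)  # 3750
```

With evaluation run once per epoch, a run like this would produce six evaluation checkpoints, one after every 625 steps under the placeholder dataset size.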
+ ## 📂 Repository Structure
+ ```bash
+ .
+ ├── model/ # Contains the quantized model files
+ ├── tokenizer_config/ # Tokenizer configuration and vocabulary files
+ ├── model.safetensors # Quantized Model
+ └── README.md # Model documentation
+ ```
+
+ ## ⚠️ Limitations
+ - The model may struggle with highly noisy or overlapping speech.
+ - Performance may vary across different accents and dialects.
+
+ ## 🤝 Contributing
+ Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
+