OpenAI Whisper-Base Fine-Tuned Model for AI-transcriptionist

This repository hosts a fine-tuned version of the OpenAI Whisper-Base model, optimized for the AI-transcriptionist task using the Mozilla Common Voice 11.0 dataset. The model is designed to transcribe speech to text efficiently while maintaining high accuracy.

Model Details

  • Model Architecture: OpenAI Whisper-Base
  • Task: Speech-to-text transcription (AI-transcriptionist)
  • Dataset: Mozilla Common Voice 11.0
  • Model Size: 72.6M parameters (FP16)
  • Fine-tuning Framework: Hugging Face Transformers

🚀 Usage

Installation

pip install transformers torch torchaudio

Loading the Model

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "AventIQ-AI/whisper-AI-transcriptionist"
model = WhisperForConditionalGeneration.from_pretrained(model_name).to(device)
processor = WhisperProcessor.from_pretrained(model_name)
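
Alternatively, the same checkpoint can be loaded through the transformers pipeline API, which bundles the processor and model and decodes audio files directly. A minimal sketch; the audio path is a placeholder, and decoding compressed formats requires ffmpeg on the system:

from transformers import pipeline

# The ASR pipeline wraps feature extraction, generation, and decoding
# into a single call. device=0 selects the first GPU, -1 the CPU.
asr = pipeline(
    "automatic-speech-recognition",
    model=model_name,
    device=0 if torch.cuda.is_available() else -1,
)

print(asr("path/to/audio.wav")["text"])  # placeholder path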

Speech-to-Text Inference

import torchaudio

# Load and process audio file
def load_audio(file_path, target_sampling_rate=16000):
    # Load audio file
    waveform, sample_rate = torchaudio.load(file_path)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Resample if needed
    if sample_rate != target_sampling_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sampling_rate)(waveform)

    return waveform.squeeze(0).numpy()

input_audio_path = "/kaggle/input/test-data-2/Friday 4h04m pm.m4a"  # Change this to your audio file
audio_array = load_audio(input_audio_path)

input_features = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")

with torch.no_grad():
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# Decode output
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Transcribed Text: {transcription}")
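
Note that Whisper's encoder operates on 30-second windows, so a single generate call truncates longer inputs. For longer recordings, one simple option is to split the waveform into 30-second chunks and transcribe each in turn. A minimal sketch, continuing from the code above; the fixed chunk boundaries are a naive assumption and may split words:

# Naive long-form transcription: split the 16 kHz waveform into
# 30-second chunks (Whisper's context window) and transcribe each.
chunk_size = 30 * 16000  # 30 seconds at 16 kHz

texts = []
for start in range(0, len(audio_array), chunk_size):
    chunk = audio_array[start:start + chunk_size]
    features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features.to(device)
    with torch.no_grad():
        ids = model.generate(features, forced_decoder_ids=forced_decoder_ids)
    texts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])

print(" ".join(texts))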

📊 Evaluation Results

After fine-tuning the Whisper-Base model for speech-to-text, we evaluated the model's performance on the validation set from the Common Voice 11.0 dataset. The following results were obtained:

Metric   Score   Meaning
WER      9.2%    Word Error Rate: measures word-level transcription accuracy
CER      5.5%    Character Error Rate: measures character-level accuracy
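
Metrics like these can be computed with the Hugging Face evaluate library. A minimal sketch; the prediction and reference strings are placeholders, not samples from the actual validation set:

# Requires: pip install evaluate jiwer
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

# Placeholder strings; a real evaluation iterates over the validation set.
predictions = ["the quick brown fox"]
references = ["the quick brown fox jumps"]

print(f"WER: {wer.compute(predictions=predictions, references=references):.3f}")
print(f"CER: {cer.compute(predictions=predictions, references=references):.3f}")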

Fine-Tuning Details

Dataset

The Mozilla Common Voice 11.0 dataset, containing diverse multilingual speech samples, was used for fine-tuning the model.

Training

  • Number of epochs: 6
  • Batch size: 16
  • Evaluation strategy: epoch
  • Learning Rate: 5e-6
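
For reference, these hyperparameters map naturally onto transformers Seq2SeqTrainingArguments. A minimal sketch; the exact training script for this checkpoint is not published, so output_dir and the remaining arguments are illustrative assumptions:

from transformers import Seq2SeqTrainingArguments

# Hyperparameters from the list above; everything else is an assumption.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-finetuned",  # hypothetical output path
    num_train_epochs=6,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # "evaluation_strategy" in older transformers releases
    learning_rate=5e-6,
    fp16=True,  # matches the FP16 tensor type of the released weights
)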

📂 Repository Structure

.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation

⚠️ Limitations

  • The model may struggle with highly noisy or overlapping speech.
  • Performance may vary across different accents and dialects.

🤝 Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
