---
language:
- ja
tags:
- automatic-speech-recognition
- whisper
- japanese
- distillation
- common-voice
- asr
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
pipeline_tag: automatic-speech-recognition
library_name: transformers
metrics:
- cer
model-index:
- name: whisper-small-ja-distill
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 (ja)
      type: mozilla-foundation/common_voice_17_0
      args: ja
    metrics:
    - name: cer
      type: cer
      value: 0.25858
---
# faster-whisper-ja-distill

This model was distilled by running the teacher model (Whisper) on Japanese speech data and training the student model on the teacher's transcriptions. Model details are described below.
- Teacher: `deepdml/faster-whisper-large-v3-turbo-ct2` (CTranslate2)
- Student: `openai/whisper-small`

※ A distilled whisper-base version of `deepdml/faster-whisper-large-v3-turbo-ct2` is coming soon; it is not yet available.
## Training
- Data: Common Voice 17.0 (ja), CC-BY-4.0
- Distillation: hard labels (the student is trained on the teacher's transcriptions; see the sketch after this list)
- Precision: FP32, with gradient checkpointing
- Hardware:
  - 24 GB VRAM (RTX A5000)
  - AMD EPYC 7H12
  - 54 GB DRAM
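
For illustration, the following is a minimal sketch of the hard-label loop described above, assuming `faster-whisper` for the teacher and a plain seq2seq fine-tuning step for the student. The optimizer, learning rate, and per-sample updates are assumptions, not the actual training configuration (see the W&B project for that).

```python
# Minimal hard-label distillation sketch (hyperparameters are assumptions).
import torch
from datasets import Audio, load_dataset
from faster_whisper import WhisperModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

teacher = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2")  # CT2 teacher
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
student = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
student.gradient_checkpointing_enable()  # FP32 + gradient checkpointing, as above
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)  # lr is an assumption

ds = load_dataset("mozilla-foundation/common_voice_17_0", "ja", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

student.train()
for sample in ds:
    audio = sample["audio"]["array"].astype("float32")
    # 1) The teacher produces the hard label (a pseudo-transcription).
    segments, _ = teacher.transcribe(audio, language="ja")
    pseudo_label = "".join(seg.text for seg in segments)
    # 2) The student is fine-tuned on the teacher's output with the usual CE loss.
    features = processor(audio, sampling_rate=16_000, return_tensors="pt").input_features
    labels = processor.tokenizer(pseudo_label, return_tensors="pt").input_ids
    loss = student(input_features=features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Hard distillation only needs the teacher's text output, so the CT2 teacher and the Transformers student never have to share a framework.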
Other training details are published in the W&B project.
## Evaluation

### Conditions
- Dataset: Common Voice 17.0 (ja) validation split
- Sample size: N=1000
- Audio: 16 kHz / mono
- Decoding: `num_beams=2`, `max_length=225`
- Language/Task: `language="ja"`, `task="transcribe"`
- Chunking: `chunk_length_s=30`, `stride_length_s=(5, 5)` (Transformers ASR pipeline)
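
For reference, here is a minimal sketch that reproduces these conditions with the `evaluate` library's CER metric (requires `jiwer`). The actual evaluation script is the one logged to W&B; the dataset column names and the call-time `generate_kwargs` override are assumptions.

```python
# Approximate reproduction of the evaluation conditions above.
import evaluate
import torch
from datasets import Audio, load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("zary0/faster-whisper-ja-distill")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=(5, 5),
    device=0 if torch.cuda.is_available() else -1,
)

cer = evaluate.load("cer")  # character error rate
val = load_dataset("mozilla-foundation/common_voice_17_0", "ja", split="validation")
val = val.cast_column("audio", Audio(sampling_rate=16_000)).select(range(1000))  # N=1000

preds, refs = [], []
for sample in val:
    out = asr(
        sample["audio"]["array"],
        generate_kwargs={
            "max_length": 225,
            "num_beams": 2,  # beam size used in this evaluation
            "language": "ja",
            "task": "transcribe",
        },
    )
    preds.append(out["text"])
    refs.append(sample["sentence"])

print("CER:", cer.compute(predictions=preds, references=refs))
```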
### Metrics

W&B: Run / Comparison
## Usage
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline

# The student reuses the openai/whisper-small tokenizer and feature extractor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("zary0/faster-whisper-ja-distill")
model.eval()

# Force Japanese transcription
forced_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")
model.generation_config.forced_decoder_ids = forced_ids
model.config.forced_decoder_ids = forced_ids

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=(5, 5),
    return_timestamps=False,
    device=0 if torch.cuda.is_available() else -1,
    generate_kwargs={
        "max_length": 225,
        "num_beams": 1,
        "forced_decoder_ids": forced_ids,
    },
)

result = asr("sample.mp3")
print(result["text"])
```
## Convert to CT2

The CTranslate2-converted model is available here.
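
If you want to reproduce the conversion yourself, a minimal sketch with ctranslate2's `TransformersConverter` follows; the output directory name, quantization setting, and copied files are assumptions.

```python
# Convert the distilled student to CTranslate2 format for use with faster-whisper.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "zary0/faster-whisper-ja-distill",
    copy_files=["tokenizer.json", "preprocessor_config.json"],  # assumed to exist in the repo
)
converter.convert("faster-whisper-ja-distill-ct2", quantization="float16")
```

The resulting directory can then be loaded with `faster_whisper.WhisperModel("faster-whisper-ja-distill-ct2")`.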
## License / Attribution

- Model: Apache-2.0
- Data: Common Voice 17.0 (ja) © Mozilla, CC-BY-4.0. Provide attribution.