---
language:
- ja
tags:
- automatic-speech-recognition
- whisper
- japanese
- distillation
- common-voice
- asr
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
pipeline_tag: automatic-speech-recognition
library_name: transformers
metrics:
- cer
model-index:
- name: whisper-small-ja-distill
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 (ja)
      type: mozilla-foundation/common_voice_17_0
      args: ja
    metrics:
    - name: cer
      type: cer
      value: 0.25858
---
# faster-whisper-ja-distill

This model was distilled by running the teacher model (Whisper) on Japanese speech data and training the student model on the teacher's transcriptions. Model details are described below.
- Teacher: `deepdml/faster-whisper-large-v3-turbo-ct2` (CTranslate2)
- Student: `openai/whisper-small`

※ A distilled whisper-base version of `deepdml/faster-whisper-large-v3-turbo-ct2` is coming soon; it is not yet available.
## Training
- Data: Common Voice 17.0 (ja), CC-BY-4.0
- Distillation: hard labels (the student is trained on the teacher's transcriptions; see the sketch after this list)
- Precision: FP32, with gradient checkpointing
- Hardware:
  - 24 GB VRAM (RTX A5000)
  - AMD EPYC 7H12
  - 54 GB DRAM
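
For illustration, the following is a minimal sketch of the hard-label loop described above, assuming `faster-whisper` for the teacher and a plain seq2seq fine-tuning step for the student. The optimizer, learning rate, and per-sample updates are assumptions, not the actual training configuration (see the W&B project for that).

```python
# Minimal hard-label distillation sketch (hyperparameters are assumptions).
import torch
from datasets import Audio, load_dataset
from faster_whisper import WhisperModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

teacher = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2")  # CT2 teacher
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
student = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
student.gradient_checkpointing_enable()  # FP32 + gradient checkpointing, as above
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)  # lr is an assumption

ds = load_dataset("mozilla-foundation/common_voice_17_0", "ja", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

student.train()
for sample in ds:
    audio = sample["audio"]["array"].astype("float32")
    # 1) The teacher produces the hard label (a pseudo-transcription).
    segments, _ = teacher.transcribe(audio, language="ja")
    pseudo_label = "".join(seg.text for seg in segments)
    # 2) The student is fine-tuned on the teacher's output with the usual CE loss.
    features = processor(audio, sampling_rate=16_000, return_tensors="pt").input_features
    labels = processor.tokenizer(pseudo_label, return_tensors="pt").input_ids
    loss = student(input_features=features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Hard distillation only needs the teacher's text output, so the CT2 teacher and the Transformers student never have to share a framework.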
Other training details are published in the W&B project.
## Evaluation

### Conditions
- Dataset: Common Voice 17.0 (ja) validation split
- Sample size: N=1000
- Audio: 16 kHz / mono
- Decoding: `num_beams=2`, `max_length=225`
- Language/Task: `language="ja"`, `task="transcribe"`
- Chunking: `chunk_length_s=30`, `stride_length_s=(5, 5)` (Transformers ASR pipeline)
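
For reference, here is a minimal sketch that reproduces these conditions with the `evaluate` library's CER metric (requires `jiwer`). The actual evaluation script is the one logged to W&B; the dataset column names and the call-time `generate_kwargs` override are assumptions.

```python
# Approximate reproduction of the evaluation conditions above.
import evaluate
import torch
from datasets import Audio, load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("zary0/faster-whisper-ja-distill")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=(5, 5),
    device=0 if torch.cuda.is_available() else -1,
)

cer = evaluate.load("cer")  # character error rate
val = load_dataset("mozilla-foundation/common_voice_17_0", "ja", split="validation")
val = val.cast_column("audio", Audio(sampling_rate=16_000)).select(range(1000))  # N=1000

preds, refs = [], []
for sample in val:
    out = asr(
        sample["audio"]["array"],
        generate_kwargs={
            "max_length": 225,
            "num_beams": 2,  # beam size used in this evaluation
            "language": "ja",
            "task": "transcribe",
        },
    )
    preds.append(out["text"])
    refs.append(sample["sentence"])

print("CER:", cer.compute(predictions=preds, references=refs))
```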
### Metrics

W&B: Run / Comparison
## Usage
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline

# The student reuses the openai/whisper-small tokenizer and feature extractor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("zary0/faster-whisper-ja-distill")
model.eval()

# Force Japanese transcription
forced_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")
model.generation_config.forced_decoder_ids = forced_ids
model.config.forced_decoder_ids = forced_ids

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=(5, 5),
    return_timestamps=False,
    device=0 if torch.cuda.is_available() else -1,
    generate_kwargs={
        "max_length": 225,
        "num_beams": 1,
        "forced_decoder_ids": forced_ids,
    },
)

result = asr("sample.mp3")
print(result["text"])
```
## Convert to CT2

The CTranslate2-converted model is available here.
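
If you want to reproduce the conversion yourself, a minimal sketch with ctranslate2's `TransformersConverter` follows; the output directory name, quantization setting, and copied files are assumptions.

```python
# Convert the distilled student to CTranslate2 format for use with faster-whisper.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "zary0/faster-whisper-ja-distill",
    copy_files=["tokenizer.json", "preprocessor_config.json"],  # assumed to exist in the repo
)
converter.convert("faster-whisper-ja-distill-ct2", quantization="float16")
```

The resulting directory can then be loaded with `faster_whisper.WhisperModel("faster-whisper-ja-distill-ct2")`.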
## License / Attribution

- Model: Apache-2.0
- Data: Common Voice 17.0 (ja) © Mozilla, CC-BY-4.0. Provide attribution.