
japanese-wav2vec2-base-rs35kh

This model is a wav2vec 2.0 Base model fine-tuned on ReazonSpeech v2.0, a large-scale Japanese ASR corpus.

Usage

You can use this model with the transformers library:

import librosa
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

# Load the model in bfloat16 with FlashAttention-2
# (requires a compatible GPU and the flash-attn package)
model = Wav2Vec2ForCTC.from_pretrained(
    "reazon-research/japanese-wav2vec2-base-rs35kh",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
processor = AutoProcessor.from_pretrained("reazon-research/japanese-wav2vec2-base-rs35kh")

# Load the audio at 16 kHz; padding the audio with 0.5 s of silence
# on both sides before inference is recommended
audio, _ = librosa.load(audio_filepath, sr=16_000)
audio = np.pad(audio, pad_width=int(0.5 * 16_000))
input_values = processor(
    audio,
    return_tensors="pt",
    sampling_rate=16_000,
).input_values.to("cuda").to(torch.bfloat16)

# Greedy CTC decoding
with torch.inference_mode():
    logits = model(input_values).logits.cpu()
predicted_ids = torch.argmax(logits, dim=-1)[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)
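
If your environment does not have a GPU with bfloat16 and FlashAttention-2 support, a plain float32 / CPU variant also works. The following is a minimal sketch of that fallback (not an official recipe from this card); audio_filepath is a placeholder for your own recording:

import librosa
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

# Default float32 weights, standard attention, CPU inference
model = Wav2Vec2ForCTC.from_pretrained("reazon-research/japanese-wav2vec2-base-rs35kh")
processor = AutoProcessor.from_pretrained("reazon-research/japanese-wav2vec2-base-rs35kh")

audio, _ = librosa.load(audio_filepath, sr=16_000)
audio = np.pad(audio, pad_width=int(0.5 * 16_000))  # same 0.5 s padding as above
input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values

with torch.inference_mode():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)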

Test Results

We report the Character Error Rate (CER) of our model alongside other publicly available Japanese wav2vec 2.0 models (lower is better).

| Model | #Parameters | AVERAGE ⬇ | JSUT-BASIC5000 ⬇ | Common Voice ⬇ | TEDxJP-10K ⬇ |
|---|---|---|---|---|---|
| reazon-research/japanese-wav2vec2-base-rs35kh | 96.7M | 20.40% | 13.22% | 23.76% | 24.23% |
| Ivydata/wav2vec2-large-xlsr-53-japanese | 318M | 24.23% | 13.83% | 18.15% | 40.72% |
| jonatasgrosman/wav2vec2-large-xlsr-53-japanese | 317M | 31.82% | 4.25% | 40.58% | 50.63% |
| vumichien/wav2vec2-large-xlsr-japanese | 318M | 39.87% | 4.21% | 53.29% | 62.12% |
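
For reference, CER is the character-level edit (Levenshtein) distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch of how it can be computed (not necessarily the exact evaluation script used for the numbers above):

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance divided by reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit distance between character sequences
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(r)][len(h)] / len(r)

print(cer("今日はいい天気です", "今日は天気です"))  # 2 missing characters out of 9 -> 0.222...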

We also report the CER for long-form speech.

| Model | #Parameters | JSUT-BOOK ⬇ |
|---|---|---|
| reazon-research/japanese-wav2vec2-base-rs35kh | 96.7M | 82.84% |
| Ivydata/wav2vec2-large-xlsr-53-japanese | 318M | 65.60% |
| jonatasgrosman/wav2vec2-large-xlsr-53-japanese | 317M | 46.20% |
| vumichien/wav2vec2-large-xlsr-japanese | 318M | 46.52% |
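
As the table shows, CER degrades noticeably on long-form audio. If you need to transcribe long recordings, one common workaround (not part of this card's official recipe) is to split the audio into shorter chunks, transcribe each chunk independently, and concatenate the results. A minimal sketch, reusing the model and processor loaded in the Usage section and assuming a fixed 20-second window:

import numpy as np
import torch

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 20  # assumed chunk length; tune for your data

def transcribe_long(audio: np.ndarray) -> str:
    pieces = []
    chunk_size = CHUNK_SECONDS * SAMPLE_RATE
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        chunk = np.pad(chunk, pad_width=int(0.5 * SAMPLE_RATE))  # same padding as above
        input_values = processor(
            chunk, return_tensors="pt", sampling_rate=SAMPLE_RATE
        ).input_values.to("cuda").to(torch.bfloat16)
        with torch.inference_mode():
            logits = model(input_values).logits.cpu()
        predicted_ids = torch.argmax(logits, dim=-1)[0]
        pieces.append(processor.decode(predicted_ids, skip_special_tokens=True))
    return "".join(pieces)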

Citation

@misc{reazon-research-japanese-wav2vec2-base-rs35kh,
  title={japanese-wav2vec2-base-rs35kh},
  author={Sasaki, Yuta},
  url = {https://huggingface.co/reazon-research/japanese-wav2vec2-base-rs35kh},
  year = {2024}
}

License

Apache License 2.0
