japanese-hubert-base-phoneme-ctc

A model obtained by fine-tuning rinna/japanese-hubert-base for Japanese phoneme recognition with CTC.

Model Overview

  • rinna/japanese-hubert-base was fine-tuned on the ReazonSpeech v2 dataset, using phoneme labels generated with pyopenjtalk-plus as the ground truth (a label-generation sketch follows this list)
  • After roughly 0.3 epochs of training, the checkpoint with the best accuracy on the JSUT corpus (labels: https://github.com/sarulab-speech/jsut-label) was selected
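
The label-generation code itself is not part of this card; the following is a minimal sketch of how phoneme labels of this form could be produced, assuming pyopenjtalk-plus keeps the pyopenjtalk import name and g2p interface (make_phoneme_label is a hypothetical helper, not the authors' code):

import pyopenjtalk  # import name provided by the pyopenjtalk-plus package

# Convert a transcript into a space-separated phoneme string,
# e.g. "こんにちは" -> "k o N n i ch i w a".
def make_phoneme_label(text: str) -> str:
    return pyopenjtalk.g2p(text)

print(make_phoneme_label("こんにちは"))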

Hyperparameters

  • Learning rate (see the sketch after this list)
    • CTC head: 2e-5
    • All other parameters: 2e-6
  • Batch size: 32
  • Maximum audio length: 250000 samples (≈15.6 s at 16 kHz)
  • Optimizer: AdamW
    • betas: (0.9, 0.98)
    • weight_decay: 0.01
  • Learning rate schedule: cosine
    • Warmup steps: 10000
    • Maximum steps: 800000
      • Training was stopped early at 200000 steps because accuracy on JSUT had stopped improving
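
The training script is not included here; the sketch below shows one way the two learning rates and the schedule listed above could be wired up with standard transformers utilities. The parameter-group split, variable names, and vocabulary size are assumptions for illustration, not the authors' actual setup:

import torch
from transformers import HubertForCTC, get_cosine_schedule_with_warmup

# vocab_size=45 is a placeholder for the size of the phoneme vocabulary.
model = HubertForCTC.from_pretrained("rinna/japanese-hubert-base", vocab_size=45)

# Separate parameter groups: the freshly initialized CTC head gets a larger
# learning rate than the pretrained HuBERT encoder.
head_params = [p for n, p in model.named_parameters() if n.startswith("lm_head")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("lm_head")]

optimizer = torch.optim.AdamW(
    [
        {"params": head_params, "lr": 2e-5},
        {"params": other_params, "lr": 2e-6},
    ],
    betas=(0.9, 0.98),
    weight_decay=0.01,
)

# Cosine decay with linear warmup, matching the step counts listed above.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=800_000
)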

Usage Example

import librosa
import numpy as np
import torch
from transformers import HubertForCTC, Wav2Vec2Processor

MODEL_NAME = "prj-beatrice/japanese-hubert-base-phoneme-ctc"
model = HubertForCTC.from_pretrained(MODEL_NAME)
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)

# Load the audio at 16 kHz and pad with 1 s of silence at the start
# and 0.5 s at the end.
audio, sr = librosa.load("audio.wav", sr=16000)
audio = np.concatenate([np.zeros(sr), audio, np.zeros(sr // 2)])

inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    # Greedy CTC decoding: take the argmax token at each frame; decode()
    # collapses repeated tokens and removes blanks.
    predicted_ids = outputs.logits.argmax(-1)
    phonemes = processor.decode(predicted_ids[0], spaces_between_special_tokens=True)

print(phonemes)
# => "m i z u o m a r e e sh i a k a r a k a w a n a k U t e w a n a r a n a i n o d e s U"
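
The string above is the collapsed CTC output. If rough frame-level timing is also wanted, it can be recovered from the per-frame argmax. The sketch below continues from the example above, is not part of the original usage example, and assumes the standard HuBERT base frame rate (one frame per 320 input samples, i.e. 20 ms at 16 kHz) and that the tokenizer's pad token serves as the CTC blank:

ids = predicted_ids[0].tolist()
blank_id = processor.tokenizer.pad_token_id
frame_sec = 320 / sr  # HuBERT base downsamples the waveform by a factor of 320

# Collapse repeated tokens and drop blanks, keeping the first frame index
# of each emitted phoneme as an approximate onset time.
timed_phonemes = []
prev = blank_id
for frame, token_id in enumerate(ids):
    if token_id != blank_id and token_id != prev:
        phoneme = processor.tokenizer.convert_ids_to_tokens(token_id)
        timed_phonemes.append((round(frame * frame_sec, 2), phoneme))
    prev = token_id

print(timed_phonemes)  # e.g. [(1.02, "m"), (1.10, "i"), ...]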

Training Environment

  • A100 80GB
  • Python 3.10.12
absl-py==2.3.0
accelerate==1.7.0
aiohappyeyeballs==2.6.1
aiohttp==3.12.13
aiosignal==1.3.2
annotated-types==0.7.0
async-timeout==5.0.1
attrs==25.3.0
audioread==3.0.1
certifi==2025.6.15
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
coloredlogs==15.0.1
coverage==7.9.1
datasets==3.6.0
decorator==5.2.1
dill==0.3.8
evaluate==0.4.3
exceptiongroup==1.3.0
filelock==3.18.0
flatbuffers==25.2.10
frozenlist==1.7.0
fsspec==2025.3.0
gitdb==4.0.12
gitpython==3.1.44
grpcio==1.73.0
hf-xet==1.1.3
huggingface-hub==0.33.0
humanfriendly==10.0
idna==3.10
iniconfig==2.1.0
jinja2==3.1.6
jiwer==3.1.0
joblib==1.5.1
lazy-loader==0.4
librosa==0.11.0
llvmlite==0.44.0
markdown==3.8
markupsafe==3.0.2
mpmath==1.3.0
msgpack==1.1.1
multidict==6.4.4
multiprocess==0.70.16
networkx==3.4.2
numba==0.61.2
numpy==2.2.6
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvrtc-cu12==12.6.77
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.5.1.17
nvidia-cufft-cu12==11.3.0.4
nvidia-cufile-cu12==1.11.1.6
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.6.85
nvidia-nvtx-cu12==12.6.77
onnxruntime==1.22.0
packaging==25.0
pandas==2.3.0
platformdirs==4.3.8
pluggy==1.6.0
pooch==1.8.2
propcache==0.3.2
protobuf==6.31.1
psutil==7.0.0
pyarrow==20.0.0
pycparser==2.22
pydantic==2.11.7
pydantic-core==2.33.2
pygments==2.19.1
pyopenjtalk-plus==0.4.1.post3
pytest==8.4.0
pytest-cov==6.2.1
python-dateutil==2.9.0.post0
pytz==2025.2
pyyaml==6.0.2
rapidfuzz==3.13.0
regex==2024.11.6
requests==2.32.4
ruff==0.11.13
safetensors==0.5.3
scikit-learn==1.7.0
scipy==1.15.3
sentry-sdk==2.30.0
setproctitle==1.3.6
setuptools==80.9.0
six==1.17.0
smmap==5.0.2
soundfile==0.13.1
soxr==0.5.0.post1
sudachidict-core==20250515
sudachipy==0.6.10
sympy==1.14.0
tensorboard==2.19.0
tensorboard-data-server==0.7.2
threadpoolctl==3.6.0
tokenizers==0.21.1
tomli==2.2.1
torch==2.7.1
torchaudio==2.7.1
tqdm==4.67.1
transformers==4.52.4
triton==3.3.1
typing-extensions==4.14.0
typing-inspection==0.4.1
tzdata==2025.2
urllib3==2.4.0
wandb==0.20.1
werkzeug==3.1.3
xxhash==3.5.0
yarl==1.20.1