---
language: zh
tags:
- automatic-speech-recognition
license: cc-by-sa-4.0
datasets:
- common_voice
metrics:
- cer
---

# Wav2vec2-large-xlsr-cantonese

This model is based on [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53), fine-tuned on Common Voice zh-HK 6.1.0.
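
The same data can be pulled from the Hugging Face Hub with the `datasets` library. A minimal sketch (the split choice here is illustrative; the exact split used for fine-tuning is defined in the training code):

```python
from datasets import load_dataset

# Common Voice 6.1, Cantonese (zh-HK); split choice is illustrative
train_ds = load_dataset("common_voice", "zh-HK", split="train")
print(train_ds[0]["sentence"])  # reference transcription of the first clip
```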

The training code is similar to that of [user ctl](https://huggingface.co/ctl/wav2vec2-large-xlsr-cantonese), except that the number of training epochs was doubled to 80 and the fp16_backend was set to apex. The model was trained on a single RTX 3090 using the Docker image nvidia/cuda:11.1-cudnn8-devel.
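
In `transformers`, those two deviations correspond to settings like the following. This is a sketch, not the actual training script; every argument other than `num_train_epochs` and `fp16_backend` is a placeholder:

```python
from transformers import TrainingArguments

# Sketch only: output_dir, batch size and save_steps are placeholders,
# not the values used to train this model.
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-cantonese",
    per_device_train_batch_size=16,  # placeholder
    save_steps=400,                  # placeholder
    num_train_epochs=80,             # doubled relative to the reference recipe
    fp16=True,
    fp16_backend="apex",             # NVIDIA apex instead of native AMP
)
```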

The CER is 15.11% when evaluated against the Common Voice zh-HK test set.

# Result (CER)

15.11%
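
A rough way to reproduce this number, assuming the `jiwer` package for the character error rate (text normalisation such as punctuation stripping is omitted here, so the result may differ slightly):

```python
import torch
from datasets import Audio, load_dataset
from jiwer import cer  # pip install jiwer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")

# Common Voice 6.1 zh-HK test split, decoded at the model's 16kHz rate
test_set = load_dataset("common_voice", "zh-HK", split="test")
test_set = test_set.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in test_set:
    inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predictions.append(processor.decode(torch.argmax(logits, dim=-1)[0]))
    references.append(sample["sentence"])

print(f"CER: {cer(references, predictions):.2%}")
```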

# Source Code

See the GitHub repo [cantonese-selfish-project](https://github.com/scottykwok/cantonese-selfish-project/) and the [demo video](https://youtu.be/k_9RQ-ilGEc).

# Usage

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load the pretrained model and processor
processor = Wav2Vec2Processor.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")

# load audio - must be 16kHz mono
audio_input, sample_rate = sf.read('audio.wav')

# normalize and pad the input, returning a PyTorch tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

# inference: retrieve logits and take the argmax over the vocabulary
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# decode the predicted token ids into a transcription
transcription = processor.decode(predicted_ids[0])
print("-" * 20)
print("Transcription:\n", transcription.lower())
print("-" * 20)
```
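
If the source audio is not already 16kHz mono, it can be resampled and downmixed on load; a small sketch using librosa (an extra dependency assumed here, not required by the model itself):

```python
import librosa

# librosa resamples to 16kHz and downmixes to mono while loading
audio_input, sample_rate = librosa.load('audio.wav', sr=16000, mono=True)
```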