---
language: sw
datasets:
- ALFFA (African Languages in the Field: speech Fundamentals and Automation) - [here](http://www.openslr.org/25/)
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Swahili XLSR-53 Wav2Vec2.0 Large
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: ALFFA sw
      args: sw
    metrics:
    - name: Test WER
      type: wer
      value: WIP
---
# Wav2Vec2-Large-XLSR-53-Swahili
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Swahili using the [ALFFA](http://www.openslr.org/25/) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.
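For example, a minimal check-and-resample step with `torchaudio` might look like the sketch below (the file path is a placeholder); the full inference pipeline in the Usage section does the same thing for 48 kHz input:

```python
import torchaudio

# Placeholder path; replace with your own audio file.
speech, sample_rate = torchaudio.load("audio.wav")

# Resample to 16 kHz if the file was recorded at a different rate.
if sample_rate != 16_000:
    speech = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)(speech)
```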
## Usage
The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("alokmatta/wav2vec2-large-xlsr-53-sw")
model = Wav2Vec2ForCTC.from_pretrained("alokmatta/wav2vec2-large-xlsr-53-sw").to("cuda")

# The model expects 16 kHz audio; this resampler converts 48 kHz input down to 16 kHz.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def load_file_to_data(file):
    batch = {}
    speech, _ = torchaudio.load(file)
    batch["speech"] = resampler.forward(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    return batch

def predict(data):
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to("cuda")
    attention_mask = features.attention_mask.to("cuda")
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)

predict(load_file_to_data('./demo.wav'))
```
**Test Result (WER)**: WIP
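Once reference transcriptions are available, the test WER can be computed along these lines. This is a minimal sketch, assuming you have a local list of ALFFA test audio paths and their reference transcripts and that the `jiwer` package is installed; `test_files` and `test_references` below are hypothetical placeholders, not part of this repository.

```python
import jiwer

# Placeholder inputs: paths to ALFFA test audio files and their reference transcripts.
test_files = ["./test_clip_1.wav", "./test_clip_2.wav"]
test_references = ["maandishi ya kwanza", "maandishi ya pili"]

# Transcribe each file with the predict()/load_file_to_data() helpers defined above.
predictions = [predict(load_file_to_data(path))[0] for path in test_files]

# jiwer.wer returns the word error rate as a fraction; multiply by 100 for a percentage.
wer = jiwer.wer(test_references, predictions)
print(f"WER: {wer * 100:.2f}%")
```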
## Training
The script used for training will be published here soon.