---
language: sw  
datasets:
- ALFFA (African Languages in the Field: speech Fundamentals and Automation) - [here](http://www.openslr.org/25/)
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Swahili XLSR-53 Wav2Vec2.0 Large
  results:
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: ALFFA sw
      args: sw
    metrics:
       - name: Test WER
         type: wer
         value: WIP
---

# Wav2Vec2-Large-XLSR-53-Swahili 

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Swahili using the [ALFFA](http://www.openslr.org/25/), ... and ... datasets.

When using this model, make sure that your speech input is sampled at 16 kHz.
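For example, a quick way to check a file's native sampling rate and resample it if needed with `torchaudio` (the path `./demo.wav` below is just a placeholder):

```python
import torchaudio

# Load the audio; torchaudio.load returns the waveform and its native sampling rate.
speech, sample_rate = torchaudio.load("./demo.wav")

# Resample to the 16 kHz rate the model expects, if necessary.
if sample_rate != 16_000:
    resample = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)
    speech = resample(speech)
```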

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained("alokmatta/wav2vec2-large-xlsr-53-sw")

model = Wav2Vec2ForCTC.from_pretrained("alokmatta/wav2vec2-large-xlsr-53-sw").to(device)

# Resample 48 kHz recordings down to the 16 kHz rate the model expects.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def load_file_to_data(file):
    batch = {}
    speech, _ = torchaudio.load(file)
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    return batch


def predict(data):
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    # Greedy CTC decoding: pick the most likely token at every frame.
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)

predict(load_file_to_data('./demo.wav'))
```
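The `predict` helper decodes the CTC logits greedily with `argmax`, so no language model is involved; pairing the acoustic model with a CTC beam-search decoder and a language model would typically lower the WER further.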

**Test Result (WER %)**: WIP
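Until the official number is published, WER can be estimated along the following lines. This is a minimal sketch: the `(path, transcript)` pairs are hypothetical placeholders for the ALFFA test split, and `jiwer` is just one of several WER implementations.

```python
import torch
import torchaudio
from jiwer import wer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical (audio path, reference transcript) pairs standing in for the test split.
test_pairs = [
    ("./alffa_test_0001.wav", "habari ya asubuhi"),
]

processor = Wav2Vec2Processor.from_pretrained("alokmatta/wav2vec2-large-xlsr-53-sw")
model = Wav2Vec2ForCTC.from_pretrained("alokmatta/wav2vec2-large-xlsr-53-sw")

predictions, references = [], []
for path, reference in test_pairs:
    speech, sample_rate = torchaudio.load(path)
    # Resample from the file's native rate to the 16 kHz rate the model expects.
    speech = torchaudio.transforms.Resample(sample_rate, 16_000)(speech).squeeze(0)
    inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predictions.append(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
    references.append(reference)

print("WER:", wer(references, predictions))
```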


## Training


The script used for training will be linked here soon.