WavLM Base+ French Italian Phonemizer

The WavLM Base Plus Phonemizer FR IT is a phonemization model for both French and Italian. Given an audio file, it transcribes the speech it hears into the International Phonetic Alphabet (IPA).

This is ongoing work. Training is currently limited by the lack of training data and of available prior work. A better version may come soon.

Model Details

As inputs it takes an audio file and the desired language. It returns the list of phonemes uttered in the audio.

It does not use a language model, so it is unlikely to force what it hears onto existing words.

Technically, it also takes an attention mask as a third input. However, the mask only matters when the data is provided as a batch. Set the attention mask to 1 for the audio samples that were not padded, and to 0 for the rest.

For instance, if you have a batch of a single audio of size [1, 100], the attention mask should be of size [1, 100], with all values set to 1.

Now, you have a second audio of length 120. You pad the first audio and get a batch of size [2, 120]. The attention mask is now of shape [2, 120], with attention_mask[0] = [1 1 ... 0] (last 20 values are zeros) and attention_mask[1] = [1 1 ... 1].
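The padding example above can be sketched in plain Python. This is only an illustration: `pad_batch` is a hypothetical helper, not part of the model code, and real inputs would be tensors rather than lists.

```python
def pad_batch(audios, pad_value=0.0):
    """Right-pad variable-length audios and build the matching attention masks."""
    max_len = max(len(audio) for audio in audios)
    padded, masks = [], []
    for audio in audios:
        n_pad = max_len - len(audio)
        padded.append(list(audio) + [pad_value] * n_pad)
        # 1 marks real samples, 0 marks padded positions
        masks.append([1] * len(audio) + [0] * n_pad)
    return padded, masks


# Two dummy audios of 100 and 120 samples, as in the example above
batch, attention_mask = pad_batch([[0.1] * 100, [0.2] * 120])
print(len(attention_mask), len(attention_mask[0]))      # 2 120
print(sum(attention_mask[0]), sum(attention_mask[1]))   # 100 120
```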

Uses

The model works with French and Italian audio. To prepare your Python environment:

pip install torch torchaudio transformers

To transcribe an audio file, you can use the following code.

"""
Simple demonstration.
See main.py for a more complete demonstration.
"""
import torch
import torchaudio
import transformers

import wavlm_phoneme_fr_it

# Load the CTC processor
feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
    "microsoft/wavlm-base-plus"
)
tokenizer = transformers.Wav2Vec2CTCTokenizer(
    "./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
processor = transformers.Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

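# Hypothetical input list: replace the path with your own recording
audio_files = [{"path": "audio.wav", "language": "fr"}]
SAMPLING_RATE = 16_000  # WavLM expects 16 kHz audio

# Load each file as a mono waveform at the expected sampling rate
audio_arrays = []
for row in audio_files:
    waveform, sample_rate = torchaudio.load(row["path"])
    waveform = torchaudio.functional.resample(waveform, sample_rate, SAMPLING_RATE)
    audio_arrays.append(waveform.mean(dim=0).numpy())

inputs = processor(
    audio_arrays,
    sampling_rate=SAMPLING_RATE,
    padding=True,
    return_tensors="pt",
)
inputs["language"] = [row["language"] for row in audio_files]  # "fr" or "it"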


# Model with weights
model = wavlm_phoneme_fr_it.WavLMPhonemeFrIt.from_pretrained(
    "hugofara/wavlm-base-plus-phonemizer-fr-it"
)
# Do inference
with torch.no_grad():
    logits = model(**inputs).logits

# Simple ArgMax for demonstration
label_ids = torch.argmax(logits, -1)
predictions = processor.batch_decode(label_ids)
print("Final phonemes are:", "".join(predictions))
# Should output: "sakapitalɛtsɑ̃kɛʁ"

Intended audience

This model was mainly designed for clinicians who need to transcribe large volumes of audio. As it was trained on adult voices, it shares the speech recognition biases of a typical adult listener: it normalizes accents as long as they are widespread.

It is forbidden to use this model for any harmful purpose.

Training Details

Training Data

The dataset was adapted from Common Voice 17.0, French + Italian versions. To get an IPA representation of the sentences, a text-to-phoneme model was used: charsiu/g2p_multilingual_byT5_small_100. The language of each sample (either French or Italian) was also saved as a dataset feature.

Training Procedure

Only the training split of Common Voice 17.0 is used during training.

First, only the language model head (a linear layer) is trained, while the weights of both the feature encoder and the transformer are frozen. We use a three-phase learning rate schedule: a linear warm-up, then a constant learning rate, then a linear decrease. The loss is a CTC loss, and the evaluation metric is the Phoneme Error Rate (PER). This initial phase stops once the PER drops below 60%. Due to the size of the dataset, one epoch is enough.

For the second phase of training, we unfreeze the transformer and restart the same procedure from scratch, including the three-phase learning rate schedule. At the time of writing, this phase ran for only three epochs, to avoid overfitting.
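As an illustration of a warm-up, constant, then linear-decrease schedule, here is a minimal pure-Python sketch. The function name, the peak learning rate and the 10% warm-up and decay fractions are assumptions chosen for the example; the actual training values are not given here.

```python
def three_phase_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.1):
    """Learning rate at `step`: linear warm-up, constant plateau, linear decrease."""
    # Fractions are illustrative assumptions, not the values used in training
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Phase 1: ramp up linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Phase 2: hold the learning rate constant
        return peak_lr
    # Phase 3: ramp down linearly from peak_lr to 0
    return peak_lr * max(0, total_steps - step) / decay_steps


schedule = [three_phase_lr(s, total_steps=1000, peak_lr=1e-3) for s in range(1001)]
print(schedule[0], schedule[500], schedule[1000])
```

In a PyTorch training loop, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier (the value divided by `peak_lr`) instead of the absolute learning rate.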

Evaluation

The results are measured in Phoneme Error Rate (PER). On the test set of Common Voice 17.0, the model reaches a PER of almost 10%.
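For reference, PER is commonly computed as the edit distance (substitutions, insertions and deletions) between the predicted and reference phoneme sequences, divided by the number of reference phonemes. The sketch below follows that common definition and aggregates over the whole corpus; whether the reported figure was aggregated this way or averaged per utterance is an assumption. Function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phonemes."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by the total number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# One substitution (t -> d) over nine reference phonemes
reference = ["s", "a", "k", "a", "p", "i", "t", "a", "l"]
hypothesis = ["s", "a", "k", "a", "p", "i", "d", "a", "l"]
per = phoneme_error_rate([reference], [hypothesis])
print(f"PER: {per:.3f}")  # PER: 0.111
```

Phonemes are passed as lists of tokens rather than raw strings, because some IPA symbols (for instance nasal vowels) span several Unicode code points.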

Technical Specifications

The model contains WavLM Base+ For CTC, which has a language model head.

This linear classifier has the following inputs:

The first input is the language (0 for French, 1 for Italian).
The next 768 are the raw outputs of WavLM Base+.
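The layout of the classifier input described above can be sketched as follows. This assumes the language flag is prepended to every time frame of the WavLM output, which is one plausible reading of the description; `build_head_inputs` is a hypothetical helper, not the model's actual code.

```python
LANGUAGE_IDS = {"fr": 0.0, "it": 1.0}  # 0 for French, 1 for Italian
HIDDEN_SIZE = 768                      # size of a WavLM Base+ output frame


def build_head_inputs(language, frames):
    """Prepend the language flag to each 768-dimensional frame (769 values per frame)."""
    flag = LANGUAGE_IDS[language]
    rows = []
    for frame in frames:
        if len(frame) != HIDDEN_SIZE:
            raise ValueError(f"expected {HIDDEN_SIZE} features, got {len(frame)}")
        rows.append([flag] + list(frame))
    return rows


# Three dummy frames standing in for WavLM Base+ outputs
dummy_frames = [[0.0] * HIDDEN_SIZE for _ in range(3)]
head_inputs = build_head_inputs("it", dummy_frames)
print(len(head_inputs), len(head_inputs[0]), head_inputs[0][0])  # 3 769 1.0
```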

To get phonemes from this output, you can simply use an arg max and map the indices over vocab.json.
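With a CTC model, greedy decoding usually also merges consecutive repeated indices and drops the blank token, which is what `processor.batch_decode` takes care of in the usage example. The sketch below shows the idea on a toy vocabulary; the four-token vocabulary is made up for illustration (the real mapping lives in vocab.json), and treating `[PAD]` as the CTC blank is an assumption based on the usual `Wav2Vec2CTCTokenizer` convention.

```python
def greedy_ctc_decode(logits, id_to_token, blank_token="[PAD]"):
    """Arg max per frame, merge consecutive repeats, then drop blank tokens."""
    ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    tokens = []
    previous = None
    for idx in ids:
        if idx != previous:
            token = id_to_token[idx]
            if token != blank_token:
                tokens.append(token)
        previous = idx
    return "".join(tokens)


# Toy vocabulary, in the same token -> index layout as a vocab.json file
toy_vocab = {"[PAD]": 0, "s": 1, "a": 2, "k": 3}
id_to_token = {index: token for token, index in toy_vocab.items()}


def one_hot(index, size=4):
    """Fake logits for one frame, peaking at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frame-wise arg max is: s s [PAD] a a [PAD] a k
toy_logits = [one_hot(i) for i in (1, 1, 0, 2, 2, 0, 2, 3)]
print(greedy_ctc_decode(toy_logits, id_to_token))  # saak
```

The blank between the two runs of "a" is what allows the same phoneme to appear twice in a row in the output.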

Authors

Developed by: HugoFara
Funded by: NCCR Evolving Language

The training was conducted as a part of the NCCR Evolving Language group, a Swiss research institute on language.

It was developed during a study by Prof. Daphné Bavelier and Prof. Angela Pasqualotto.

Related works

The model was created as a successor and an extension to Cnam-LMSSC/wav2vec2-french-phonemizer. The main differences are a more modern base model (WavLM Base+ vs Wav2Vec 2.0) and a different training procedure.

But wait, PER on Cnam-LMSSC/wav2vec2-french-phonemizer is 5%, here it is 10%, isn't that worse?

These are not the same kind of measurement. For the previous model, PER was measured on the training set (with a risk of overfitting), while our PER is measured on data the model never saw. For reference, we once achieved 2% PER on the training set after 100 epochs, yet the PER on the validation set was still 18%.

Nevertheless, the work is ongoing.

See also this very good multilingual version: ASR-Project/Multilingual-PR.

Todo list

Data augmentation to finish the model training.
Cleaner dataset with a better phonemizer.
