---
license: other
datasets:
- openslr/librispeech_asr
language:
- en
metrics:
- wer
tags:
- transformers
- pytorch
- speech-to-text
- conformer
- embedded
- edgeAI
- ExecuTorch
- audioprocessing
- transformer
---
# Arm ExecuTorch Conformer

<!-- Provide a quick summary of what the model is/does. -->

Conformer is a popular Transformer-based speech recognition network, suitable for embedded devices. This repository contains example FP32 trained weights and the associated tokenizer for an implementation of Conformer. We also include a quantized program exported with ExecuTorch, quantized for the ExecuTorch Ethos-U backend, enabling easy deployment on SoCs with an Arm® Ethos™-U NPU.
## Model Details

### Model Description

Conformer is a popular neural network for speech recognition. This repository contains trained weights for the Conformer implementation in https://github.com/sooftware/conformer/

- **Developed by:** Arm
- **Model type:** Transformer
- **Language(s) (NLP):** English
- **License:** BigScience OpenRAIL-M v1.1

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/sooftware/conformer/
- **Paper:** https://arxiv.org/abs/2005.08100

## Uses

You need to install ExecuTorch 1.0 with `$ pip install executorch`.

After downloading the quantized exported graph module, you can directly call ExecuTorch's `to_edge_transform_and_lower` API.
The `to_edge_transform_and_lower` API converts the quantized exported program into a backend-specific command stream for the Ethos-U.
The end result is a `.pte` file for your variant of the Ethos-U.
Below is an example script that produces a `.pte` file for the Ethos-U85 256 MAC configuration in the `Shared_Sram` memory mode.
```python
import torch
from executorch.backends.arm.ethosu import EthosUPartitioner, EthosUCompileSpec
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.extension.export_util.utils import save_pte_program

def main():
    quant_exported_program = torch.export.load("Conformer_ArmQuantizer_quant_exported_program.pt2")
    compile_spec = EthosUCompileSpec(
            target="ethos-u85-256",
            system_config="Ethos_U85_SYS_Flash_High",
            memory_mode="Shared_Sram",
            extra_flags=["--output-format=raw", "--debug-force-regor"],
        )
    partitioner = EthosUPartitioner(compile_spec)
    print(
        "Calling to_edge_transform_and_lower - lowering to TOSA and compiling for the Ethos-U hardware"
    )
    # Lower the exported program to the Ethos-U backend
    edge_program_manager = to_edge_transform_and_lower(
        quant_exported_program,
        partitioner=[partitioner],
        compile_config=EdgeCompileConfig(
            _check_ir_validity=False,
        ),
    )
    executorch_program_manager = edge_program_manager.to_executorch(
        config=ExecutorchBackendConfig(extract_delegate_segments=False)
    )
    save_pte_program(
        executorch_program_manager, "conformer_quantized.pte"
    )


if __name__ == "__main__":
    main()
```

 
## How to Get Started with the Model

You can directly download the quantized exported program for the Ethos-U backend (`Conformer_ArmQuantizer_quant_exported_program.pt2`) and call the `to_edge_transform_and_lower` ExecuTorch API.
This means you don't need to train the model from scratch, and you don't need to load and pre-process a representative dataset for calibration. You can focus on developing the application that runs on the device.
For an example end-to-end speech-to-text application running on an embedded platform, have a look at https://gitlab.arm.com/artificial-intelligence/ethos-u/ml-embedded-evaluation-kit/-/blob/experimental/executorch/docs/use_cases/asr.md
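
Before lowering, you can sanity-check the downloaded artifact on the host; a minimal sketch using the standard `torch.export` API:

```python
import torch

# Load the quantized exported program downloaded from this repository.
ep = torch.export.load("Conformer_ArmQuantizer_quant_exported_program.pt2")

# Print the FX graph; the quantize/dequantize ops inserted during
# quantization should be visible ahead of lowering.
ep.graph_module.print_readable()
```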

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
We used the LibriSpeech 960h dataset, which is composed of 460h of clean audio and 500h of noisier audio. We obtained the dataset through the PyTorch torchaudio library.
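
A minimal sketch of assembling the 960h training set with torchaudio; the `./data` root directory is an arbitrary choice:

```python
import torch
import torchaudio

# The 960h set is the union of torchaudio's three LibriSpeech training splits:
# 100h clean + 360h clean + 500h "other" (noisier) audio.
splits = ["train-clean-100", "train-clean-360", "train-other-500"]
train_set = torch.utils.data.ConcatDataset(
    [
        torchaudio.datasets.LIBRISPEECH(root="./data", url=split, download=True)
        for split in splits
    ]
)

# Each item: (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, transcript, *_ = train_set[0]
```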


### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
If you want to train the Conformer model from scratch, you can do so by following the instructions at https://github.com/Arm-Examples/ML-examples/tree/main/pytorch-conformer-train-quantize/training
We used an AWS g5.24xlarge instance to train the network.

#### Preprocessing 

We first train a tokenizer on the LibriSpeech dataset. The tokenizer converts labels into tokens. For example, in English it is very common to have 's at the end of words; the tokenizer will identify that pattern and assign a dedicated token to the 's combination.
The code to obtain the tokenizer is available at https://github.com/Arm-Examples/ML-examples/blob/main/pytorch-conformer-train-quantize/training/build_sp_128_librispeech.py . The trained tokenizer is also available in this Hugging Face repository.
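
The `build_sp_128_librispeech.py` script name suggests a SentencePiece tokenizer with a 128-token vocabulary. A minimal sketch of loading and using such a tokenizer; the `.model` filename below is a hypothetical placeholder, substitute whichever tokenizer file ships in this repository:

```python
import sentencepiece as spm

# Hypothetical filename - use the tokenizer file from this repository.
sp = spm.SentencePieceProcessor(model_file="librispeech_sp_128.model")

# Frequent patterns such as "'s" typically receive a dedicated token.
print(sp.encode("he's here", out_type=str))  # subword pieces
print(sp.encode("he's here", out_type=int))  # token ids
```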

We also apply a MelSpectrogram to the input audio, as per the Conformer paper. The LibriSpeech dataset contains audio recordings sampled at 16 kHz. The Conformer paper recommends a 25 ms window length, corresponding to 400 samples (16000 × 0.025 = 400), and a 10 ms stride, corresponding to 160 samples (16000 × 0.01 = 160). We use 80 filter banks, as recommended by the paper, and a 512-point FFT.
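
A minimal sketch of this front end using `torchaudio.transforms.MelSpectrogram`; the exact normalization and any dB scaling used in the training code may differ:

```python
import torch
import torchaudio

# Mel spectrogram front end matching the parameters described above.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # LibriSpeech sample rate
    n_fft=512,          # 512-point FFT
    win_length=400,     # 25 ms window at 16 kHz
    hop_length=160,     # 10 ms stride at 16 kHz
    n_mels=80,          # 80 mel filter banks
)

waveform = torch.randn(1, 16000)    # one second of dummy audio
features = mel_transform(waveform)  # shape: (1, 80, num_frames)
```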


#### Training Hyperparameters

- **Training regime:** The model is trained in FP32
- **Epochs:** 160
- **Batch size:** 96
- **Learning rate:** 0.0005
- **Weight decay:** 1e-6
- **Warmup-epochs:** 2.0
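
A minimal sketch of how these values might map onto a PyTorch optimizer with linear warmup; the optimizer choice (Adam) and the exact warmup schedule are assumptions rather than a record of the actual training code:

```python
import torch

# Hypothetical model and epoch size, for illustration only.
model = torch.nn.Linear(80, 128)
steps_per_epoch = 1000
warmup_epochs = 2.0

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.0005,          # learning rate from the list above
    weight_decay=1e-6,  # weight decay from the list above
)

# Linear warmup over the first two epochs, then a constant learning rate.
warmup_steps = int(warmup_epochs * steps_per_epoch)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)
```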


### Testing Data

We test the model on the LibriSpeech `test-clean` dataset and obtain a 7% word error rate (WER). The accuracy of the model may be improved by training on additional datasets and through data augmentation techniques such as time slicing.
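
WER can be reproduced with any standard implementation; a minimal sketch using the `jiwer` package, with the decoding step that produces the hypotheses omitted:

```python
import jiwer

# references: ground-truth transcripts from test-clean.
# hypotheses: the model's decoded outputs (decoding not shown here).
references = ["the quick brown fox"]
hypotheses = ["the quick brown box"]

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")  # 25.00% for this toy pair
```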