Upload folder using huggingface_hub
- .gitattributes +6 -0
- .gitignore +1 -0
- .huggingfaceignore +1 -0
- README.md +74 -0
- am.mvn +8 -0
- chinese.wav +3 -0
- chn_jpn_yue_eng_ko_spectok.bpe.model +3 -0
- convert_rknn.py +95 -0
- embedding.npy +3 -0
- english.wav +3 -0
- fsmn-am.mvn +8 -0
- fsmn-config.yaml +59 -0
- fsmnvad-offline.onnx +3 -0
- librknnrt.so +3 -0
- requirements.txt +19 -0
- rknn_toolkit_lite2-2.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl +3 -0
- sense-voice-encoder.rknn +3 -0
- sensevoice_rknn.py +1402 -0
- wget-log +6 -0
.gitattributes
CHANGED
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+output.wav filter=lfs diff=lfs merge=lfs -text
+sense-voice-encoder.rknn filter=lfs diff=lfs merge=lfs -text
+chinese.wav filter=lfs diff=lfs merge=lfs -text
+english.wav filter=lfs diff=lfs merge=lfs -text
+librknnrt.so filter=lfs diff=lfs merge=lfs -text
+rknn_toolkit_lite2-2.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1 @@
myenv/
.huggingfaceignore
ADDED
@@ -0,0 +1 @@
myenv/
README.md
ADDED
@@ -0,0 +1,74 @@
---
license: agpl-3.0
language:
- en
- zh
- ja
- ko
base_model: lovemefan/SenseVoice-onnx
tags:
- rknn
---

# SenseVoiceSmall-RKNN2

SenseVoice is an audio foundation model with audio understanding capabilities, including Automatic Speech Recognition (ASR), Language Identification (LID), Speech Emotion Recognition (SER), and Acoustic Event Classification (AEC) or Acoustic Event Detection (AED).

Currently, SenseVoice-small supports multilingual speech recognition, emotion recognition, and event detection for Chinese, Cantonese, English, Japanese, and Korean, with very low inference latency.

- Inference speed (RKNN2): about 20x real-time on a single NPU core of the RK3588 (20 seconds of audio processed per second), roughly 6 times faster than the official Whisper model provided in the rknn-model-zoo.
- Memory usage (RKNN2): about 1.1 GB

## Usage

1. Clone the project to your local machine.

2. Install the dependencies:

```bash
pip install kaldi_native_fbank onnxruntime sentencepiece soundfile pyyaml "numpy<2"

pip install rknn_toolkit_lite2-2.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
```

The .whl file comes from the [rknn-toolkit2 repository](https://github.com/airockchip/rknn-toolkit2/blob/master/rknn-toolkit-lite2/packages/rknn_toolkit_lite2-2.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl). `numpy<2` is quoted so the shell does not treat `<` as a redirect.

3. Copy librknnrt.so to /usr/lib/.

Source of librknnrt.so: https://github.com/airockchip/rknn-toolkit2/blob/master/rknpu2/runtime/Linux/librknn_api/aarch64/librknnrt.so

4. Run:

```bash
python ./sensevoice_rknn.py --audio_file english.wav
```

If recognition does not work correctly with your own audio files, convert them to 16 kHz, 16-bit, mono WAV format first:

```bash
ffmpeg -i input.mp3 -f wav -acodec pcm_s16le -ac 1 -ar 16000 output.wav
```
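
The components in `sensevoice_rknn.py` can also be used as a library. Below is a minimal sketch using the class names and signatures defined in that file; the `language=0` index and `use_itn=False` flag are illustrative assumptions, and the audio must be short enough to fit the fixed 171-frame encoder input:

```python
from sensevoice_rknn import WavFrontend, SenseVoiceInferenceSession

# Fbank + LFR stacking + CMVN, yielding (frames, 560) features.
frontend = WavFrontend(cmvn_file="am.mvn")
feats = frontend.get_features("english.wav")

model = SenseVoiceInferenceSession(
    "embedding.npy",                         # query embedding table
    "sense-voice-encoder.rknn",              # encoder, runs on the RK3588 NPU
    "chn_jpn_yue_eng_ko_spectok.bpe.model",  # SentencePiece model for decoding
)
# __call__ prepends the language / event / text-norm queries, pads to 171
# frames, runs the NPU encoder, and CTC-decodes the result.
text = model(feats[None, ...], language=0, use_itn=False)
print(text)
```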
## RKNN Model Conversion

You need rknn-toolkit2 v2.1.0 or later installed.

1. Download or convert the ONNX model.

You can download the ONNX model from https://huggingface.co/lovemefan/SenseVoice-onnx.
It should also be possible to convert the PyTorch model to ONNX following the documentation at https://github.com/FunAudioLLM/SenseVoice.

The model file should be named 'sense-voice-encoder.onnx' and placed in the same directory as the conversion script.

2. Convert to an RKNN model:

```bash
python convert_rknn.py
```
## Known Issues

- With fp16 inference on RKNN2, intermediate values may overflow and become inf. You can work around this by scaling down the input data: set `SPEECH_SCALE` to a smaller value in `sensevoice_rknn.py`.
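
For intuition, fp16 saturates at 65504, so products of moderately large activations overflow quickly; a small numpy illustration (not part of the repo):

```python
import numpy as np

x = np.float16(300.0)
print(x * x)  # inf: 300 * 300 = 90000 exceeds the fp16 maximum of 65504

x = np.float16(300.0 * 0.5)  # the same value scaled by SPEECH_SCALE = 1/2
print(x * x)  # 22496.0: stays in range (22500 rounds to the nearest fp16)
```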
## References

- [FunAudioLLM/SenseVoiceSmall](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
- [lovemefan/SenseVoice-python](https://github.com/lovemefan/SenseVoice-python)
am.mvn
ADDED
@@ -0,0 +1,8 @@
<Nnet>
<Splice> 560 560
[ 0 ]
<AddShift> 560 560
<LearnRateCoef> 0 [ -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 
-13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 ]
<Rescale> 560 560
<LearnRateCoef> 0 [ 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 
0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 ]
</Nnet>
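These are Kaldi-style CMVN stats over the 560-dim LFR-stacked features (80 mel bins x lfr_m=7): the `<AddShift>` row holds negated means and the `<Rescale>` row inverse standard deviations. A sketch of how `WavFrontend.apply_cmvn` in `sensevoice_rknn.py` consumes them (values below are illustrative, not taken from the file):

```python
import numpy as np

feats = np.random.randn(10, 560).astype(np.float32)  # LFR-stacked fbank frames
neg_means = np.full((560,), -13.5)  # <AddShift> row: negated feature means
inv_stds = np.full((560,), 0.15)    # <Rescale> row: inverse standard deviations
normalized = (feats + neg_means) * inv_stds  # matches apply_cmvn: (inputs + means) * vars
```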
chinese.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b6f02d2c58b9a8a294a306ccb60bdf667587d74984915a3ec87a6de5e04bb020
size 1289994
chn_jpn_yue_eng_ko_spectok.bpe.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aa87f86064c3730d799ddf7af3c04659151102cba548bce325cf06ba4da4e6a8
size 377341
convert_rknn.py
ADDED
@@ -0,0 +1,95 @@
#!/usr/bin/env python
# coding: utf-8

import os
from rknn.api import RKNN
from math import exp
from sys import exit
import argparse
import onnxscript
from onnxscript.rewriter import pattern
import onnx.numpy_helper as onh
import numpy as np
import onnx
import onnxruntime as ort
from rknn.utils import onnx_edit

os.chdir(os.path.dirname(os.path.abspath(__file__)))

speech_length = 171

def convert_encoder():
    rknn = RKNN(verbose=True)

    ONNX_MODEL = "sense-voice-encoder.onnx"
    RKNN_MODEL = ONNX_MODEL.replace(".onnx", ".rknn")
    DATASET = "dataset.txt"
    QUANTIZE = False

    # A nasty surprise right away: RKNN's first constant-folding pass fails inside this
    # subgraph, so extract it and run it separately with onnxruntime first,
    # then save the subgraph's output and feed it to RKNN as a constant input.
    onnx.utils.extract_model(ONNX_MODEL, "extract_model.onnx", ['speech_lengths'], ['/make_pad_mask/Cast_2_output_0'])
    sess = ort.InferenceSession("extract_model.onnx", providers=['CPUExecutionProvider'])
    extract_result = sess.run(None, {"speech_lengths": np.array([speech_length], dtype=np.int64)})[0]

    # Remove the redundant transpose at the end of the model; inference time drops from 365 ms to 350 ms.
    ret = onnx_edit(model = ONNX_MODEL,
                    export_path = ONNX_MODEL.replace(".onnx", "_edited.onnx"),
                    # # 1, len, 25055 -> 1, 25055, 1, len  # this variant is broken, unbelievably
                    # outputs_transform = {'encoder_out': 'a,b,c->a,c,1,b'},
                    outputs_transform = {'encoder_out': 'a,b,c->a,c,b'},
                    )
    ONNX_MODEL = ONNX_MODEL.replace(".onnx", "_edited.onnx")

    # pre-process config
    print('--> Config model')
    rknn.config(quantized_algorithm='normal', quantized_method='channel', target_platform='rk3588', optimization_level=3)
    print('done')

    # Load ONNX model
    print("--> Loading model")
    ret = rknn.load_onnx(
        model=ONNX_MODEL,
        inputs=["speech", "/make_pad_mask/Cast_2_output_0"],
        input_size_list=[[1, speech_length, 560], [extract_result.shape[0], extract_result.shape[1]]],
        input_initial_val=[None, extract_result],
        # outputs=["output"]
    )

    if ret != 0:
        print('Load model failed!')
        exit(ret)
    print('done')

    # Build model
    print('--> Building model')
    ret = rknn.build(do_quantization=QUANTIZE, dataset=DATASET, rknn_batch_size=None)
    if ret != 0:
        print('Build model failed!')
        exit(ret)
    print('done')

    # export
    print('--> Export RKNN model')
    ret = rknn.export_rknn(RKNN_MODEL)
    if ret != 0:
        print('Export RKNN model failed!')
        exit(ret)
    print('done')

# usage: python convert_rknn.py encoder|all

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("model", type=str, help="model to convert", choices=["encoder", "all"], nargs='?')
    args = parser.parse_args()
    if args.model is None:
        args.model = "all"

    if args.model == "encoder":
        convert_encoder()
    elif args.model == "all":
        convert_encoder()
    else:
        print(f"Unknown model: {args.model}")
        exit(1)
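Since the export transposes `encoder_out` from (1, len, vocab) to (1, vocab, len), it can be worth sanity-checking the edited ONNX model's output layout with onnxruntime before the RKNN build. A hypothetical sketch reusing the same input names and pad-mask trick as `convert_encoder()`:

```python
import numpy as np
import onnx
import onnxruntime as ort

# Recompute the pad-mask constant exactly as convert_rknn.py does.
onnx.utils.extract_model("sense-voice-encoder.onnx", "extract_model.onnx",
                         ["speech_lengths"], ["/make_pad_mask/Cast_2_output_0"])
mask_sess = ort.InferenceSession("extract_model.onnx", providers=["CPUExecutionProvider"])
mask = mask_sess.run(None, {"speech_lengths": np.array([171], dtype=np.int64)})[0]

# Run the edited model once and confirm the output is (1, vocab, 171).
sess = ort.InferenceSession("sense-voice-encoder_edited.onnx", providers=["CPUExecutionProvider"])
speech = np.zeros((1, 171, 560), dtype=np.float32)
out = sess.run(None, {"speech": speech, "/make_pad_mask/Cast_2_output_0": mask})[0]
print(out.shape)
```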
embedding.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:83cf1fc5680fdf6d7edb411be5ce351cad4eca03b29a5bf5050aa19dfcc12267
size 35968
english.wav
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9774197507f8ea811cb7f9f25b81c0e7b25bce55c125774b5ca4e832008727a1
size 587660
fsmn-am.mvn
ADDED
@@ -0,0 +1,8 @@
<Nnet>
<Splice> 400 400
[ 0 ]
<AddShift> 400 400
<LearnRateCoef> 0 [ -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 -13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 -8.311879 -8.600912 -9.615928 -10.43595 -11.21292 -11.88333 -12.36243 -12.63706 -12.8818 -12.83066 -12.89103 -12.95666 -13.19763 -13.40598 -13.49113 -13.5546 -13.55639 -13.51915 -13.68284 -13.53289 -13.42107 -13.65519 -13.50713 -13.75251 -13.76715 -13.87408 -13.73109 -13.70412 -13.56073 -13.53488 -13.54895 -13.56228 -13.59408 -13.62047 -13.64198 -13.66109 
-13.62669 -13.58297 -13.57387 -13.4739 -13.53063 -13.48348 -13.61047 -13.64716 -13.71546 -13.79184 -13.90614 -14.03098 -14.18205 -14.35881 -14.48419 -14.60172 -14.70591 -14.83362 -14.92122 -15.00622 -15.05122 -15.03119 -14.99028 -14.92302 -14.86927 -14.82691 -14.7972 -14.76909 -14.71356 -14.61277 -14.51696 -14.42252 -14.36405 -14.30451 -14.23161 -14.19851 -14.16633 -14.15649 -14.10504 -13.99518 -13.79562 -13.3996 -12.7767 -11.71208 ]
<Rescale> 400 400
<LearnRateCoef> 0 [ 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 0.155775 0.154484 0.1527379 0.1518718 0.1506028 0.1489256 0.147067 0.1447061 0.1436307 0.1443568 0.1451849 0.1455157 0.1452821 0.1445717 0.1439195 0.1435867 0.1436018 0.1438781 0.1442086 0.1448844 0.1454756 0.145663 0.146268 0.1467386 0.1472724 0.147664 0.1480913 0.1483739 0.1488841 0.1493636 0.1497088 0.1500379 0.1502916 0.1505389 0.1506787 0.1507102 0.1505992 0.1505445 
0.1505938 0.1508133 0.1509569 0.1512396 0.1514625 0.1516195 0.1516156 0.1515561 0.1514966 0.1513976 0.1512612 0.151076 0.1510596 0.1510431 0.151077 0.1511168 0.1511917 0.151023 0.1508045 0.1505885 0.1503493 0.1502373 0.1501726 0.1500762 0.1500065 0.1499782 0.150057 0.1502658 0.150469 0.1505335 0.1505505 0.1505328 0.1504275 0.1502438 0.1499674 0.1497118 0.1494661 0.1493102 0.1493681 0.1495501 0.1499738 0.1509654 ]
</Nnet>
fsmn-config.yaml
ADDED
@@ -0,0 +1,59 @@
WavFrontend:
  frontend_conf:
    fs: 16000
    window: hamming
    n_mels: 80
    frame_length: 25
    frame_shift: 10
    dither: 0.0
    lfr_m: 5
    lfr_n: 1

FSMN:
  use_cuda: False
  CUDAExecutionProvider:
    device_id: 0
    arena_extend_strategy: kNextPowerOfTwo
    cudnn_conv_algo_search: EXHAUSTIVE
    do_copy_in_default_stream: true
  encoder_conf:
    input_dim: 400
    input_affine_dim: 140
    fsmn_layers: 4
    linear_dim: 250
    proj_dim: 128
    lorder: 20
    rorder: 0
    lstride: 1
    rstride: 0
    output_affine_dim: 140
    output_dim: 248

vadPostArgs:
  sample_rate: 16000
  detect_mode: 1
  snr_mode: 0
  max_end_silence_time: 800
  max_start_silence_time: 3000
  do_start_point_detection: True
  do_end_point_detection: True
  window_size_ms: 200
  sil_to_speech_time_thres: 150
  speech_to_sil_time_thres: 150
  speech_2_noise_ratio: 1.0
  do_extend: 1
  lookback_time_start_point: 200
  lookahead_time_end_point: 100
  max_single_segment_time: 10000
  snr_thres: -100.0
  noise_frame_num_used_for_snr: 100
  decibel_thres: -100.0
  speech_noise_thres: 0.6
  fe_prior_thres: 0.0001
  silence_pdf_num: 1
  sil_pdf_ids: [ 0 ]
  speech_noise_thresh_low: -0.1
  speech_noise_thresh_high: 0.3
  output_frame_probs: False
  frame_in_ms: 10
  frame_length_ms: 25
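`sensevoice_rknn.py` reads this file with `read_yaml` and passes its sections to the VAD frontend and model. A minimal sketch assuming the names defined in that script; note the VAD frontend uses lfr_m=5, matching the 400-dim stats in fsmn-am.mvn (80 mel bins x 5):

```python
from sensevoice_rknn import read_yaml, WavFrontend

config = read_yaml("fsmn-config.yaml")
frontend = WavFrontend(cmvn_file="fsmn-am.mvn", **config["WavFrontend"]["frontend_conf"])
feats = frontend.get_features("english.wav")  # (frames, 400) features for the FSMN VAD
```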
fsmnvad-offline.onnx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4bbd68b11519e916b6871ff6f8df15e2100936b256be9cb104cd63fb7c859965
size 1725472
librknnrt.so
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d31fc19c85b85f6091b2bd0f6af9d962d5264a4e410bfb536402ec92bac738e8
size 7726232
requirements.txt
ADDED
@@ -0,0 +1,19 @@
cffi==1.17.1
coloredlogs==15.0.1
flatbuffers==25.2.10
humanfriendly==10.0
kaldi-native-fbank==1.21.2
mpmath==1.3.0
numpy==2.2.6
onnxruntime==1.22.0
packaging==25.0
protobuf==6.31.1
psutil==7.0.0
pycparser==2.22
PyYAML==6.0.2
# rknn-toolkit-lite2 @ file:./rknn_toolkit_lite2-2.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
ruamel.yaml==0.18.14
ruamel.yaml.clib==0.2.12
sentencepiece==0.2.0
soundfile==0.13.1
sympy==1.14.0
rknn_toolkit_lite2-2.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:821e80c95e6838308c133915660b1a6ae78bb8d079b2cbbd46a02dae61192d33
size 559386
sense-voice-encoder.rknn
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8db70c1a8d4887e35dff55ab0f5d8da283d32359bd1599ece51eb81f99a6f468
size 485687354
sensevoice_rknn.py
ADDED
@@ -0,0 +1,1402 @@
1 |
+
# File: onnx/fsmn_vad_ort_session.py
|
2 |
+
# ```py
|
3 |
+
|
4 |
+
# -*- coding:utf-8 -*-
|
5 |
+
# @FileName :fsmn_vad_ort_session.py.py
|
6 |
+
# @Time :2024/8/31 16:45
|
7 |
+
# @Author :lovemefan
|
8 |
+
# @Email :[email protected]
|
9 |
+
|
10 |
+
import argparse
|
11 |
+
import logging
|
12 |
+
import math
|
13 |
+
import os
|
14 |
+
import time
|
15 |
+
import warnings
|
16 |
+
from enum import Enum
|
17 |
+
from pathlib import Path
|
18 |
+
from typing import Any, Dict, List, Tuple, Union
|
19 |
+
|
20 |
+
import kaldi_native_fbank as knf
|
21 |
+
import numpy as np
|
22 |
+
import sentencepiece as spm
|
23 |
+
import soundfile as sf
|
24 |
+
import yaml
|
25 |
+
from onnxruntime import (GraphOptimizationLevel, InferenceSession,
|
26 |
+
SessionOptions, get_available_providers, get_device)
|
27 |
+
from rknnlite.api.rknn_lite import RKNNLite
|
28 |
+
|
29 |
+
RKNN_INPUT_LEN = 171
|
30 |
+
|
31 |
+
SPEECH_SCALE = 1/2 # 因为是fp16推理,如果中间结果太大可能会溢出变inf,所以需要缩放一下
|
32 |
+
|
33 |
+
class VadOrtInferRuntimeSession:
|
34 |
+
def __init__(self, config, root_dir: Path):
|
35 |
+
sess_opt = SessionOptions()
|
36 |
+
sess_opt.log_severity_level = 4
|
37 |
+
sess_opt.enable_cpu_mem_arena = False
|
38 |
+
sess_opt.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
|
39 |
+
|
40 |
+
cuda_ep = "CUDAExecutionProvider"
|
41 |
+
cpu_ep = "CPUExecutionProvider"
|
42 |
+
cpu_provider_options = {
|
43 |
+
"arena_extend_strategy": "kSameAsRequested",
|
44 |
+
}
|
45 |
+
|
46 |
+
EP_list = []
|
47 |
+
if (
|
48 |
+
config["use_cuda"]
|
49 |
+
and get_device() == "GPU"
|
50 |
+
and cuda_ep in get_available_providers()
|
51 |
+
):
|
52 |
+
EP_list = [(cuda_ep, config[cuda_ep])]
|
53 |
+
EP_list.append((cpu_ep, cpu_provider_options))
|
54 |
+
|
55 |
+
config["model_path"] = root_dir / str(config["model_path"])
|
56 |
+
self._verify_model(config["model_path"])
|
57 |
+
logging.info(f"Loading onnx model at {str(config['model_path'])}")
|
58 |
+
self.session = InferenceSession(
|
59 |
+
str(config["model_path"]), sess_options=sess_opt, providers=EP_list
|
60 |
+
)
|
61 |
+
|
62 |
+
if config["use_cuda"] and cuda_ep not in self.session.get_providers():
|
63 |
+
logging.warning(
|
64 |
+
f"{cuda_ep} is not available for current env, "
|
65 |
+
f"the inference part is automatically shifted to be "
|
66 |
+
f"executed under {cpu_ep}.\n "
|
67 |
+
"Please ensure the installed onnxruntime-gpu version"
|
68 |
+
" matches your cuda and cudnn version, "
|
69 |
+
"you can check their relations from the offical web site: "
|
70 |
+
"https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html",
|
71 |
+
RuntimeWarning,
|
72 |
+
)
|
73 |
+
|
74 |
+
def __call__(
|
75 |
+
self, input_content: List[Union[np.ndarray, np.ndarray]]
|
76 |
+
) -> np.ndarray:
|
77 |
+
if isinstance(input_content, list):
|
78 |
+
input_dict = {
|
79 |
+
"speech": input_content[0],
|
80 |
+
"in_cache0": input_content[1],
|
81 |
+
"in_cache1": input_content[2],
|
82 |
+
"in_cache2": input_content[3],
|
83 |
+
"in_cache3": input_content[4],
|
84 |
+
}
|
85 |
+
else:
|
86 |
+
input_dict = {"speech": input_content}
|
87 |
+
|
88 |
+
return self.session.run(None, input_dict)
|
89 |
+
|
90 |
+
def get_input_names(
|
91 |
+
self,
|
92 |
+
):
|
93 |
+
return [v.name for v in self.session.get_inputs()]
|
94 |
+
|
95 |
+
def get_output_names(
|
96 |
+
self,
|
97 |
+
):
|
98 |
+
return [v.name for v in self.session.get_outputs()]
|
99 |
+
|
100 |
+
def get_character_list(self, key: str = "character"):
|
101 |
+
return self.meta_dict[key].splitlines()
|
102 |
+
|
103 |
+
def have_key(self, key: str = "character") -> bool:
|
104 |
+
self.meta_dict = self.session.get_modelmeta().custom_metadata_map
|
105 |
+
if key in self.meta_dict.keys():
|
106 |
+
return True
|
107 |
+
return False
|
108 |
+
|
109 |
+
@staticmethod
|
110 |
+
def _verify_model(model_path):
|
111 |
+
model_path = Path(model_path)
|
112 |
+
if not model_path.exists():
|
113 |
+
raise FileNotFoundError(f"{model_path} does not exists.")
|
114 |
+
if not model_path.is_file():
|
115 |
+
raise FileExistsError(f"{model_path} is not a file.")
|
116 |
+
|
117 |
+
# ```
|
118 |
+
|
119 |
+
# File: onnx/sense_voice_ort_session.py
|
120 |
+
# ```py
|
121 |
+
# -*- coding:utf-8 -*-
|
122 |
+
# @FileName :sense_voice_onnxruntime.py
|
123 |
+
# @Time :2024/7/17 20:53
|
124 |
+
# @Author :lovemefan
|
125 |
+
# @Email :[email protected]
|
126 |
+
|
127 |
+
|
128 |
+
formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
|
129 |
+
logging.basicConfig(format=formatter, level=logging.INFO)
|
130 |
+
|
131 |
+
|
132 |
+
class OrtInferRuntimeSession:
|
133 |
+
def __init__(self, model_file, device_id=-1, intra_op_num_threads=4):
|
134 |
+
device_id = str(device_id)
|
135 |
+
sess_opt = SessionOptions()
|
136 |
+
sess_opt.intra_op_num_threads = intra_op_num_threads
|
137 |
+
sess_opt.log_severity_level = 4
|
138 |
+
sess_opt.enable_cpu_mem_arena = False
|
139 |
+
sess_opt.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
|
140 |
+
|
141 |
+
cuda_ep = "CUDAExecutionProvider"
|
142 |
+
cuda_provider_options = {
|
143 |
+
"device_id": device_id,
|
144 |
+
"arena_extend_strategy": "kNextPowerOfTwo",
|
145 |
+
"cudnn_conv_algo_search": "EXHAUSTIVE",
|
146 |
+
"do_copy_in_default_stream": "true",
|
147 |
+
}
|
148 |
+
cpu_ep = "CPUExecutionProvider"
|
149 |
+
cpu_provider_options = {
|
150 |
+
"arena_extend_strategy": "kSameAsRequested",
|
151 |
+
}
|
152 |
+
|
153 |
+
EP_list = []
|
154 |
+
if (
|
155 |
+
device_id != "-1"
|
156 |
+
and get_device() == "GPU"
|
157 |
+
and cuda_ep in get_available_providers()
|
158 |
+
):
|
159 |
+
EP_list = [(cuda_ep, cuda_provider_options)]
|
160 |
+
EP_list.append((cpu_ep, cpu_provider_options))
|
161 |
+
|
162 |
+
self._verify_model(model_file)
|
163 |
+
|
164 |
+
self.session = InferenceSession(
|
165 |
+
model_file, sess_options=sess_opt, providers=EP_list
|
166 |
+
)
|
167 |
+
|
168 |
+
# delete binary of model file to save memory
|
169 |
+
del model_file
|
170 |
+
|
171 |
+
if device_id != "-1" and cuda_ep not in self.session.get_providers():
|
172 |
+
warnings.warn(
|
173 |
+
f"{cuda_ep} is not avaiable for current env, the inference part is automatically shifted to be executed under {cpu_ep}.\n"
|
174 |
+
"Please ensure the installed onnxruntime-gpu version matches your cuda and cudnn version, "
|
175 |
+
"you can check their relations from the offical web site: "
|
176 |
+
"https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html",
|
177 |
+
RuntimeWarning,
|
178 |
+
)
|
179 |
+
|
180 |
+
def __call__(self, input_content) -> np.ndarray:
|
181 |
+
input_dict = dict(zip(self.get_input_names(), input_content))
|
182 |
+
try:
|
183 |
+
result = self.session.run(self.get_output_names(), input_dict)
|
184 |
+
return result
|
185 |
+
except Exception as e:
|
186 |
+
print(e)
|
187 |
+
raise RuntimeError(f"ONNXRuntime inferece failed. ") from e
|
188 |
+
|
189 |
+
def get_input_names(
|
190 |
+
self,
|
191 |
+
):
|
192 |
+
return [v.name for v in self.session.get_inputs()]
|
193 |
+
|
194 |
+
def get_output_names(
|
195 |
+
self,
|
196 |
+
):
|
197 |
+
return [v.name for v in self.session.get_outputs()]
|
198 |
+
|
199 |
+
def get_character_list(self, key: str = "character"):
|
200 |
+
return self.meta_dict[key].splitlines()
|
201 |
+
|
202 |
+
def have_key(self, key: str = "character") -> bool:
|
203 |
+
self.meta_dict = self.session.get_modelmeta().custom_metadata_map
|
204 |
+
if key in self.meta_dict.keys():
|
205 |
+
return True
|
206 |
+
return False
|
207 |
+
|
208 |
+
@staticmethod
|
209 |
+
def _verify_model(model_path):
|
210 |
+
model_path = Path(model_path)
|
211 |
+
if not model_path.exists():
|
212 |
+
raise FileNotFoundError(f"{model_path} does not exists.")
|
213 |
+
if not model_path.is_file():
|
214 |
+
raise FileExistsError(f"{model_path} is not a file.")
|
215 |
+
|
216 |
+
|
217 |
+
def log_softmax(x: np.ndarray) -> np.ndarray:
|
218 |
+
# Subtract the maximum value in each row for numerical stability
|
219 |
+
x_max = np.max(x, axis=-1, keepdims=True)
|
220 |
+
# Calculate the softmax of x
|
221 |
+
softmax = np.exp(x - x_max)
|
222 |
+
softmax_sum = np.sum(softmax, axis=-1, keepdims=True)
|
223 |
+
softmax = softmax / softmax_sum
|
224 |
+
# Calculate the log of the softmax values
|
225 |
+
return np.log(softmax)
|
226 |
+
|
227 |
+
|
228 |
+
class SenseVoiceInferenceSession:
|
229 |
+
def __init__(
|
230 |
+
self,
|
231 |
+
embedding_model_file,
|
232 |
+
encoder_model_file,
|
233 |
+
bpe_model_file,
|
234 |
+
device_id=-1,
|
235 |
+
intra_op_num_threads=4,
|
236 |
+
):
|
237 |
+
logging.info(f"Loading model from {embedding_model_file}")
|
238 |
+
|
239 |
+
self.embedding = np.load(embedding_model_file)
|
240 |
+
logging.info(f"Loading model {encoder_model_file}")
|
241 |
+
start = time.time()
|
242 |
+
self.encoder = RKNNLite(verbose=False)
|
243 |
+
self.encoder.load_rknn(encoder_model_file)
|
244 |
+
self.encoder.init_runtime()
|
245 |
+
|
246 |
+
logging.info(
|
247 |
+
f"Loading {encoder_model_file} takes {time.time() - start:.2f} seconds"
|
248 |
+
)
|
249 |
+
self.blank_id = 0
|
250 |
+
self.sp = spm.SentencePieceProcessor()
|
251 |
+
self.sp.load(bpe_model_file)
|
252 |
+
|
253 |
+
def __call__(self, speech, language: int, use_itn: bool) -> np.ndarray:
|
254 |
+
language_query = self.embedding[[[language]]]
|
255 |
+
|
256 |
+
# 14 means with itn, 15 means without itn
|
257 |
+
text_norm_query = self.embedding[[[14 if use_itn else 15]]]
|
258 |
+
event_emo_query = self.embedding[[[1, 2]]]
|
259 |
+
|
260 |
+
# scale the speech
|
261 |
+
speech = speech * SPEECH_SCALE
|
262 |
+
|
263 |
+
input_content = np.concatenate(
|
264 |
+
[
|
265 |
+
language_query,
|
266 |
+
event_emo_query,
|
267 |
+
text_norm_query,
|
268 |
+
speech,
|
269 |
+
],
|
270 |
+
axis=1,
|
271 |
+
).astype(np.float32)
|
272 |
+
print(input_content.shape)
|
273 |
+
# pad [1, len, ...] to [1, RKNN_INPUT_LEN, ... ]
|
274 |
+
input_content = np.pad(input_content, ((0, 0), (0, RKNN_INPUT_LEN - input_content.shape[1]), (0, 0)))
|
275 |
+
print("padded shape:", input_content.shape)
|
276 |
+
start_time = time.time()
|
277 |
+
encoder_out = self.encoder.inference(inputs=[input_content])[0]
|
278 |
+
end_time = time.time()
|
279 |
+
print(f"encoder inference time: {end_time - start_time:.2f} seconds")
|
280 |
+
# print(encoder_out)
|
281 |
+
def unique_consecutive(arr):
|
282 |
+
if len(arr) == 0:
|
283 |
+
return arr
|
284 |
+
# Create a boolean mask where True indicates the element is different from the previous one
|
285 |
+
mask = np.append([True], arr[1:] != arr[:-1])
|
286 |
+
out = arr[mask]
|
287 |
+
out = out[out != self.blank_id]
|
288 |
+
return out.tolist()
|
289 |
+
|
290 |
+
#现在shape变成了1, n_vocab, n_seq. 这里axis需要改一下
|
291 |
+
# hypos = unique_consecutive(encoder_out[0].argmax(axis=-1))
|
292 |
+
hypos = unique_consecutive(encoder_out[0].argmax(axis=0))
|
293 |
+
text = self.sp.DecodeIds(hypos)
|
294 |
+
return text
|
295 |
+
|
296 |
+
# ```
|
297 |
+
|
298 |
+
# File: utils/frontend.py
|
299 |
+
# ```py
|
300 |
+
# -*- coding:utf-8 -*-
|
301 |
+
# @FileName :frontend.py
|
302 |
+
# @Time :2024/7/18 09:39
|
303 |
+
# @Author :lovemefan
|
304 |
+
# @Email :[email protected]
|
305 |
+
|
306 |
+
class WavFrontend:
|
307 |
+
"""Conventional frontend structure for ASR."""
|
308 |
+
|
309 |
+
def __init__(
|
310 |
+
self,
|
311 |
+
cmvn_file: str = None,
|
312 |
+
fs: int = 16000,
|
313 |
+
window: str = "hamming",
|
314 |
+
n_mels: int = 80,
|
315 |
+
frame_length: int = 25,
|
316 |
+
frame_shift: int = 10,
|
317 |
+
lfr_m: int = 7,
|
318 |
+
lfr_n: int = 6,
|
319 |
+
dither: float = 0,
|
320 |
+
**kwargs,
|
321 |
+
) -> None:
|
322 |
+
opts = knf.FbankOptions()
|
323 |
+
opts.frame_opts.samp_freq = fs
|
324 |
+
opts.frame_opts.dither = dither
|
325 |
+
opts.frame_opts.window_type = window
|
326 |
+
opts.frame_opts.frame_shift_ms = float(frame_shift)
|
327 |
+
opts.frame_opts.frame_length_ms = float(frame_length)
|
328 |
+
opts.mel_opts.num_bins = n_mels
|
329 |
+
opts.energy_floor = 0
|
330 |
+
opts.frame_opts.snip_edges = True
|
331 |
+
opts.mel_opts.debug_mel = False
|
332 |
+
self.opts = opts
|
333 |
+
|
334 |
+
self.lfr_m = lfr_m
|
335 |
+
self.lfr_n = lfr_n
|
336 |
+
self.cmvn_file = cmvn_file
|
337 |
+
|
338 |
+
if self.cmvn_file:
|
339 |
+
self.cmvn = self.load_cmvn()
|
340 |
+
self.fbank_fn = None
|
341 |
+
self.fbank_beg_idx = 0
|
342 |
+
self.reset_status()
|
343 |
+
|
344 |
+
def reset_status(self):
|
345 |
+
self.fbank_fn = knf.OnlineFbank(self.opts)
|
346 |
+
self.fbank_beg_idx = 0
|
347 |
+
|
348 |
+
def fbank(self, waveform: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
|
349 |
+
waveform = waveform * (1 << 15)
|
350 |
+
self.fbank_fn = knf.OnlineFbank(self.opts)
|
351 |
+
self.fbank_fn.accept_waveform(self.opts.frame_opts.samp_freq, waveform.tolist())
|
352 |
+
frames = self.fbank_fn.num_frames_ready
|
353 |
+
mat = np.empty([frames, self.opts.mel_opts.num_bins])
|
354 |
+
for i in range(frames):
|
355 |
+
mat[i, :] = self.fbank_fn.get_frame(i)
|
356 |
+
feat = mat.astype(np.float32)
|
357 |
+
feat_len = np.array(mat.shape[0]).astype(np.int32)
|
358 |
+
return feat, feat_len
|
359 |
+
|
360 |
+
def lfr_cmvn(self, feat: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
|
361 |
+
if self.lfr_m != 1 or self.lfr_n != 1:
|
362 |
+
feat = self.apply_lfr(feat, self.lfr_m, self.lfr_n)
|
363 |
+
|
364 |
+
if self.cmvn_file:
|
365 |
+
feat = self.apply_cmvn(feat)
|
366 |
+
|
367 |
+
feat_len = np.array(feat.shape[0]).astype(np.int32)
|
368 |
+
return feat, feat_len
|
369 |
+
|
370 |
+
def load_audio(self, filename: str) -> Tuple[np.ndarray, int]:
|
371 |
+
data, sample_rate = sf.read(
|
372 |
+
filename,
|
373 |
+
always_2d=True,
|
374 |
+
dtype="float32",
|
375 |
+
)
|
376 |
+
assert (
|
377 |
+
sample_rate == 16000
|
378 |
+
), f"Only 16000 Hz is supported, but got {sample_rate}Hz"
|
379 |
+
self.sample_rate = sample_rate
|
380 |
+
data = data[:, 0] # use only the first channel
|
381 |
+
samples = np.ascontiguousarray(data)
|
382 |
+
|
383 |
+
return samples, sample_rate
|
384 |
+
|
385 |
+
@staticmethod
|
386 |
+
def apply_lfr(inputs: np.ndarray, lfr_m: int, lfr_n: int) -> np.ndarray:
|
387 |
+
LFR_inputs = []
|
388 |
+
|
389 |
+
T = inputs.shape[0]
|
390 |
+
T_lfr = int(np.ceil(T / lfr_n))
|
391 |
+
left_padding = np.tile(inputs[0], ((lfr_m - 1) // 2, 1))
|
392 |
+
inputs = np.vstack((left_padding, inputs))
|
393 |
+
T = T + (lfr_m - 1) // 2
|
394 |
+
for i in range(T_lfr):
|
395 |
+
if lfr_m <= T - i * lfr_n:
|
396 |
+
LFR_inputs.append(
|
397 |
+
(inputs[i * lfr_n : i * lfr_n + lfr_m]).reshape(1, -1)
|
398 |
+
)
|
399 |
+
else:
|
400 |
+
# process last LFR frame
|
401 |
+
num_padding = lfr_m - (T - i * lfr_n)
|
402 |
+
frame = inputs[i * lfr_n :].reshape(-1)
|
403 |
+
for _ in range(num_padding):
|
404 |
+
frame = np.hstack((frame, inputs[-1]))
|
405 |
+
|
406 |
+
LFR_inputs.append(frame)
|
407 |
+
LFR_outputs = np.vstack(LFR_inputs).astype(np.float32)
|
408 |
+
return LFR_outputs
|
409 |
+
|
410 |
+
def apply_cmvn(self, inputs: np.ndarray) -> np.ndarray:
|
411 |
+
"""
|
412 |
+
Apply CMVN with mvn data
|
413 |
+
"""
|
414 |
+
frame, dim = inputs.shape
|
415 |
+
means = np.tile(self.cmvn[0:1, :dim], (frame, 1))
|
416 |
+
vars = np.tile(self.cmvn[1:2, :dim], (frame, 1))
|
417 |
+
inputs = (inputs + means) * vars
|
418 |
+
return inputs
|
419 |
+
|
420 |
+
def get_features(self, inputs: Union[str, np.ndarray]) -> Tuple[np.ndarray, int]:
|
421 |
+
if isinstance(inputs, str):
|
422 |
+
inputs, _ = self.load_audio(inputs)
|
423 |
+
|
424 |
+
fbank, _ = self.fbank(inputs)
|
425 |
+
feats = self.apply_cmvn(self.apply_lfr(fbank, self.lfr_m, self.lfr_n))
|
426 |
+
return feats
|
427 |
+
|
428 |
+
def load_cmvn(
|
429 |
+
self,
|
430 |
+
) -> np.ndarray:
|
431 |
+
with open(self.cmvn_file, "r", encoding="utf-8") as f:
|
432 |
+
lines = f.readlines()
|
433 |
+
|
434 |
+
means_list = []
|
435 |
+
vars_list = []
|
436 |
+
for i in range(len(lines)):
|
437 |
+
line_item = lines[i].split()
|
438 |
+
if line_item[0] == "<AddShift>":
|
439 |
+
line_item = lines[i + 1].split()
|
440 |
+
if line_item[0] == "<LearnRateCoef>":
|
441 |
+
add_shift_line = line_item[3 : (len(line_item) - 1)]
|
442 |
+
means_list = list(add_shift_line)
|
443 |
+
continue
|
444 |
+
elif line_item[0] == "<Rescale>":
|
445 |
+
line_item = lines[i + 1].split()
|
446 |
+
if line_item[0] == "<LearnRateCoef>":
|
447 |
+
rescale_line = line_item[3 : (len(line_item) - 1)]
|
448 |
+
vars_list = list(rescale_line)
|
449 |
+
continue
|
450 |
+
|
451 |
+
means = np.array(means_list).astype(np.float64)
|
452 |
+
vars = np.array(vars_list).astype(np.float64)
|
453 |
+
cmvn = np.array([means, vars])
|
454 |
+
return cmvn
|
455 |
+
|
456 |
+
# ```
|
457 |
+
|
458 |
+
# File: utils/fsmn_vad.py
|
459 |
+
# ```py
|
460 |
+
# -*- coding:utf-8 -*-
|
461 |
+
# @FileName :fsmn_vad.py
|
462 |
+
# @Time :2024/8/31 16:50
|
463 |
+
# @Author :lovemefan
|
464 |
+
# @Email :[email protected]
|
465 |
+
|
466 |
+
|
467 |
+
|
468 |
+
def read_yaml(yaml_path: Union[str, Path]) -> Dict:
|
469 |
+
if not Path(yaml_path).exists():
|
470 |
+
raise FileExistsError(f"The {yaml_path} does not exist.")
|
471 |
+
|
472 |
+
with open(str(yaml_path), "rb") as f:
|
473 |
+
data = yaml.load(f, Loader=yaml.Loader)
|
474 |
+
return data
|
475 |
+
|
476 |
+
|
477 |
+
class VadStateMachine(Enum):
    kVadInStateStartPointNotDetected = 1
    kVadInStateInSpeechSegment = 2
    kVadInStateEndPointDetected = 3


class FrameState(Enum):
    kFrameStateInvalid = -1
    kFrameStateSpeech = 1
    kFrameStateSil = 0


# final voice/unvoice state per frame
class AudioChangeState(Enum):
    kChangeStateSpeech2Speech = 0
    kChangeStateSpeech2Sil = 1
    kChangeStateSil2Sil = 2
    kChangeStateSil2Speech = 3
    kChangeStateNoBegin = 4
    kChangeStateInvalid = 5


class VadDetectMode(Enum):
    kVadSingleUtteranceDetectMode = 0
    kVadMutipleUtteranceDetectMode = 1


class VADXOptions:
    def __init__(
        self,
        sample_rate: int = 16000,
        detect_mode: int = VadDetectMode.kVadMutipleUtteranceDetectMode.value,
        snr_mode: int = 0,
        max_end_silence_time: int = 800,
        max_start_silence_time: int = 3000,
        do_start_point_detection: bool = True,
        do_end_point_detection: bool = True,
        window_size_ms: int = 200,
        sil_to_speech_time_thres: int = 150,
        speech_to_sil_time_thres: int = 150,
        speech_2_noise_ratio: float = 1.0,
        do_extend: int = 1,
        lookback_time_start_point: int = 200,
        lookahead_time_end_point: int = 100,
        max_single_segment_time: int = 60000,
        nn_eval_block_size: int = 8,
        dcd_block_size: int = 4,
        snr_thres: float = -100.0,
        noise_frame_num_used_for_snr: int = 100,
        decibel_thres: float = -100.0,
        speech_noise_thres: float = 0.6,
        fe_prior_thres: float = 1e-4,
        silence_pdf_num: int = 1,
        sil_pdf_ids: List[int] = [0],
        speech_noise_thresh_low: float = -0.1,
        speech_noise_thresh_high: float = 0.3,
        output_frame_probs: bool = False,
        frame_in_ms: int = 10,
        frame_length_ms: int = 25,
    ):
        self.sample_rate = sample_rate
        self.detect_mode = detect_mode
        self.snr_mode = snr_mode
        self.max_end_silence_time = max_end_silence_time
        self.max_start_silence_time = max_start_silence_time
        self.do_start_point_detection = do_start_point_detection
        self.do_end_point_detection = do_end_point_detection
        self.window_size_ms = window_size_ms
        self.sil_to_speech_time_thres = sil_to_speech_time_thres
        self.speech_to_sil_time_thres = speech_to_sil_time_thres
        self.speech_2_noise_ratio = speech_2_noise_ratio
        self.do_extend = do_extend
        self.lookback_time_start_point = lookback_time_start_point
        self.lookahead_time_end_point = lookahead_time_end_point
        self.max_single_segment_time = max_single_segment_time
        self.nn_eval_block_size = nn_eval_block_size
        self.dcd_block_size = dcd_block_size
        self.snr_thres = snr_thres
        self.noise_frame_num_used_for_snr = noise_frame_num_used_for_snr
        self.decibel_thres = decibel_thres
        self.speech_noise_thres = speech_noise_thres
        self.fe_prior_thres = fe_prior_thres
        self.silence_pdf_num = silence_pdf_num
        self.sil_pdf_ids = sil_pdf_ids
        self.speech_noise_thresh_low = speech_noise_thresh_low
        self.speech_noise_thresh_high = speech_noise_thresh_high
        self.output_frame_probs = output_frame_probs
        self.frame_in_ms = frame_in_ms
        self.frame_length_ms = frame_length_ms


class E2EVadSpeechBufWithDoa(object):
    def __init__(self):
        self.start_ms = 0
        self.end_ms = 0
        self.buffer = []
        self.contain_seg_start_point = False
        self.contain_seg_end_point = False
        self.doa = 0

    def reset(self):
        self.start_ms = 0
        self.end_ms = 0
        self.buffer = []
        self.contain_seg_start_point = False
        self.contain_seg_end_point = False
        self.doa = 0


class E2EVadFrameProb(object):
    def __init__(self):
        self.noise_prob = 0.0
        self.speech_prob = 0.0
        self.score = 0.0
        self.frame_id = 0
        self.frm_state = 0


class WindowDetector(object):
    def __init__(
        self,
        window_size_ms: int,
        sil_to_speech_time: int,
        speech_to_sil_time: int,
        frame_size_ms: int,
    ):
        self.window_size_ms = window_size_ms
        self.sil_to_speech_time = sil_to_speech_time
        self.speech_to_sil_time = speech_to_sil_time
        self.frame_size_ms = frame_size_ms

        self.win_size_frame = int(window_size_ms / frame_size_ms)
        self.win_sum = 0
        self.win_state = [0] * self.win_size_frame  # initialize the window state buffer

        self.cur_win_pos = 0
        self.pre_frame_state = FrameState.kFrameStateSil
        self.cur_frame_state = FrameState.kFrameStateSil
        self.sil_to_speech_frmcnt_thres = int(sil_to_speech_time / frame_size_ms)
        self.speech_to_sil_frmcnt_thres = int(speech_to_sil_time / frame_size_ms)

        self.voice_last_frame_count = 0
        self.noise_last_frame_count = 0
        self.hydre_frame_count = 0

    def reset(self) -> None:
        self.cur_win_pos = 0
        self.win_sum = 0
        self.win_state = [0] * self.win_size_frame
        self.pre_frame_state = FrameState.kFrameStateSil
        self.cur_frame_state = FrameState.kFrameStateSil
        self.voice_last_frame_count = 0
        self.noise_last_frame_count = 0
        self.hydre_frame_count = 0

    def get_win_size(self) -> int:
        return int(self.win_size_frame)

    def detect_one_frame(
        self, frameState: FrameState, frame_count: int
    ) -> AudioChangeState:
        cur_frame_state = FrameState.kFrameStateSil
        if frameState == FrameState.kFrameStateSpeech:
            cur_frame_state = 1
        elif frameState == FrameState.kFrameStateSil:
            cur_frame_state = 0
        else:
            return AudioChangeState.kChangeStateInvalid
        self.win_sum -= self.win_state[self.cur_win_pos]
        self.win_sum += cur_frame_state
        self.win_state[self.cur_win_pos] = cur_frame_state
        self.cur_win_pos = (self.cur_win_pos + 1) % self.win_size_frame

        if (
            self.pre_frame_state == FrameState.kFrameStateSil
            and self.win_sum >= self.sil_to_speech_frmcnt_thres
        ):
            self.pre_frame_state = FrameState.kFrameStateSpeech
            return AudioChangeState.kChangeStateSil2Speech

        if (
            self.pre_frame_state == FrameState.kFrameStateSpeech
            and self.win_sum <= self.speech_to_sil_frmcnt_thres
        ):
            self.pre_frame_state = FrameState.kFrameStateSil
            return AudioChangeState.kChangeStateSpeech2Sil

        if self.pre_frame_state == FrameState.kFrameStateSil:
            return AudioChangeState.kChangeStateSil2Sil
        if self.pre_frame_state == FrameState.kFrameStateSpeech:
            return AudioChangeState.kChangeStateSpeech2Speech
        return AudioChangeState.kChangeStateInvalid

    def frame_size_ms(self) -> int:
        # note: this method is shadowed by the `self.frame_size_ms` attribute
        # set in __init__, so attribute access returns the int directly
        return int(self.frame_size_ms)


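# A short sketch (not part of the original file) of the sliding-window smoothing
# above: with window_size_ms=200 and frame_size_ms=10 the ring buffer holds 20
# one-bit frame states, and a Sil->Speech transition fires once at least
# sil_to_speech_time / frame_size_ms (here 15) of them are speech frames:
#
#     wd = WindowDetector(200, 150, 150, 10)
#     for _ in range(15):
#         state = wd.detect_one_frame(FrameState.kFrameStateSpeech, 0)
#     assert state == AudioChangeState.kChangeStateSil2Speech
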
class E2EVadModel:
    def __init__(self, config, vad_post_args: Dict[str, Any], root_dir: Path):
        super(E2EVadModel, self).__init__()
        self.vad_opts = VADXOptions(**vad_post_args)
        self.windows_detector = WindowDetector(
            self.vad_opts.window_size_ms,
            self.vad_opts.sil_to_speech_time_thres,
            self.vad_opts.speech_to_sil_time_thres,
            self.vad_opts.frame_in_ms,
        )
        self.model = VadOrtInferRuntimeSession(config, root_dir)
        self.all_reset_detection()

    def all_reset_detection(self):
        # init variables
        self.is_final = False
        self.data_buf_start_frame = 0
        self.frm_cnt = 0
        self.latest_confirmed_speech_frame = 0
        self.lastest_confirmed_silence_frame = -1
        self.continous_silence_frame_count = 0
        self.vad_state_machine = VadStateMachine.kVadInStateStartPointNotDetected
        self.confirmed_start_frame = -1
        self.confirmed_end_frame = -1
        self.number_end_time_detected = 0
        self.sil_frame = 0
        self.sil_pdf_ids = self.vad_opts.sil_pdf_ids
        self.noise_average_decibel = -100.0
        self.pre_end_silence_detected = False
        self.next_seg = True

        self.output_data_buf = []
        self.output_data_buf_offset = 0
        self.frame_probs = []
        self.max_end_sil_frame_cnt_thresh = (
            self.vad_opts.max_end_silence_time - self.vad_opts.speech_to_sil_time_thres
        )
        self.speech_noise_thres = self.vad_opts.speech_noise_thres
        self.scores = None
        self.scores_offset = 0
        self.max_time_out = False
        self.decibel = []
        self.decibel_offset = 0
        self.data_buf_size = 0
        self.data_buf_all_size = 0
        self.waveform = None
        self.reset_detection()

    def reset_detection(self):
        self.continous_silence_frame_count = 0
        self.latest_confirmed_speech_frame = 0
        self.lastest_confirmed_silence_frame = -1
        self.confirmed_start_frame = -1
        self.confirmed_end_frame = -1
        self.vad_state_machine = VadStateMachine.kVadInStateStartPointNotDetected
        self.windows_detector.reset()
        self.sil_frame = 0
        self.frame_probs = []

    def compute_decibel(self) -> None:
        frame_sample_length = int(
            self.vad_opts.frame_length_ms * self.vad_opts.sample_rate / 1000
        )
        frame_shift_length = int(
            self.vad_opts.frame_in_ms * self.vad_opts.sample_rate / 1000
        )
        if self.data_buf_all_size == 0:
            self.data_buf_all_size = len(self.waveform[0])
            self.data_buf_size = self.data_buf_all_size
        else:
            self.data_buf_all_size += len(self.waveform[0])

        for offset in range(
            0, self.waveform.shape[1] - frame_sample_length + 1, frame_shift_length
        ):
            self.decibel.append(
                10
                * np.log10(
                    np.square(
                        self.waveform[0][offset : offset + frame_sample_length]
                    ).sum()
                    + 1e-6
                )
            )

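    # Added energy sketch (values follow the defaults above): at 16 kHz with
    # frame_length_ms=25 and frame_in_ms=10, each decibel entry covers 400
    # samples with a 160-sample hop:
    #
    #     frame = waveform[0][offset : offset + 400]
    #     db = 10 * np.log10(np.square(frame).sum() + 1e-6)
    #
    # so an all-zero frame bottoms out at 10 * log10(1e-6) = -60 dB.
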
    def compute_scores(self, feats: np.ndarray) -> List[np.ndarray]:
        scores = self.model(feats)
        self.vad_opts.nn_eval_block_size = scores[0].shape[1]
        self.frm_cnt += scores[0].shape[1]  # count total frames
        if isinstance(feats, list):
            # return B * T * D
            feats = feats[0]

        assert (
            scores[0].shape[1] == feats.shape[1]
        ), "The shape between feats and scores does not match"

        self.scores = scores[0]  # the first calculation
        self.scores_offset += self.scores.shape[1]

        return scores[1:]

    def pop_data_buf_till_frame(self, frame_idx: int) -> None:  # need check again
        while self.data_buf_start_frame < frame_idx:
            if self.data_buf_size >= int(
                self.vad_opts.frame_in_ms * self.vad_opts.sample_rate / 1000
            ):
                self.data_buf_start_frame += 1
                self.data_buf_size = (
                    self.data_buf_all_size
                    - self.data_buf_start_frame
                    * int(self.vad_opts.frame_in_ms * self.vad_opts.sample_rate / 1000)
                )

    def pop_data_to_output_buf(
        self,
        start_frm: int,
        frm_cnt: int,
        first_frm_is_start_point: bool,
        last_frm_is_end_point: bool,
        end_point_is_sent_end: bool,
    ) -> None:
        self.pop_data_buf_till_frame(start_frm)
        expected_sample_number = int(
            frm_cnt * self.vad_opts.sample_rate * self.vad_opts.frame_in_ms / 1000
        )
        if last_frm_is_end_point:
            extra_sample = max(
                0,
                int(
                    self.vad_opts.frame_length_ms * self.vad_opts.sample_rate / 1000
                    - self.vad_opts.sample_rate * self.vad_opts.frame_in_ms / 1000
                ),
            )
            expected_sample_number += int(extra_sample)
        if end_point_is_sent_end:
            expected_sample_number = max(expected_sample_number, self.data_buf_size)
        if self.data_buf_size < expected_sample_number:
            logging.error("error in calling pop data_buf\n")

        if len(self.output_data_buf) == 0 or first_frm_is_start_point:
            self.output_data_buf.append(E2EVadSpeechBufWithDoa())
            self.output_data_buf[-1].reset()
            self.output_data_buf[-1].start_ms = start_frm * self.vad_opts.frame_in_ms
            self.output_data_buf[-1].end_ms = self.output_data_buf[-1].start_ms
            self.output_data_buf[-1].doa = 0
        cur_seg = self.output_data_buf[-1]
        if cur_seg.end_ms != start_frm * self.vad_opts.frame_in_ms:
            logging.error("warning\n")
        out_pos = len(cur_seg.buffer)  # cur_seg.buffer is not populated in this port
        data_to_pop = 0
        if end_point_is_sent_end:
            data_to_pop = expected_sample_number
        else:
            data_to_pop = int(
                frm_cnt * self.vad_opts.frame_in_ms * self.vad_opts.sample_rate / 1000
            )
        if data_to_pop > self.data_buf_size:
            logging.error("VAD data_to_pop is bigger than self.data_buf.size()!!!\n")
            data_to_pop = self.data_buf_size
            expected_sample_number = self.data_buf_size

        cur_seg.doa = 0
        for sample_cpy_out in range(0, data_to_pop):
            # cur_seg.buffer[out_pos ++] = data_buf_.back();
            out_pos += 1
        for sample_cpy_out in range(data_to_pop, expected_sample_number):
            # cur_seg.buffer[out_pos++] = data_buf_.back()
            out_pos += 1
        if cur_seg.end_ms != start_frm * self.vad_opts.frame_in_ms:
            logging.error("Something wrong with the VAD algorithm\n")
        self.data_buf_start_frame += frm_cnt
        cur_seg.end_ms = (start_frm + frm_cnt) * self.vad_opts.frame_in_ms
        if first_frm_is_start_point:
            cur_seg.contain_seg_start_point = True
        if last_frm_is_end_point:
            cur_seg.contain_seg_end_point = True

    def on_silence_detected(self, valid_frame: int):
        self.lastest_confirmed_silence_frame = valid_frame
        if self.vad_state_machine == VadStateMachine.kVadInStateStartPointNotDetected:
            self.pop_data_buf_till_frame(valid_frame)
        # silence_detected_callback_
        # pass

    def on_voice_detected(self, valid_frame: int) -> None:
        self.latest_confirmed_speech_frame = valid_frame
        self.pop_data_to_output_buf(valid_frame, 1, False, False, False)

    def on_voice_start(self, start_frame: int, fake_result: bool = False) -> None:
        if self.vad_opts.do_start_point_detection:
            pass
        if self.confirmed_start_frame != -1:
            logging.error("not reset vad properly\n")
        else:
            self.confirmed_start_frame = start_frame

        if (
            not fake_result
            and self.vad_state_machine
            == VadStateMachine.kVadInStateStartPointNotDetected
        ):
            self.pop_data_to_output_buf(
                self.confirmed_start_frame, 1, True, False, False
            )

    def on_voice_end(
        self, end_frame: int, fake_result: bool, is_last_frame: bool
    ) -> None:
        for t in range(self.latest_confirmed_speech_frame + 1, end_frame):
            self.on_voice_detected(t)
        if self.vad_opts.do_end_point_detection:
            pass
        if self.confirmed_end_frame != -1:
            logging.error("not reset vad properly\n")
        else:
            self.confirmed_end_frame = end_frame
        if not fake_result:
            self.sil_frame = 0
            self.pop_data_to_output_buf(
                self.confirmed_end_frame, 1, False, True, is_last_frame
            )
        self.number_end_time_detected += 1

    def maybe_on_voice_end_last_frame(
        self, is_final_frame: bool, cur_frm_idx: int
    ) -> None:
        if is_final_frame:
            self.on_voice_end(cur_frm_idx, False, True)
            self.vad_state_machine = VadStateMachine.kVadInStateEndPointDetected

    def get_latency(self) -> int:
        return int(self.latency_frm_num_at_start_point() * self.vad_opts.frame_in_ms)

    def latency_frm_num_at_start_point(self) -> int:
        vad_latency = self.windows_detector.get_win_size()
        if self.vad_opts.do_extend:
            vad_latency += int(
                self.vad_opts.lookback_time_start_point / self.vad_opts.frame_in_ms
            )
        return vad_latency

    def get_frame_state(self, t: int) -> FrameState:
        frame_state = FrameState.kFrameStateInvalid
        cur_decibel = self.decibel[t - self.decibel_offset]
        cur_snr = cur_decibel - self.noise_average_decibel
        # for each frame, calc log posterior probability of each state
        if cur_decibel < self.vad_opts.decibel_thres:
            frame_state = FrameState.kFrameStateSil
            self.detect_one_frame(frame_state, t, False)
            return frame_state

        sum_score = 0.0
        noise_prob = 0.0
        assert len(self.sil_pdf_ids) == self.vad_opts.silence_pdf_num
        if len(self.sil_pdf_ids) > 0:
            assert len(self.scores) == 1  # only batch_size = 1 is supported
            sil_pdf_scores = [
                self.scores[0][t - self.scores_offset][sil_pdf_id]
                for sil_pdf_id in self.sil_pdf_ids
            ]
            sum_score = sum(sil_pdf_scores)
            noise_prob = math.log(sum_score) * self.vad_opts.speech_2_noise_ratio
            total_score = 1.0
            sum_score = total_score - sum_score
        speech_prob = math.log(sum_score)
        if self.vad_opts.output_frame_probs:
            frame_prob = E2EVadFrameProb()
            frame_prob.noise_prob = noise_prob
            frame_prob.speech_prob = speech_prob
            frame_prob.score = sum_score
            frame_prob.frame_id = t
            self.frame_probs.append(frame_prob)
        if math.exp(speech_prob) >= math.exp(noise_prob) + self.speech_noise_thres:
            if (
                cur_snr >= self.vad_opts.snr_thres
                and cur_decibel >= self.vad_opts.decibel_thres
            ):
                frame_state = FrameState.kFrameStateSpeech
            else:
                frame_state = FrameState.kFrameStateSil
        else:
            frame_state = FrameState.kFrameStateSil
            if self.noise_average_decibel < -99.9:
                self.noise_average_decibel = cur_decibel
            else:
                self.noise_average_decibel = (
                    cur_decibel
                    + self.noise_average_decibel
                    * (self.vad_opts.noise_frame_num_used_for_snr - 1)
                ) / self.vad_opts.noise_frame_num_used_for_snr

        return frame_state

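    # Added decision sketch: with the defaults speech_noise_thres=0.6 and
    # speech_2_noise_ratio=1.0, a frame whose silence-pdf scores sum to 0.1
    # yields speech_prob = log(0.9) and noise_prob = log(0.1); since
    # 0.9 >= 0.1 + 0.6, the frame is a speech candidate, still subject to the
    # snr_thres and decibel_thres checks above.
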
    def infer_offline(
        self,
        feats: np.ndarray,
        waveform: np.ndarray,
        in_cache: Dict[str, np.ndarray] = dict(),
        is_final: bool = False,
    ) -> Tuple[List[List[List[int]]], Dict[str, np.ndarray]]:
        self.waveform = waveform
        self.compute_decibel()

        self.compute_scores(feats)
        if not is_final:
            self.detect_common_frames()
        else:
            self.detect_last_frames()
        segments = []
        for batch_num in range(0, feats.shape[0]):  # only support batch_size = 1 now
            segment_batch = []
            if len(self.output_data_buf) > 0:
                for i in range(self.output_data_buf_offset, len(self.output_data_buf)):
                    if (
                        not self.output_data_buf[i].contain_seg_start_point
                        or not self.output_data_buf[i].contain_seg_end_point
                    ):
                        continue
                    segment = [
                        self.output_data_buf[i].start_ms,
                        self.output_data_buf[i].end_ms,
                    ]
                    segment_batch.append(segment)
                    self.output_data_buf_offset += 1  # need update this parameter
            if segment_batch:
                segments.append(segment_batch)

        if is_final:
            # reset class variables and clear the dict for the next query
            self.all_reset_detection()
        return segments, in_cache

    def infer_online(
        self,
        feats: np.ndarray,
        waveform: np.ndarray,
        in_cache: list = None,
        is_final: bool = False,
        max_end_sil: int = 800,
    ) -> Tuple[List[List[List[int]]], Dict[str, np.ndarray]]:
        feats = [feats]
        if in_cache is None:
            in_cache = []

        self.max_end_sil_frame_cnt_thresh = (
            max_end_sil - self.vad_opts.speech_to_sil_time_thres
        )
        self.waveform = waveform  # compute decibel for each frame
        feats.extend(in_cache)
        in_cache = self.compute_scores(feats)
        self.compute_decibel()

        if is_final:
            self.detect_last_frames()
        else:
            self.detect_common_frames()

        segments = []
        # only support batch_size = 1 now
        for batch_num in range(0, feats[0].shape[0]):
            if len(self.output_data_buf) > 0:
                for i in range(self.output_data_buf_offset, len(self.output_data_buf)):
                    if not self.output_data_buf[i].contain_seg_start_point:
                        continue
                    if (
                        not self.next_seg
                        and not self.output_data_buf[i].contain_seg_end_point
                    ):
                        continue
                    start_ms = self.output_data_buf[i].start_ms if self.next_seg else -1
                    if self.output_data_buf[i].contain_seg_end_point:
                        end_ms = self.output_data_buf[i].end_ms
                        self.next_seg = True
                        self.output_data_buf_offset += 1
                    else:
                        end_ms = -1
                        self.next_seg = False
                    segments.append([start_ms, end_ms])

        return segments, in_cache

    def get_frames_state(
        self,
        feats: np.ndarray,
        waveform: np.ndarray,
        in_cache: list = None,
        is_final: bool = False,
        max_end_sil: int = 800,
    ):
        feats = [feats]
        states = []
        if in_cache is None:
            in_cache = []

        self.max_end_sil_frame_cnt_thresh = (
            max_end_sil - self.vad_opts.speech_to_sil_time_thres
        )
        self.waveform = waveform  # compute decibel for each frame
        feats.extend(in_cache)
        in_cache = self.compute_scores(feats)
        self.compute_decibel()

        if self.vad_state_machine == VadStateMachine.kVadInStateEndPointDetected:
            return states

        for i in range(self.vad_opts.nn_eval_block_size - 1, -1, -1):
            frame_state = FrameState.kFrameStateInvalid
            frame_state = self.get_frame_state(self.frm_cnt - 1 - i)
            states.append(frame_state)
            if i == 0 and is_final:
                logging.info("last frame detected")
                self.detect_one_frame(frame_state, self.frm_cnt - 1, True)
            else:
                self.detect_one_frame(frame_state, self.frm_cnt - 1 - i, False)

        return states

    def detect_common_frames(self) -> int:
        if self.vad_state_machine == VadStateMachine.kVadInStateEndPointDetected:
            return 0
        for i in range(self.vad_opts.nn_eval_block_size - 1, -1, -1):
            frame_state = FrameState.kFrameStateInvalid
            frame_state = self.get_frame_state(self.frm_cnt - 1 - i)
            # print(f"cur frame: {self.frm_cnt - 1 - i}, state is {frame_state}")
            self.detect_one_frame(frame_state, self.frm_cnt - 1 - i, False)

        self.decibel = self.decibel[self.vad_opts.nn_eval_block_size - 1 :]
        self.decibel_offset = self.frm_cnt - 1 - i
        return 0

    def detect_last_frames(self) -> int:
        if self.vad_state_machine == VadStateMachine.kVadInStateEndPointDetected:
            return 0
        for i in range(self.vad_opts.nn_eval_block_size - 1, -1, -1):
            frame_state = FrameState.kFrameStateInvalid
            frame_state = self.get_frame_state(self.frm_cnt - 1 - i)
            if i != 0:
                self.detect_one_frame(frame_state, self.frm_cnt - 1 - i, False)
            else:
                self.detect_one_frame(frame_state, self.frm_cnt - 1, True)

        return 0

    def detect_one_frame(
        self, cur_frm_state: FrameState, cur_frm_idx: int, is_final_frame: bool
    ) -> None:
        tmp_cur_frm_state = FrameState.kFrameStateInvalid
        if cur_frm_state == FrameState.kFrameStateSpeech:
            if math.fabs(1.0) > float(self.vad_opts.fe_prior_thres):
                tmp_cur_frm_state = FrameState.kFrameStateSpeech
            else:
                tmp_cur_frm_state = FrameState.kFrameStateSil
        elif cur_frm_state == FrameState.kFrameStateSil:
            tmp_cur_frm_state = FrameState.kFrameStateSil
        state_change = self.windows_detector.detect_one_frame(
            tmp_cur_frm_state, cur_frm_idx
        )
        frm_shift_in_ms = self.vad_opts.frame_in_ms
        if AudioChangeState.kChangeStateSil2Speech == state_change:
            self.continous_silence_frame_count = 0
            self.pre_end_silence_detected = False

            if (
                self.vad_state_machine
                == VadStateMachine.kVadInStateStartPointNotDetected
            ):
                start_frame = max(
                    self.data_buf_start_frame,
                    cur_frm_idx - self.latency_frm_num_at_start_point(),
                )
                self.on_voice_start(start_frame)
                self.vad_state_machine = VadStateMachine.kVadInStateInSpeechSegment
                for t in range(start_frame + 1, cur_frm_idx + 1):
                    self.on_voice_detected(t)
            elif self.vad_state_machine == VadStateMachine.kVadInStateInSpeechSegment:
                for t in range(self.latest_confirmed_speech_frame + 1, cur_frm_idx):
                    self.on_voice_detected(t)
                if (
                    cur_frm_idx - self.confirmed_start_frame + 1
                    > self.vad_opts.max_single_segment_time / frm_shift_in_ms
                ):
                    self.on_voice_end(cur_frm_idx, False, False)
                    self.vad_state_machine = VadStateMachine.kVadInStateEndPointDetected
                elif not is_final_frame:
                    self.on_voice_detected(cur_frm_idx)
                else:
                    self.maybe_on_voice_end_last_frame(is_final_frame, cur_frm_idx)
            else:
                pass
        elif AudioChangeState.kChangeStateSpeech2Sil == state_change:
            self.continous_silence_frame_count = 0
            if (
                self.vad_state_machine
                == VadStateMachine.kVadInStateStartPointNotDetected
            ):
                pass
            elif self.vad_state_machine == VadStateMachine.kVadInStateInSpeechSegment:
                if (
                    cur_frm_idx - self.confirmed_start_frame + 1
                    > self.vad_opts.max_single_segment_time / frm_shift_in_ms
                ):
                    self.on_voice_end(cur_frm_idx, False, False)
                    self.vad_state_machine = VadStateMachine.kVadInStateEndPointDetected
                elif not is_final_frame:
                    self.on_voice_detected(cur_frm_idx)
                else:
                    self.maybe_on_voice_end_last_frame(is_final_frame, cur_frm_idx)
            else:
                pass
        elif AudioChangeState.kChangeStateSpeech2Speech == state_change:
            self.continous_silence_frame_count = 0
            if self.vad_state_machine == VadStateMachine.kVadInStateInSpeechSegment:
                if (
                    cur_frm_idx - self.confirmed_start_frame + 1
                    > self.vad_opts.max_single_segment_time / frm_shift_in_ms
                ):
                    self.max_time_out = True
                    self.on_voice_end(cur_frm_idx, False, False)
                    self.vad_state_machine = VadStateMachine.kVadInStateEndPointDetected
                elif not is_final_frame:
                    self.on_voice_detected(cur_frm_idx)
                else:
                    self.maybe_on_voice_end_last_frame(is_final_frame, cur_frm_idx)
            else:
                pass
        elif AudioChangeState.kChangeStateSil2Sil == state_change:
            self.continous_silence_frame_count += 1
            if (
                self.vad_state_machine
                == VadStateMachine.kVadInStateStartPointNotDetected
            ):
                # silence timeout, return zero length decision
                if (
                    (
                        self.vad_opts.detect_mode
                        == VadDetectMode.kVadSingleUtteranceDetectMode.value
                    )
                    and (
                        self.continous_silence_frame_count * frm_shift_in_ms
                        > self.vad_opts.max_start_silence_time
                    )
                ) or (is_final_frame and self.number_end_time_detected == 0):
                    for t in range(
                        self.lastest_confirmed_silence_frame + 1, cur_frm_idx
                    ):
                        self.on_silence_detected(t)
                    self.on_voice_start(0, True)
                    self.on_voice_end(0, True, False)
                    self.vad_state_machine = VadStateMachine.kVadInStateEndPointDetected
                else:
                    if cur_frm_idx >= self.latency_frm_num_at_start_point():
                        self.on_silence_detected(
                            cur_frm_idx - self.latency_frm_num_at_start_point()
                        )
            elif self.vad_state_machine == VadStateMachine.kVadInStateInSpeechSegment:
                if (
                    self.continous_silence_frame_count * frm_shift_in_ms
                    >= self.max_end_sil_frame_cnt_thresh
                ):
                    lookback_frame = int(
                        self.max_end_sil_frame_cnt_thresh / frm_shift_in_ms
                    )
                    if self.vad_opts.do_extend:
                        lookback_frame -= int(
                            self.vad_opts.lookahead_time_end_point / frm_shift_in_ms
                        )
                        lookback_frame -= 1
                        lookback_frame = max(0, lookback_frame)
                    self.on_voice_end(cur_frm_idx - lookback_frame, False, False)
                    self.vad_state_machine = VadStateMachine.kVadInStateEndPointDetected
                elif (
                    cur_frm_idx - self.confirmed_start_frame + 1
                    > self.vad_opts.max_single_segment_time / frm_shift_in_ms
                ):
                    self.on_voice_end(cur_frm_idx, False, False)
                    self.vad_state_machine = VadStateMachine.kVadInStateEndPointDetected
                elif self.vad_opts.do_extend and not is_final_frame:
                    if self.continous_silence_frame_count <= int(
                        self.vad_opts.lookahead_time_end_point / frm_shift_in_ms
                    ):
                        self.on_voice_detected(cur_frm_idx)
                else:
                    self.maybe_on_voice_end_last_frame(is_final_frame, cur_frm_idx)
            else:
                pass

        if (
            self.vad_state_machine == VadStateMachine.kVadInStateEndPointDetected
            and self.vad_opts.detect_mode
            == VadDetectMode.kVadMutipleUtteranceDetectMode.value
        ):
            self.reset_detection()


class FSMNVad(object):
    def __init__(self, config_dir: str):
        config_dir = Path(config_dir)
        self.config = read_yaml(config_dir / "fsmn-config.yaml")
        self.frontend = WavFrontend(
            cmvn_file=config_dir / "fsmn-am.mvn",
            **self.config["WavFrontend"]["frontend_conf"],
        )
        self.config["FSMN"]["model_path"] = config_dir / "fsmnvad-offline.onnx"

        self.vad = E2EVadModel(
            self.config["FSMN"], self.config["vadPostArgs"], config_dir
        )

    def set_parameters(self, mode):
        pass

    def extract_feature(self, waveform):
        fbank, _ = self.frontend.fbank(waveform)
        feats, feats_len = self.frontend.lfr_cmvn(fbank)
        return feats.astype(np.float32), feats_len

    def is_speech(self, buf, sample_rate=16000):
        assert sample_rate == 16000, "only support 16k sample rate"

    def segments_offline(self, waveform_path: Union[str, Path, np.ndarray]):
        """get segments of audio"""

        if isinstance(waveform_path, np.ndarray):
            waveform = waveform_path
        else:
            if not os.path.exists(waveform_path):
                raise FileNotFoundError(f"{waveform_path} does not exist.")
            if os.path.isfile(waveform_path):
                logging.info(f"load audio {waveform_path}")
                waveform, _sample_rate = sf.read(
                    waveform_path,
                    dtype="float32",
                )
            else:
                raise FileNotFoundError(str(waveform_path))
            assert (
                _sample_rate == 16000
            ), f"only support 16k sample rate, current sample rate is {_sample_rate}"

        feats, feats_len = self.extract_feature(waveform)
        waveform = waveform[None, ...]
        segments_part, in_cache = self.vad.infer_offline(
            feats[None, ...], waveform, is_final=True
        )
        return segments_part[0]

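# Minimal usage sketch (not in the original file; assumes a 16 kHz mono wav and
# the fsmn-config.yaml / fsmn-am.mvn / fsmnvad-offline.onnx files in model_dir):
#
#     vad = FSMNVad("/path/to/model_dir")
#     segments = vad.segments_offline("english.wav")
#     # -> a list of [start_ms, end_ms] pairs, e.g. [[320, 5440], ...]
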
# ```

# File: sense_voice.py
# ```py
# -*- coding:utf-8 -*-
# @FileName :sense_voice.py
# @Time :2024/7/18 15:40
# @Author :lovemefan
# @Email :[email protected]

languages = {"auto": 0, "zh": 3, "en": 4, "yue": 7, "ja": 11, "ko": 12, "nospeech": 13}
formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
logging.basicConfig(format=formatter, level=logging.INFO)


def main():
    arg_parser = argparse.ArgumentParser(description="Sense Voice")
    arg_parser.add_argument(
        "-a", "--audio_file", required=True, type=str, help="Path of the audio file"
    )
    download_model_path = os.path.dirname(__file__)
    arg_parser.add_argument(
        "-dp",
        "--download_path",
        default=download_model_path,
        type=str,
        help="dir path of the downloaded resources",
    )
    arg_parser.add_argument("-d", "--device", default=-1, type=int, help="Device")
    arg_parser.add_argument(
        "-n", "--num_threads", default=4, type=int, help="Num threads"
    )
    arg_parser.add_argument(
        "-l",
        "--language",
        choices=languages.keys(),
        default="auto",
        type=str,
        help="Language",
    )
    arg_parser.add_argument("--use_itn", action="store_true", help="Use ITN")
    args = arg_parser.parse_args()
    # honor --download_path instead of silently ignoring it
    download_model_path = args.download_path

    front = WavFrontend(os.path.join(download_model_path, "am.mvn"))

    model = SenseVoiceInferenceSession(
        os.path.join(download_model_path, "embedding.npy"),
        os.path.join(
            download_model_path,
            "sense-voice-encoder.rknn",
        ),
        os.path.join(download_model_path, "chn_jpn_yue_eng_ko_spectok.bpe.model"),
        args.device,
        args.num_threads,
    )
    waveform, _sample_rate = sf.read(
        args.audio_file,
        dtype="float32",
        always_2d=True,
    )

    logging.info(
        f"Audio {args.audio_file} is {len(waveform) / _sample_rate} seconds, "
        f"{waveform.shape[1]} channel(s)"
    )
    # load vad model
    start = time.time()
    vad = FSMNVad(download_model_path)
    for channel_id, channel_data in enumerate(waveform.T):
        segments = vad.segments_offline(channel_data)
        results = ""
        for part in segments:
            # segment boundaries are in milliseconds; 16 samples per ms at 16 kHz
            audio_feats = front.get_features(channel_data[part[0] * 16 : part[1] * 16])
            asr_result = model(
                audio_feats[None, ...],
                language=languages[args.language],
                use_itn=args.use_itn,
            )
            logging.info(
                f"[Channel {channel_id}] [{part[0] / 1000}s - {part[1] / 1000}s] {asr_result}"
            )
        vad.vad.all_reset_detection()
    decoding_time = time.time() - start
    logging.info(f"Decoding audio takes {decoding_time} seconds")
    logging.info(
        f"The RTF is {decoding_time / (waveform.shape[1] * len(waveform) / _sample_rate)}."
    )


if __name__ == "__main__":
    main()
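# Example invocation (assuming the bundled models sit next to this script,
# which this repo ships as sensevoice_rknn.py):
#
#     python sensevoice_rknn.py -a english.wav -l en --use_itn
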
wget-log
ADDED
@@ -0,0 +1,6 @@
+--2025-06-18 23:24:33--  https://storage.googleapis.com/kagglesdsdata/datasets/829978/1417968/harvard.wav?X-Goog-Algorithm=GOOG4-RSA-SHA256
+Resolving storage.googleapis.com (storage.googleapis.com)... 2607:f8b0:4008:805::201b, 2607:f8b0:4008:80e::201b, 2607:f8b0:4008:804::201b, ...
+Connecting to storage.googleapis.com (storage.googleapis.com)|2607:f8b0:4008:805::201b|:443... connected.
+HTTP request sent, awaiting response... 400 Bad Request
+2025-06-18 23:24:33 ERROR 400: Bad Request.
+