Update README.md
Add Granary arxiv link
README.md
CHANGED
@@ -300,7 +300,7 @@ Training was conducted using this [example script](https://github.com/NVIDIA/NeM
 The tokenizer was constructed from the training set transcripts using this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
 ### <span style="color:#466f00;">Training Dataset</span>
-The model was trained on the Granary dataset, consisting of approximately 120,000 hours of English speech data:
+The model was trained on the Granary dataset[8], consisting of approximately 120,000 hours of English speech data:
 
 - 10,000 hours from human-transcribed NeMo ASR Set 3.0, including:
 - LibriSpeech (960 hours)
@@ -318,7 +318,7 @@ The model was trained on the Granary dataset, consisting of approximately 120,00
 - YODAS dataset [5]
 - Librilight [7]
 
-All transcriptions preserve punctuation and capitalization. The Granary dataset will be made publicly available after presentation at Interspeech 2025.
+All transcriptions preserve punctuation and capitalization. The Granary dataset[8] will be made publicly available after presentation at Interspeech 2025.
 
 **Data Collection Method by dataset**
 
@@ -398,6 +398,8 @@ These WER scores were obtained using greedy decoding without an external languag
 
 [7] [MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages](https://arxiv.org/abs/2410.01036)
 
+[8] [Granary: Speech Recognition and Translation Dataset in 25 European Languages](https://arxiv.org/pdf/2505.13404)
+
 ## <span style="color:#466f00;">Inference:</span>
 
 **Engine**: