Update Readme (#14)
Co-authored-by: Jagadeesh Balam <[email protected]>
README.md
CHANGED
@@ -273,6 +273,15 @@ img {
NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieve state-of-the-art performance on multiple speech benchmarks. With 883 million parameters and an inference speed of more than 1000 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in four languages (English, German, French, Spanish), as well as translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). Additionally, canary-1b-flash offers an experimental feature for word-level and segment-level timestamps in English, German, French, and Spanish.
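To make these tasks concrete, here is a minimal usage sketch, assuming NeMo's `EncDecMultiTaskModel` loading path and the `transcribe()` keyword arguments of recent NeMo releases; argument names vary between versions, so treat the details as illustrative rather than as this card's official recipe:

```python
# Minimal sketch: ASR and English->German translation with canary-1b-flash.
# Assumes a recent NeMo release; keyword names differ across versions.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")

# English transcription with punctuation and capitalization (PnC).
asr_out = model.transcribe(["sample_en.wav"], batch_size=1, pnc="yes")

# English -> German translation of the same audio.
ast_out = model.transcribe(
    ["sample_en.wav"],
    source_lang="en",  # language spoken in the audio
    target_lang="de",  # language to generate
)

# Recent NeMo versions return Hypothesis objects with a .text field.
print(asr_out[0].text)
print(ast_out[0].text)
```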
This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
## Discover more from NVIDIA:
For documentation, deployment guides, enterprise-ready APIs, and the latest open models, including Nemotron and other cutting-edge speech, translation, and generative AI, visit the NVIDIA Developer Portal at [developer.nvidia.com](https://developer.nvidia.com).

Join the community to access tools, support, and resources to accelerate your development with NVIDIA’s NeMo, Riva, NIM, and foundation models.

### Explore more from NVIDIA:
What is [Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/)?<br>
NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)<br>
[NVIDIA Riva Speech](https://developer.nvidia.com/riva?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.riva%3Adesc%2Ctitle%3Aasc#demos)<br>
[NeMo Documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html)<br>
## Model Architecture:
Canary is an encoder-decoder model with a FastConformer [3] Encoder and a Transformer Decoder [4]. Audio features extracted by the encoder, together with task tokens such as \<target language\>, \<task\>, \<toggle timestamps\>, and \<toggle PnC\>, are fed into the Transformer Decoder to trigger text generation. Canary uses a concatenated tokenizer [5] built from the individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, for a total of 883M parameters. For more details about the architecture, please refer to [1].
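To illustrate the prompting mechanism, the sketch below assembles a hypothetical decoder prompt from task tokens; the token names and ordering are assumptions for illustration, not the model's actual vocabulary:

```python
# Illustrative only: how Canary-style task tokens could form a decoder
# prompt. Token names and ordering here are assumptions, not the model's
# actual vocabulary.
def build_decoder_prompt(source_lang: str, target_lang: str,
                         timestamps: bool, pnc: bool) -> list[str]:
    # Same source and target language implies plain transcription (ASR);
    # differing languages imply speech translation (AST).
    task = "transcribe" if source_lang == target_lang else "translate"
    return [
        "<|startofcontext|>",  # begin-of-decoding token (assumed name)
        f"<|{source_lang}|>",  # source language token
        f"<|{task}|>",         # task token
        f"<|{target_lang}|>",  # target language token
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
        "<|pnc|>" if pnc else "<|nopnc|>",
    ]

# English -> German translation, no timestamps, with PnC:
print(build_decoder_prompt("en", "de", timestamps=False, pnc=True))
```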
@@ -616,8 +625,6 @@ As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversat
(Error rates for fairness evaluation are determined by normalizing both the reference and predicted text, similar to the methods used in the evaluations found at https://github.com/huggingface/open_asr_leaderboard.)
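As a rough sketch of that normalization step, the example below uses the Whisper `EnglishTextNormalizer` (from the `openai-whisper` package) that the open_asr_leaderboard evaluations build on, together with the common `jiwer` WER library; the leaderboard's exact pipeline may differ:

```python
# Minimal sketch: normalize reference and prediction before scoring so that
# case, punctuation, and number formatting do not count as errors.
# Assumes `pip install openai-whisper jiwer`.
from whisper.normalizers import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()

reference = "Mr. Brown paid five dollars on June 3rd."
prediction = "mister brown paid $5 on june third"

# Both sides are normalized with the same rules, then scored with WER.
wer = jiwer.wer(normalizer(reference), normalizer(prediction))
print(f"WER after normalization: {wer:.3f}")
```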
## License/Terms of Use:
canary-1b-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>
## References:
[1] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/abs/2503.05931)