nithinraok jbalam-nv committed on
Commit 2e52411 · verified · 1 Parent(s): 25f4b03

Update Readme (#14)


- Update Readme (575ad2f58fa0fbb43578c1d507f53cb1e35d8600)


Co-authored-by: Jagadeesh Balam <[email protected]>

Files changed (1)
  1. README.md +9 -2
README.md CHANGED
@@ -273,6 +273,15 @@ img {
  NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on Canary architecture [2] that achieve state-of-the-art performance on multiple speech benchmarks. With 883 million parameters and an inference speed of more than 1000 RTFx (on open-asr-leaderboard datasets), canary-1b-flash supports automatic speech-to-text recognition (ASR) in four languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC). Additionally, canary-1b-flash offers an experimental feature for word-level and segment-level timestamps in English, German, French, and Spanish.
  This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
 
+ ## Discover more from NVIDIA:
+ For documentation, deployment guides, enterprise-ready APIs, and the latest open models—including Nemotron and other cutting-edge speech, translation, and generative AI—visit the NVIDIA Developer Portal at [developer.nvidia.com](developer.nvidia.com).
+ Join the community to access tools, support, and resources to accelerate your development with NVIDIA’s NeMo, Riva, NIM, and foundation models.<br>
+
+ ### Explore more from NVIDIA: <br>
+ What is [Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/)?<br>
+ NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)<br>
+ [NVIDIA Riva Speech](https://developer.nvidia.com/riva?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.riva%3Adesc%2Ctitle%3Aasc#demos)<br>
+ [NeMo Documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html)<br>
 
  ## Model Architecture:
  Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-1b-flash model has 32 encoder layers and 4 decoder layers, leading to a total of 883M parameters. For more details about the architecture, please refer to [1].
@@ -616,8 +625,6 @@ As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversat
 
  (Error rates for fairness evaluation are determined by normalizing both the reference and predicted text, similar to the methods used in the evaluations found at https://github.com/huggingface/open_asr_leaderboard.)
 
- ## License/Terms of Use:
- canary-1b-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>
 
  ## References:
  [1] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/abs/2503.05931)