Polyglot
AI & ML interests
Developing foundation models for low-resource languages.
Polyglot is an initiative to close the linguistic divide in NLP by developing efficient and accessible foundation models for low-resource languages.
While recent breakthroughs in generative AI have been driven by large-scale foundation models, these advances have largely benefited high-resource languages, leaving many underrepresented languages behind. The current deep learning paradigm, heavily reliant on massive datasets and computing power, has unintentionally widened this gap, making it harder for speakers of low-resource languages to access and shape AI technologies that reflect their linguistic and cultural identities.
Polyglot addresses this imbalance by creating tools, models, and datasets that support open, sustainable, and inclusive AI development. We aim to empower researchers and communities working with low-resource languages through high-quality open-source resources, enabling them to build and fine-tune language models tailored to their needs.
Recent Publications
- ViTucano: A Portuguese Vision Assistant | GitHub | Collection
- Tucano: Advancing Neural Text Generation for Portuguese | GitHub | Collection | Paper
- TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese | GitHub | Collection | Paper
News
- [13/01/2025] We release ViTucano, a pair of vision assistants natively pretrained in Portuguese (ViTucano-1b5-v1, ViTucano-2b8-v1).
- [13/01/2025] We release the datasets used to pretrain and fine-tune the ViTucano models: ViTucano-Pretrain and ViTucano-SFT.
- [29/11/2024] Tucano is mentioned on Deutsche Welle: "Cientistas criam maior banco de dados em português para IA".
- [27/11/2024] Tucano video presentation at the C4AI (USP) [available on YouTube].
- [12/11/2024] "Tucano: Advancing Neural Text Generation for Portuguese" is published as a preprint on arXiv, with all models and datasets released on Hugging Face.
Community Contributions
- Demo on how to run inference on ViTucano.
- Demo on how to run inference on Tucano.
- Demo on how to create a simple Chat UI for Tucano using Gradio.
- Tucano OpenVINO is a port of Tucano-2b4-Instruct optimized for inference with Intel OpenVINO.
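For readers who want a quick starting point before opening the demos above, here is a minimal sketch of text generation with Tucano through the Hugging Face `transformers` pipeline. It is not the code from the community demos. The helper names `sampling_kwargs` and `generate` are hypothetical, and the sampling values are illustrative assumptions rather than settings recommended by the authors. The sketch also assumes that `transformers` and a backend such as PyTorch are installed, and that the `TucanoBR/Tucano-2b4` checkpoint loads through the standard `text-generation` pipeline.

```python
def sampling_kwargs(max_new_tokens=100, temperature=0.7, top_p=0.9):
    """Build generation settings (illustrative values, not official ones)."""
    return {
        "max_new_tokens": max_new_tokens,  # cap on the number of generated tokens
        "do_sample": True,                 # sample instead of greedy decoding
        "temperature": temperature,
        "top_p": top_p,
    }


def generate(prompt, model_id="TucanoBR/Tucano-2b4", **overrides):
    """Generate a continuation for a Portuguese prompt.

    Hypothetical helper: the import is done lazily so that the settings
    helper above can be used without transformers being installed.
    """
    from transformers import pipeline

    # Downloads the checkpoint from the Hugging Face Hub on first use.
    generator = pipeline("text-generation", model=model_id)
    outputs = generator(prompt, **sampling_kwargs(**overrides))
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]
```

Calling `generate("A capital do Brasil é", max_new_tokens=30)` would download the model and return the prompt followed by a sampled continuation. Because sampling is enabled, the output differs between runs.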
Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.
- Tucano: Advancing Neural Text Generation for Portuguese (Paper 2411.07854)
- TucanoBR/Tucano-2b4 (Text Generation, 2B)
- TucanoBR/Tucano-2b4-Instruct (Text Generation, 2B)
- TucanoBR/Tucano-1b1 (Text Generation, 1B)