Tucano - a TucanoBR Collection

Updated May 31

Tucano is a series of decoder-only transformer models based on the Llama 2 architecture and natively pre-trained in Portuguese.

Tucano: Advancing Neural Text Generation for Portuguese

Paper • 2411.07854 • Published Nov 12, 2024 • 6
TucanoBR/Tucano-2b4

Text Generation • 2B • Updated Jan 15 • 1.35k • 4

Note 2.4 billion-parameter version of the Tucano series.
TucanoBR/Tucano-2b4-Instruct

Text Generation • 2B • Updated Jan 15 • 1.53k • 2

Note 2.4 billion-parameter version of the Tucano series, fine-tuned on the TucanoBR/Tucano-SFT dataset.
TucanoBR/Tucano-1b1

Text Generation • 1B • Updated Jan 15 • 1.05k • 2

Note 1.1 billion-parameter version of the Tucano series.
TucanoBR/Tucano-1b1-Instruct

Text Generation • 1B • Updated Jan 15 • 129 • 1

Note 1.1 billion-parameter version of the Tucano series, fine-tuned on the TucanoBR/Tucano-SFT dataset.
TucanoBR/Tucano-630m

Text Generation • 0.6B • Updated Jan 15 • 63 • 2

Note 630 million-parameter version of the Tucano series.
TucanoBR/Tucano-160m

Text Generation • 0.2B • Updated Jan 15 • 1.3k • 2

Note 160 million-parameter version of the Tucano series.
TucanoBR/BERTimbau-large-text-filter

Text Classification • 0.3B • Updated Nov 13, 2024 • 6

Note BERTimbau-large fine-tuned on the TucanoBR/GigaVerbo-Text-Filter dataset.
TucanoBR/BERTimbau-base-text-filter

Text Classification • 0.1B • Updated Nov 13, 2024 • 10

Note BERTimbau-base fine-tuned on the TucanoBR/GigaVerbo-Text-Filter dataset.
TucanoBR/XGBClassifier-text-filter

Updated Nov 13, 2024

Note XGBClassifier trained on the TucanoBR/GigaVerbo-Text-Filter dataset (requires the embeddings generated by sentence-transformers/LaBSE).
TucanoBR/XGBRegressor-text-filter

Updated Nov 13, 2024

Note XGBRegressor trained on the TucanoBR/GigaVerbo-Text-Filter dataset (requires the embeddings generated by sentence-transformers/LaBSE).
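The two XGBoost notes above say the models require embeddings generated by sentence-transformers/LaBSE. A minimal sketch of that two-step flow (embed, then predict) might look like the following. The helper names `predict_quality` and `load_labse_and_xgb` are hypothetical, and `model_path` is assumed to be a local file you have already obtained in XGBoost's own save format; the collection does not state the file name or what the predicted labels and scores mean, so check the model cards before relying on the outputs.

```python
from typing import Any, List, Sequence


def predict_quality(texts: Sequence[str], encoder: Any, model: Any) -> List[float]:
    """Embed texts with `encoder.encode`, then pass the embeddings to `model.predict`."""
    embeddings = encoder.encode(list(texts))
    return [float(p) for p in model.predict(embeddings)]


def load_labse_and_xgb(model_path: str, regressor: bool = False):
    """Build the LaBSE encoder and an XGBoost model (downloads LaBSE on first use).

    `model_path` is a hypothetical local path; the real file name is not
    given in the collection and must be taken from the model repository.
    """
    # Imported lazily so the helper above can be used without these packages.
    from sentence_transformers import SentenceTransformer
    import xgboost as xgb

    encoder = SentenceTransformer("sentence-transformers/LaBSE")
    model = xgb.XGBRegressor() if regressor else xgb.XGBClassifier()
    model.load_model(model_path)
    return encoder, model


# Example usage (needs network access and a downloaded model file):
# encoder, model = load_labse_and_xgb("path/to/model-file", regressor=True)
# print(predict_quality(["Um texto em português."], encoder, model))
```

Keeping the encoder and the model as arguments means the same helper works for both the classifier and the regressor.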
TucanoBR/GigaVerbo

Viewer • Updated Nov 13, 2024 • 145M • 1.28k • 20
Note GigaVerbo is a 780 GB Portuguese text dataset containing over 200 billion tokens, built by concatenating several datasets available on Hugging Face.
TucanoBR/GigaVerbo-Text-Filter

Viewer • Updated Nov 13, 2024 • 110k • 119 • 1

Note GigaVerbo Text-Filter is a dataset with 110,000 randomly selected samples from 9 subsets of GigaVerbo, all scored by GPT-4o.
TucanoBR/Tucano-SFT

Viewer • Updated Nov 13, 2024 • 680k • 135 • 1

Note This is the dataset used to train the "Instruct" versions of the Tucano series.
TucanoBR/lambada-pt

Viewer • Updated Nov 7, 2024 • 5.15k • 16 • 2

Note This dataset is a Portuguese translation of the LAMBADA test split as pre-processed by OpenAI.
TucanoBR/alpaca-eval-pt

Viewer • Updated Nov 11, 2024 • 805 • 14

Note This dataset contains 805 samples from the Alpaca dataset, translated into Portuguese.
nicholasKluge/reward-aira-dataset

Viewer • Updated Jun 18, 2024 • 70k • 64 • 3

Note This dataset contains pairs of completions to prompts and is used for DPO fine-tuning.
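The Tucano checkpoints in this collection are tagged as text-generation models, so they can presumably be loaded with the standard Hugging Face `transformers` causal-LM classes. The sketch below is an illustration under that assumption, not the authors' reference code. The parameter counts come from the notes in this collection; the helper names `pick_checkpoint` and `generate` are hypothetical, and the Instruct variants may expect a chat-formatted prompt that this plain-prompt sketch does not apply.

```python
# Parameter counts as stated in the collection notes.
BASE_MODELS = {
    "TucanoBR/Tucano-160m": 160_000_000,
    "TucanoBR/Tucano-630m": 630_000_000,
    "TucanoBR/Tucano-1b1": 1_100_000_000,
    "TucanoBR/Tucano-2b4": 2_400_000_000,
}
INSTRUCT_MODELS = {
    "TucanoBR/Tucano-1b1-Instruct": 1_100_000_000,
    "TucanoBR/Tucano-2b4-Instruct": 2_400_000_000,
}


def pick_checkpoint(max_params: int, instruct: bool = False) -> str:
    """Return the largest listed checkpoint with at most `max_params` parameters."""
    pool = INSTRUCT_MODELS if instruct else BASE_MODELS
    fitting = {repo: n for repo, n in pool.items() if n <= max_params}
    if not fitting:
        raise ValueError("no listed checkpoint fits the parameter budget")
    return max(fitting, key=fitting.get)


def generate(repo_id: str, prompt: str, max_new_tokens: int = 50) -> str:
    """Download a checkpoint from the Hub and continue `prompt` (needs torch)."""
    # Imported lazily so pick_checkpoint works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads model weights on first call):
# repo = pick_checkpoint(1_000_000_000)
# print(generate(repo, "A floresta da Amazônia é conhecida por sua"))
```

Choosing the checkpoint by parameter budget is only a convenience for this sketch; any repository ID from the collection can be passed to `generate` directly.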
