Embeddings datasets ⚡️
This collection gathers datasets for embeddings pre-training and fine-tuning.
lightonai/embeddings-pre-training • Viewer • Updated 1 day ago • 694M • 3.47k • 9
lightonai/nanobeir-multilingual • Viewer • Updated 3 days ago • 522k • 83 • 2

ModernBERT
Bringing BERT into modernity via both architecture changes and scaling.
answerdotai/ModernBERT-base • Fill-Mask • 0.1B • Updated Jan 15 • 1.45M • 933
lightonai/GTE-ModernColBERT-v1 • Sentence Similarity • 0.1B • Updated 9 days ago • 14.8k • 139
lightonai/Reason-ModernColBERT • Sentence Similarity • 0.1B • Updated 9 days ago • 2.91k • 204
lightonai/modernbert-embed-large • Sentence Similarity • 0.4B • Updated May 14 • 2.48k • 26

RITA 🧿
A suite of autoregressive generative models for protein sequences, with up to 1.2B parameters, trained on over 280 million protein sequences.
lightonai/RITA_s • Text Generation • 0.1B • Updated Nov 13, 2024 • 67 • 3
lightonai/RITA_m • Text Generation • 0.3B • Updated Jan 6 • 10
lightonai/RITA_l • Text Generation • Updated May 19, 2022 • 9
lightonai/RITA_xl • Text Generation • 1B • Updated Dec 10, 2024 • 169 • 3

ArabicWeb24-ablation-models
900M-parameter models trained on 25B tokens to compare different data processing choices (filtering, sentence deduplication, MinHash, etc.).
lightonai/ArabicWeb24-ablation-model-v1 • Text Generation • Updated Aug 19, 2024 • 10
lightonai/ArabicWeb24-ablation-model-v5 • Text Generation • Updated Aug 19, 2024 • 7

PyLate 🐕
lightonai/Reason-ModernColBERT • Sentence Similarity • 0.1B • Updated 9 days ago • 2.91k • 204
lightonai/GTE-ModernColBERT-v1 • Sentence Similarity • 0.1B • Updated 9 days ago • 14.8k • 139
lightonai/answerai-colbert-small-v1 • Sentence Similarity • 0.0B • Updated Jun 30 • 238 • 3
lightonai/colbertv2.0 • Sentence Similarity • 0.1B • Updated Feb 10 • 4.34k • 4

PAGnol 🇫🇷
French language models. These models were trained in early 2021, following the scaling laws of the time and using the exact same training data as CamemBERT.
lightonai/pagnol-small • Text Generation • Updated Mar 21, 2024 • 11 • 1
lightonai/pagnol-medium • Text Generation • 0.4B • Updated Jan 6 • 9 • 1
lightonai/pagnol-large • Text Generation • Updated Mar 24, 2024 • 7 • 1
lightonai/pagnol-xl • Text Generation • 2B • Updated Nov 7, 2024 • 21 • 1