Article Ulysses Sequence Parallelism: Training with Million-Token Contexts • 5 days ago • 17
Article FlashHead: Accelerating Language Model Inference ~ *Efficient drop-in replacement for the classification head* • 2 days ago • 1
Nemotron-Pre-Training-Datasets Collection Large-scale pre-training datasets used in the Nemotron family of models. • 12 items • Updated 2 days ago • 117
Lost in Backpropagation: The LM Head is a Gradient Bottleneck Paper • 2603.10145 • Published 3 days ago • 5
NVIDIA Nemotron v3 Collection Open, Production-ready Enterprise Models • 12 items • Updated 2 days ago • 194
MixtureVitae study models and datasets Collection Collection of models and datasets related to MixtureVitae, an open and fully reproducible pretraining dataset built from permissive sources • 16 items • Updated 29 days ago • 1
Article Scaling Pedagogical Pre-training: From Optimal Mixing to 10 Billion Tokens • 8 days ago • 4
🤏 Smol-Data Collection Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing • 14 items • Updated 11 days ago • 12
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets Paper • 2602.22207 • Published 16 days ago • 42
Article Do Bubbles Form When Tens of Thousands of AIs Simulate Capitalism? • 18 days ago • 17
The Million-Label NER: Breaking Scale Barriers with GLiNER bi-encoder Paper • 2602.18487 • Published about 1 month ago • 5
Avey B1 experimental Collection Experimental pre-trained checkpoints for Avey-B1 • 3 items • Updated 19 days ago • 3
jina-embeddings-v5-text: Task-Targeted Embedding Distillation Paper • 2602.15547 • Published 25 days ago • 26
Aya Datasets Collection The Aya Collection is a massive multilingual collection spanning over 100 languages, consisting of 513 million instances of prompts and completions. • 5 items • Updated Jul 31, 2025 • 27
LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules Paper • 2602.10993 • Published about 1 month ago • 1
Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning Paper • 2602.11149 • Published about 1 month ago • 15
SteuerLLM: Local specialized large language model for German tax law analysis Paper • 2602.11081 • Published about 1 month ago • 1