stereoplegic's Collections: Dataset curation
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
Paper
• 2308.12032
• Published
• 2
Know thy corpus! Robust methods for digital curation of Web corpora
Paper
• 2003.06389
• Published
• 1
Self-Alignment with Instruction Backtranslation
Paper
• 2308.06259
• Published
• 43
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Paper
• 2305.06156
• Published
• 2
End-to-end Knowledge Retrieval with Multi-modal Queries
Paper
• 2306.00424
• Published
• 1
SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning
Paper
• 2305.15486
• Published
• 1
Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression
Paper
• 2306.15063
• Published
• 1
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework
Paper
• 2111.04130
• Published
• 1
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Paper
• 2311.12537
• Published
• 1
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Paper
• 2403.00231
• Published
• 2
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Paper
• 2402.03300
• Published
• 140
Automated Data Curation for Robust Language Model Fine-Tuning
Paper
• 2403.12776
• Published
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Paper
• 2404.14361
• Published
• 2
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Paper
• 2405.15613
• Published
• 17
SemCoder: Training Code Language Models with Comprehensive Semantics
Paper
• 2406.01006
• Published
• 1
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Paper
• 2305.12182
• Published
• 1