An LLM pre-training dataset containing only public domain and openly licensed text
Nikhil Kandpal
nkandpa2
Recent Activity
updated a dataset 24 minutes ago: common-pile/wikiteam_filtered
updated a dataset about 2 hours ago: common-pile/wikiteam
updated a dataset about 20 hours ago: common-pile/wikimedia
Collections 1
Papers 1
models 7
nkandpa2/comma-v0.1-checkpoints • 154
nkandpa2/comma-v0.1-stage2 • 1
nkandpa2/comma-v0.1-stage1 • 1
nkandpa2/comma-v0.1-checkpoint-hf • 14
nkandpa2/comma-v0.1-ablation-hf • 1
nkandpa2/comma-loss-test • Text Generation • 4
nkandpa2/Llama_3.2_1B__alpaca_finetune • 2
datasets 10
nkandpa2/cccc_all_domains • 6 • 13
nkandpa2/kl3m_fw_ablation • 12.8M • 227
nkandpa2/commoncorpus_fw_ablation • 14.2M • 194
nkandpa2/commoncorpus_en_code_fw_ablation • 15M • 115
nkandpa2/common-pile-filtered • 1.43B • 12.2k • 1
nkandpa2/mediawiki-dolma • 77.1M • 48
nkandpa2/wiki-dolma • 78.1M • 389
nkandpa2/usgpo • 473k • 7
nkandpa2/qa_entities • 190k • 35 • 1
nkandpa2/pretraining_entities • 60 • 4