llm-urls-neurips
nhagar/fineweb_urls • Updated May 15 • 24.5B • 262
nhagar/fineweb-edu_urls • Updated May 15 • 1.43B • 24
nhagar/fineweb-2_urls • Updated May 15 • 4.57B • 30
nhagar/onlysports_dataset_urls • Updated May 15 • 864M • 7

CC-domain-counts
Dumps of aggregate URL counts by domain from Common Crawl snapshots
nhagar/CC_MAIN_2017_47_urls • Updated May 15 • 75.8M • 4 • 1
nhagar/CC_MAIN_2024_18_urls • Updated May 15 • 64.1M • 5
nhagar/CC-MAIN-2021-17_urls • Updated May 15 • 55.9M • 3
nhagar/CC-MAIN-2016-40_urls • Updated May 15 • 55.6M • 3

LLM-training-URLs
Lists of URLs from various training datasets
nhagar/falcon_urls • Updated Dec 19, 2024 • 968M • 164 • 1
nhagar/c4_en_urls • Updated Dec 2, 2024 • 365M • 4 • 2
nhagar/cultura_urls • Updated Dec 21, 2024 • 7.18B • 73 • 1