Article: Scaling AI-based Data Processing with Hugging Face + Dask • By scj13 and 3 others • Oct 9, 2024 • 31
Collection: CC-domain-counts • Dumps of aggregate URL counts by domain from Common Crawl snapshots • 96 items • Updated Jan 15 • 1
Collection: LLM-training-URLs • Lists of URLs from various training datasets • 3 items • Updated Dec 21, 2024 • 1