BramVanroy's Collections

CommonCrawl-Creative Commons (C5)

updated 24 days ago

Raw CommonCrawl crawls, annotated with Creative Commons license information

  • BramVanroy/CommonCrawl-CreativeCommons

    Viewer • Updated 11 days ago • 739M • 1.08k • 31

  • BramVanroy/CommonCrawl-CreativeCommons-fine

    Viewer • Updated 11 days ago • 75.1M • 1.22k • 1

    Note: Retains only samples that are also present in FineWeb or FineWeb-2.


  • BramVanroy/CommonCrawl-CreativeCommons-strict

    Viewer • Updated 11 days ago • 32.8M • 771 • 1

    Note: Applies strong filters: retains only FineWeb data and removes non-commercial data and Wiki data.
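
The two notes above describe successive filters over the base dataset. A minimal sketch of that logic on in-memory records might look like the following. It is an illustration only, not the author's pipeline: the `Sample` fields (`url`, `license_abbr`, `in_fineweb`, `in_fineweb2`), the "nc" license check, and the URL-based Wiki check are all hypothetical and do not reflect the datasets' actual schema.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # All field names are hypothetical, chosen for illustration only.
    url: str
    license_abbr: str  # e.g. "cc-by" or "cc-by-nc-sa"
    in_fineweb: bool   # sample also appears in FineWeb
    in_fineweb2: bool  # sample also appears in FineWeb-2


def keep_fine(sample: Sample) -> bool:
    """'fine' variant: keep samples present in FineWeb or FineWeb-2."""
    return sample.in_fineweb or sample.in_fineweb2


def keep_strict(sample: Sample) -> bool:
    """'strict' variant: FineWeb only, no non-commercial, no Wiki data."""
    # Assumption: a non-commercial license carries an "nc" component.
    is_noncommercial = "nc" in sample.license_abbr.lower().split("-")
    # Assumption: Wiki data can be recognised from the URL.
    is_wiki = "wiki" in sample.url.lower()
    return sample.in_fineweb and not is_noncommercial and not is_wiki


samples = [
    Sample("https://example.org/post", "cc-by", True, False),
    Sample("https://example.org/essay", "cc-by-nc-sa", True, False),
    Sample("https://en.wikipedia.org/wiki/Goat", "cc-by-sa", True, False),
    Sample("https://example.nl/artikel", "cc-by", False, True),
    Sample("https://example.com/page", "cc-by", False, False),
]

fine = [s for s in samples if keep_fine(s)]
strict = [s for s in samples if keep_strict(s)]
```

Under these assumptions, each variant is a subset of the previous one, which matches the shrinking row counts listed above (739M, 75.1M, 32.8M).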
