fang
house111222333
Recent Activity
replied to BramVanroy's post 26 days ago
By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:
- C5f (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, samples with non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get those from a more reliable source that provides better-parsed content.
It goes without saying that these filters lead to a massive reduction in quantity. Document and token counts are given on the dataset pages.
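To make the C5r criteria above concrete, here is a minimal sketch of that kind of filter in plain Python. It is not the author's actual pipeline. The field names (`url`, `license_abbr`, `license_disagreement`) and the helper names are assumptions chosen for illustration, so check the dataset pages for the real schema.

```python
from typing import Iterable, Iterator
from urllib.parse import urlparse


def is_recommended(sample: dict) -> bool:
    """Hypothetical C5r-style check; all field names are assumptions."""
    # 1. Drop samples where the license signals found on the page disagree.
    if sample.get("license_disagreement", False):
        return False

    # 2. Drop non-commercial licenses, e.g. "cc-by-nc" or "cc-by-nc-sa".
    license_parts = sample.get("license_abbr", "").lower().split("-")
    if "nc" in license_parts:
        return False

    # 3. Drop Wikipedia pages (any language subdomain).
    host = (urlparse(sample.get("url", "")).hostname or "").lower()
    if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
        return False

    return True


def filter_recommended(samples: Iterable[dict]) -> Iterator[dict]:
    """Lazily yield only the samples that pass all three checks."""
    return (sample for sample in samples if is_recommended(sample))
```

In practice you would load the C5r subset directly instead of re-implementing the filter. The sketch only shows why each rule removes samples, and so why the counts shrink so much.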