fang

house111222333

AI & ML interests

None yet

Recent Activity

replied to BramVanroy's post 26 days ago

Due to popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get those from a more reliable source that provides better parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Document and token counts are given on the dataset pages.
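For readers who want a feel for what the C5r-style filtering amounts to, here is a minimal sketch over in-memory records. It is not the actual pipeline used to build the subsets: the field names (`url`, `license_abbr`, `license_disagreement`) and the helper names are hypothetical and chosen only for illustration, so check the dataset pages for the real schema.

```python
# Illustrative only: field names below are assumptions, not the real C5 schema.

def is_recommended(sample: dict) -> bool:
    """Return True if a sample would survive the three strict filters."""
    # Split a license abbreviation such as "cc-by-nc-sa" into its parts.
    license_parts = sample.get("license_abbr", "").lower().split("-")
    return (
        not sample.get("license_disagreement", False)  # licenses found on the page agree
        and "nc" not in license_parts                   # drop non-commercial licenses
        and "wikipedia.org" not in sample.get("url", "")  # drop Wikipedia samples
    )


def filter_recommended(samples: list[dict]) -> list[dict]:
    """Keep only the samples that pass every filter."""
    return [s for s in samples if is_recommended(s)]


samples = [
    {"url": "https://example.org/a", "license_abbr": "cc-by", "license_disagreement": False},
    {"url": "https://example.org/b", "license_abbr": "cc-by-nc-sa", "license_disagreement": False},
    {"url": "https://en.wikipedia.org/wiki/X", "license_abbr": "cc-by-sa", "license_disagreement": False},
    {"url": "https://example.org/c", "license_abbr": "cc-by", "license_disagreement": True},
]

kept = filter_recommended(samples)
print(f"kept {len(kept)} of {len(samples)} samples")
```

Even on this toy input, three of the four records are dropped, which mirrors the reduction in quantity mentioned in the post.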

Organizations

None yet

models 0

None public yet

datasets 0

None public yet