HuggingFaceFW-Dev

Enterprise
Activity Feed

AI & ML interests

None defined yet.

Recent Activity

HuggingFaceFW-Dev's activity

davanstrienย 
posted an update 1 day ago
view post
Post
1530
Hacked together a way to log trl GRPO training completions to a ๐Ÿค— dataset repo. This allows you to:

- Track rewards from multiple reward functions
- Treat the completion and rewards from training as a "proper" dataset and do EDA
- Share results for open science

The implementation is super hacky, but I'm curious if people would find this useful.

To push completions to the Hub, you just need two extra parameters:

log_completions=True
log_completions_hub_repo='your-username/repo-name'

Example dataset: davanstrien/test-logs
Colab: https://colab.research.google.com/drive/1wzBFPVthRYYTp-mEYlznLg_e_0Za1M3g

davanstrienย 
posted an update 5 days ago
davanstrienย 
posted an update 7 days ago
view post
Post
1826
How do you make 1M+ Hugging Face models & datasets more discoverable?

davanstrien/Smol-Hub-tldr!

I fine-tuned HuggingFaceTB/SmolLM2-360M to generate one-line summaries from a model or dataset README.

Its own self-description?
"A model for generating concise summaries of model & dataset cards from the Hugging Face Hub"

The goal? Make it easier to find the right models and datasets for your specific needs. It's already powering a semantic search for datasets Space.

It's still a WIP but thanks to @loubnabnl , @anton-l , @eliebak et al, for cooking such a nice base model for fine-tuning small, efficient models for specific domains and tasks. ๐Ÿ™
davanstrienย 
posted an update 8 days ago
davanstrienย 
posted an update 23 days ago
davanstrienย 
posted an update 24 days ago
davanstrienย 
posted an update 25 days ago
view post
Post
2019
๐ŸŒ Big step for multilingual AI data!

The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
โ€ข Japanese
โ€ข Italian
โ€ข Old High German

Learn more and contribute: https://huggingface.co/blog/davanstrien/fineweb2-community

These ratings can help enhance training data for major world languages.
  • 1 reply
ยท
nataliaElvย 
posted an update about 1 month ago
view post
Post
1461
New chapter in the Hugging Face NLP course! ๐Ÿค— ๐Ÿš€

We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.ย 

Any feedback for improvements welcome!

https://huggingface.co/learn/nlp-course/chapter10
davanstrienย 
posted an update about 1 month ago
view post
Post
3068
Introducing scandi-fine-web-cleaner davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.

Today, I'm happy to share the first classifier trained on this data.

๐Ÿ” What we've built:

- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

๐ŸŒ Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C ( data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
  • 1 reply
ยท