AI & ML interests

None defined yet.

Recent Activity

nanotron's activity

nouamanetazi 
published a Space 9 days ago
thomwolf 
posted an update about 2 months ago
view post
Post
5065
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
  • 2 replies
·
thomwolf 
posted an update about 2 months ago
thomwolf 
posted an update about 2 months ago
loubnabnl 
posted an update 2 months ago
view post
Post
2082
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
thomwolf 
posted an update 2 months ago
SaylorTwift 
posted an update 2 months ago
thomwolf 
posted an update 2 months ago
thomwolf 
posted an update 3 months ago
view post
Post
4172
Parents in the 1990: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨
  • 2 replies
·