AI & ML interests

None defined yet.

Recent Activity

bigscience-data's activity

meg 
posted an update 5 days ago
view post
Post
2829
💫...And we're live!💫 Seasonal newsletter from ethicsy folks at Hugging Face, exploring the ethics of "AI Agents"
https://huggingface.co/blog/ethics-soc-7
Our analyses found:
- There's a spectrum of "agent"-ness
- *Safety* is a key issue, leading to many other value-based concerns
Read for details & what to do next!
With @evijit , @giadap , and @sasha
yjernite 
posted an update 5 days ago
view post
Post
2039
🤗👤 💻 Speaking of AI agents ...
...Is easier with the right words ;)

My colleagues @meg @evijit @sasha and @giadap just published a wonderful blog post outlining some of the main relevant notions with their signature blend of value-informed and risk-benefits contrasting approach. Go have a read!

https://huggingface.co/blog/ethics-soc-7
albertvillanova 
posted an update 11 days ago
yjernite 
posted an update about 1 month ago
view post
Post
2128
🇪🇺 Policy Thoughts in the EU AI Act Implementation 🇪🇺

There is a lot to like in the first draft of the EU GPAI Code of Practice, especially as regards transparency requirements. The Systemic Risks part, on the other hand, is concerning for both smaller developers and for external stakeholders.

I wrote more on this topic ahead of the next draft. TLDR: more attention to immediate large-scale risks and to collaborative solutions supported by evidence can help everyone - as long as developers disclose sufficient information about their design choices and deployment contexts.

Full blog here, based on our submitted response with @frimelle and @brunatrevelin :

https://huggingface.co/blog/yjernite/eu-draft-cop-risks#on-the-proposed-taxonomy-of-systemic-risks
  • 2 replies
·
lhoestq 
posted an update about 1 month ago
view post
Post
1722
Made a HF Dataset editor a la gg sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)
thomwolf 
posted an update about 1 month ago
view post
Post
4868
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
  • 2 replies
·
christopher 
posted an update about 1 month ago
view post
Post
1614
The folks at Foursquare released a dataset of 104.5 million places of interest ( foursquare/fsq-os-places) and here's all of them on a plot
·
christopher 
posted an update about 1 month ago
thomwolf 
posted an update about 1 month ago
thomwolf 
posted an update about 2 months ago
loubnabnl 
posted an update about 2 months ago
view post
Post
1905
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
thomwolf 
posted an update about 2 months ago
albertvillanova 
posted an update about 2 months ago
view post
Post
1574
🚨 How green is your model? 🌱 Introducing a new feature in the Comparator tool: Environmental Impact for responsible #LLM research!
👉 open-llm-leaderboard/comparator
Now, you can not only compare models by performance, but also by their environmental footprint!

🌍 The Comparator calculates CO₂ emissions during evaluation and shows key model characteristics: evaluation score, number of parameters, architecture, precision, type... 🛠️
Make informed decisions about your model's impact on the planet and join the movement towards greener AI!
thomwolf 
posted an update about 2 months ago