Yacine Jernite's picture

Yacine Jernite

yjernite

AI & ML interests

Technical, community, and regulatory tools of AI governance @HuggingFace

Recent Activity

Articles

Organizations

Hugging Face's profile picture Society & Ethics's profile picture BigScience Workshop's profile picture GEM benchmark's profile picture BigScience Catalogue Data's profile picture BigScience Data's profile picture HF Task Exploration's profile picture HuggingFaceM4's profile picture BigCode's profile picture Stable Bias's profile picture Hugging Face H4's profile picture ๐Ÿค— H4 Community's profile picture BigCode Data's profile picture Stable Diffusion Bias Eval's profile picture Librarian Bots's profile picture Blog-explorers's profile picture Evaluating Social Impacts of Generative AI's profile picture llm-values's profile picture Bias Leaderboard Development's profile picture AI Energy Score Project's profile picture Journalists on Hugging Face's profile picture Social Post Explorers's profile picture Frugal AI Challenge's profile picture

yjernite's activity

reacted to meg's post with ๐Ÿ”ฅ 5 days ago
view post
Post
2829
๐Ÿ’ซ...And we're live!๐Ÿ’ซ Seasonal newsletter from ethicsy folks at Hugging Face, exploring the ethics of "AI Agents"
https://huggingface.co/blog/ethics-soc-7
Our analyses found:
- There's a spectrum of "agent"-ness
- *Safety* is a key issue, leading to many other value-based concerns
Read for details & what to do next!
With @evijit , @giadap , and @sasha
posted an update 5 days ago
view post
Post
2039
๐Ÿค—๐Ÿ‘ค ๐Ÿ’ป Speaking of AI agents ...
...Is easier with the right words ;)

My colleagues @meg @evijit @sasha and @giadap just published a wonderful blog post outlining some of the main relevant notions with their signature blend of value-informed and risk-benefits contrasting approach. Go have a read!

https://huggingface.co/blog/ethics-soc-7
reacted to merve's post with ๐Ÿ‘€ about 1 month ago
view post
Post
3348
Apollo is a new family of open-source video language models by Meta, where 3B model outperforms most 7B models and 7B outperforms most 30B models ๐Ÿงถ

โœจ the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 with A2.0 license, based on Qwen1.5 & Qwen2
โœจ the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work โฏ๏ธ

Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
they evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find out that whatever design decision was applied to small models also scale properly when the model and dataset are scaled ๐Ÿ“ˆ scaling dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies, and find that FPS sampling is better than uniform sampling, and they find 8-32 tokens per frame optimal
> They also compare image encoders, they try a variation of models from shape optimized SigLIP to DINOv2
they find google/siglip-so400m-patch14-384 to be most powerful ๐Ÿ”ฅ
> they also compare freezing different parts of models, training all stages with some frozen parts give the best yield

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo 7B outperforms 30B models ๐Ÿ”ฅ
ยท
reacted to fdaudens's post with ๐Ÿ‘€ about 1 month ago
view post
Post
1335
Did a fun experiment: What are the main themes emerging from the 100+ Nieman Journalism Lab predictions for 2025?

I used natural language processing to cluster and map them โ€” really helps spot patterns that weren't obvious when reading predictions one by one. So what will shape journalism next year? A lot of AI and US politics (surprise!), but there's also this horizontal axis that spans from industry strategies to deep reflections on how to talk to the public.

Click any dot to explore the original prediction. What themes surprise/interest you the most?

๐Ÿ‘‰ fdaudens/nieman_lab_2025_predictions_visualization

P.s.: I discovered that Nieman Lab's content is under Creative Commons license!
posted an update about 1 month ago
view post
Post
2128
๐Ÿ‡ช๐Ÿ‡บ Policy Thoughts in the EU AI Act Implementation ๐Ÿ‡ช๐Ÿ‡บ

There is a lot to like in the first draft of the EU GPAI Code of Practice, especially as regards transparency requirements. The Systemic Risks part, on the other hand, is concerning for both smaller developers and for external stakeholders.

I wrote more on this topic ahead of the next draft. TLDR: more attention to immediate large-scale risks and to collaborative solutions supported by evidence can help everyone - as long as developers disclose sufficient information about their design choices and deployment contexts.

Full blog here, based on our submitted response with @frimelle and @brunatrevelin :

https://huggingface.co/blog/yjernite/eu-draft-cop-risks#on-the-proposed-taxonomy-of-systemic-risks
  • 2 replies
ยท
reacted to dvilasuero's post with โค๏ธ๐Ÿ”ฅ about 1 month ago
view post
Post
2315
๐ŸŒ Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Tรฉcnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

๐Ÿท๏ธ +200 contributors used Argilla MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. ๐Ÿ—ฝ Culturally Agnostic: no specific regional, cultural knowledge is required.
2. โš–๏ธ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.

Moreover, we provide high quality translations of 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU
reacted to fdaudens's post with โค๏ธ about 1 month ago
view post
Post
1081
๐Ÿ“ˆ๐Ÿ‘€ Just dropped: visualization mapping Hugging Face's most liked & downloaded models from 2022 to now. Small models are clearly on the rise - fascinating shift in both likes and download patterns.

Check it out: huggingface/open-source-ai-year-in-review-2024
reacted to AdinaY's post with โค๏ธ about 1 month ago
view post
Post
1485
2023 & 2024 Top Downloaded (all time) Open Models on the hub are both from the Chinese community ๐Ÿ‘€

2023 ๐Ÿ‘‰ Bge base by BAAI
BAAI/bge-base-en-v1.5
2024 ๐Ÿ‘‰ Qwen 2.5 by Alibaba Qwen
Qwen/Qwen2.5-1.5B-Instruct

Canโ€™t wait to see what incredible models the Chinese community will bring in 2025๐Ÿš€

โœจ Follow https://huggingface.co/zh-ai-community to get the latest updates from the Chinese community
โœจ Explore the 2024 Year in Review huggingface/open-source-ai-year-in-review-2024
reacted to cfahlgren1's post with โค๏ธ about 2 months ago
view post
Post
3139
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
  • 1 reply
ยท
reacted to fdaudens's post with ๐Ÿ”ฅ 2 months ago
view post
Post
1836
Fascinating point from @thomwolf at Web Summit: AI misuse (deepfakes, fake news) is actually easier to make with closed models, not with open-source ones.

This challenges the common narrative that open-source AI is inherently more dangerous. The reality is more nuanced - while we may think open source is technically easier to misuse, closed models' accessibility and product-focused design appear to be driving more actual harm.

Important context for current AI safety discussions and regulation debates.

Do you agree? ๐Ÿ‘‡
  • 1 reply
ยท
reacted to erinys's post with ๐Ÿš€ 3 months ago
reacted to fdaudens's post with โค๏ธ๐Ÿ‘€ 5 months ago
view post
Post
1505
โ€˜AI in the Newsโ€™ of the day:

Anthropic publishes the โ€˜system promptsโ€™ that make Claude tick
- "In its continued effort to paint itself as a more ethical, transparent AI vendor, Anthropic has published the system prompts for its latest models"
- They specify that โ€œClaude cannot open URLs, links, or videos, perform facial recognition or identify or name any humans in photos.
- "Anthropic is exerting pressure on competitors to publish the same. Weโ€™ll have to see if the gambit works."
https://techcrunch.com/2024/08/26/anthropic-publishes-the-system-prompt-that-makes-claude-tick/

Chinaโ€™s tech giants splash out on AI despite US restrictions (paywall)
- "Alibaba, Tencent and Baidu had combined capital expenditure of Rmb50bn ($7bn) in the first half, compared with Rmb23bn a year earlier. TikTok parent ByteDance (which is private) has also increased AI-related spending"
- Nvidia's H100 and upcoming Blackwell series are under US restrictions, but Chinaโ€™s tech giants can buy H20
- Analysts expect Nvidia to ship more than 1mn of the processors to Chinese tech groups in the coming months.
https://www.ft.com/content/31bffc48-2ca7-472b-9d53-3deaad2d86ce

MZ "said it was improper for the Biden administration to have pressured Facebook to censor content in 2021 related to the coronavirus pandemic"
- "At the time, Facebookโ€™s publicly stated goal was to push millions of people toward Covid-19 vaccines. In his letter, Zuckerberg didnโ€™t indicate whether he had changed his mind about that goal"
https://www.wsj.com/tech/mark-zuckerberg-neutral-politics-letter-election-2024-02b86372

Food for thought:
- Why donโ€™t women use artificial intelligence?
https://www.economist.com/finance-and-economics/2024/08/21/why-dont-women-use-artificial-intelligence
- Most AI avatars look female, young and attractive. Are they a passing trend or here to stay?
https://reutersinstitute.politics.ox.ac.uk/news/most-ai-avatars-look-female-young-and-attractive-are-they-passing-trend-or-here-stay
reacted to clem's post with ๐Ÿ”ฅ 5 months ago
view post
Post
4134
Just crossed 200,000 free public AI datasets shared by the community on Hugging Face! Text, image, video, audio, time-series & many more... Thanks everyone!

http://hf.co/datasets
reacted to lunarflu's post with ๐Ÿ”ฅ 6 months ago
view post
Post
1889
Cool things this week from @huggingface !

๐ŸŒŽAI math olympiad winner NuminaMath is here!
๐Ÿค—Announcing New Hugging Face and Keras NLP integration
โœจUI overhaul to HF tokens!
๐ŸงŠ Embed our dataset viewer on any webpage!

https://huggingface.co/blog/winning-aimo-progress-prize
https://huggingface.co/blog/keras-nlp-integration
https://huggingface.co/settings/tokens
https://x.com/julien_c/status/1812099420726456457

Check out the full list on our discord! ๐Ÿ‘‡
https://discord.com/invite/JfAtkvEtRb
reacted to fdaudens's post with โค๏ธ๐Ÿš€๐Ÿค๐Ÿ”ฅ 6 months ago
view post
Post
3308
Small models, BIG impact: SmolLM is here! ๐Ÿš€๐Ÿ”ฌ

We're launching a series of small but mighty language models:
๐ŸŽ๏ธ Super fast - runs on laptops, phones, you name it!
๐Ÿ“ 3 sizes: 130M, 350M, and 1.5B parameters
๐Ÿฅ‡ Outperforms same size models from Meta, Microsoft, and Qwen
๐Ÿ”“ Fully open-source: datasets, training code, models

๐Š๐ž๐ฒ ๐Ÿ๐ž๐š๐ญ๐ฎ๐ซ๐ž๐ฌ
- Trained on FineWeb-Edu and Cosmopedia v2 (largest synthetic pre-training dataset)
- No cloud needed - run locally for privacy and energy efficiency
- Everything is public, from data curation to training steps

๐๐จ๐ญ๐ž๐ง๐ญ๐ข๐š๐ฅ ๐ฎ๐ฌ๐ž ๐œ๐š๐ฌ๐ž๐ฌ
- On-device autocomplete
- Local request parsing
- Custom fine-tuning for specific needs without the need for expensive GPUs

๐†๐จ ๐๐ž๐ž๐ฉ๐ž๐ซ
๐Ÿ‘‰ Check it out: https://huggingface.co/collections/HuggingFaceTB/smollm-models-6695016cad7167254ce15966
๐Ÿ‘‰ Run the 360M model in your browser, 100 % private: HuggingFaceTB/SmolLM-360M-Instruct-WebGPU
๐Ÿ‘‰ Read the blog explaining everything in detail: huggingface.co/blog/smollm

Kudos to the stellar team who worked on this project: @loubnabnl @anton-l @eliebak @lvwerra