BigCode

Enterprise

non-profit

https://www.bigcode-project.org/

BigCodeProject

bigcode-project

Recent Activity

terryyz updated a dataset about 9 hours ago

bigcode/arena-log-test

terryyz published a dataset 1 day ago

bigcode/arena-log-test

terryyz updated a Space 4 days ago

bigcode/arena

Articles

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

Jun 18, 2024

• 52

terryyz

updated a Space 4 days ago

BigCodeArena

🚀

Compare code outputs from two AI models

afaji

authored a paper 11 days ago

Predicting the Order of Upcoming Tokens Improves Language Modeling

Paper • 2508.19228 • Published 13 days ago • 21

joelniklaus

authored 5 papers 19 days ago

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Paper • 2411.19799 • Published Nov 29, 2024 • 14

LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Paper • 2505.12864 • Published May 19 • 2

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Paper • 2508.04796 • Published Aug 6

From Citations to Criticality: Predicting Legal Decision Influence in the Multilingual Swiss Jurisprudence

Paper • 2410.13460 • Published Oct 17, 2024

Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland

Paper • 2410.13456 • Published Oct 17, 2024

yjernite

authored 3 papers about 1 month ago

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Paper • 2406.16746 • Published Jun 24, 2024

In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

Paper • 2503.16861 • Published Mar 21 • 1

A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety

Paper • 2506.22183 • Published Jun 27

yjernite

posted an update about 1 month ago

First GPAI Model with EU Data Transparency Template? 🇪🇺

With the release of the EU data transparency template this week, we finally got to see one of the most meaningful artifacts to come out of the AI Act implementation so far (haven't you heard? AI's all about the data! 📊📚)

The impact of the template will depend on how effectively it establishes a minimum meaningful transparency standard for companies that don't otherwise offer any transparency into their handling of e.g. personal data or (anti?-)competitive practices in commercial licensing - we'll see how those play out as new models are released after August 2nd 👀

In the meantime, I wanted to see how the template works for a fully open-source + commercially viable model, so I filled it out for SmolLM3 - which my colleagues at Hugging Face released earlier this month 🤗 ICYMI, it's fully open-source with 3B parameters and performance matching the best similar-size models (I've switched all my local apps from Qwen3 to it, you should too 💡)

Verdict: congrats to the European Commission AI Office for making it so straightforward! Fully open and transparent models remain a cornerstone of informed regulation and governance, but the different organizational needs of their developers aren't always properly accounted for in new regulation. In this case, it took me all of two hours to fill out and publish the template (including reading the guidelines) - so kudos for making it feasible for smaller and distributed organizations 🙌 Definitely a step forward for transparency 🔍

To learn more, have a look at:

- The SmolLM3 model: HuggingFaceTB/SmolLM3-3B
- Its filled-out Public Summary of Training Content: hfmlsoc/smollm3-eu-data-transparency
- And if you're interested, some previous remarks on regulatory minimum meaningful standards for data disclosure: https://huggingface.co/blog/yjernite/naiac-data-transparency

rootacess

authored a paper 2 months ago

Robust Learning of Diverse Code Edits

Paper • 2503.03656 • Published Mar 5 • 3

RTT1

authored a paper 2 months ago

Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Paper • 2507.06229 • Published Jul 8 • 73

gagan3012

authored a paper 3 months ago

Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images

Paper • 2506.13458 • Published Jun 16

yjernite

posted an update 3 months ago

Congrats to the top trending dataset institutional/institutional-books-1.0!

This is a fantastic example of large-scale curation of public domain books with intentional governance for AI research and use - definitely recommend checking it out, experimenting with the metadata (institutional/institutional-books-1.0-metadata), and starting to build on top of it 🤗