Open RL Leaderboard

AI & ML interests

None defined yet.

Recent Activity

ClementRomac authored a paper 27 days ago

Meta Automatic Curriculum Learning

ClementRomac authored a paper 27 days ago

SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

ClementRomac authored a paper 27 days ago

MAGELLAN: Metacognitive predictions of learning progress guide autotelic LLM agents in large goal spaces

open-rl-leaderboard's activity

Aurelien-Morgan

posted an update 7 days ago

Post

3092

The Almighty function-caller

How would you like to build smart GenAI infrastructure,
give your edge agentic system an extensive tool memory,
and optimize the resources it takes to run a high-performance set of agents?

We came up with a novel approach to function-calling at scale for smart companies and corporate-grade use-cases.

Read our full-fledged blog article about it here on Hugging Face:
https://huggingface.co/blog/Aurelien-Morgan/the-almighty-function-caller

Aurelien-Morgan

posted an update 8 days ago

Post

644

retrain-pipelines 0.1.2 finally dropped. It comes with a hot Hugging Face Hub integration. Go check it out. We have 2 articles about it coming up. One is already fully written, so be on the lookout!
@retrain-pipelines

Also, I'll be volunteering at GOSIM AI Paris 2025. If you're interested in chatting, hmu.

ClementRomac

authored 4 papers 27 days ago

Meta Automatic Curriculum Learning

Paper • 2011.08463 • Published Nov 16, 2020

SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

Paper • 2410.12481 • Published Oct 16, 2024

MAGELLAN: Metacognitive predictions of learning progress guide autotelic LLM agents in large goal spaces

Paper • 2502.07709 • Published Feb 11

Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting

Paper • 2410.19920 • Published Oct 25, 2024

Aurelien-Morgan

posted an update about 1 month ago

Post

1990

Almost there!
https://test.pypi.org/project/test-010-retrain-pipelines/

clefourrier

posted an update about 2 months ago

Post

2338

Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons will be apples to apples, but potentially suboptimal for a given model family (just as some users interact suboptimally with models).

It also contains a cool section (6) on training data memorization rate! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.

clefourrier

authored a paper 3 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 229

qgallouedec

updated a dataset 5 months ago

open-rl-leaderboard/results_v2

Viewer • Updated Dec 7, 2024 • 93.4M • 1.28k • 1

clefourrier

authored a paper 5 months ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 19

Aurelien-Morgan

posted an update 6 months ago

Post

506

I just shipped retrain-pipelines 0.1.1 today. The doc is also pimped compared to the previous release, which was clearly not mature back then.
I'll have to focus on another project for the next couple of weeks, but feel free to open issues on the GitHub repo and discuss whatever interests you there (please?)!
In the meantime, you may enjoy retrying this:
https://huggingface.co/blog/Aurelien-Morgan/stateful-metaflow-on-colab

Aurelien-Morgan

posted an update 7 months ago

Post

560

I just published the first article in a pair. I could turn it into a longer series if you like them. This one dives into self-hosting Metaflow without needing S3, illustrated with a version tailored for Google Colab.
Find it at https://huggingface.co/blog/Aurelien-Morgan/stateful-metaflow-on-colab

2 replies

clefourrier

authored 2 papers 10 months ago

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

Paper • 2404.05904 • Published Apr 8, 2024 • 9

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 206

ClementRomac

authored a paper about 1 year ago

Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent

Paper • 2402.09844 • Published Feb 15, 2024 • 21

clefourrier

posted an update about 1 year ago

Post

6133

In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm

clefourrier

posted an update about 1 year ago

Post

4772

Contamination free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can get model scores averaged only on new problems out of the training data. This means... contamination free code evals! 🚀
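To illustrate the idea of date-based problem selection, here is a minimal Python sketch. It is not LiveCodeBench's actual code: the problem IDs, dates, pass/fail results, and the `score_after` helper are all made up for illustration. It only shows how averaging over problems published after a model's training cutoff differs from averaging over everything.

```python
from datetime import date

# Hypothetical per-problem results: (problem id, publication date, solved?)
results = [
    ("p1", date(2023, 5, 1), True),
    ("p2", date(2023, 9, 15), True),
    ("p3", date(2024, 1, 10), False),
    ("p4", date(2024, 2, 20), True),
]


def score_after(results, cutoff):
    """Average pass rate over problems published strictly after `cutoff`."""
    fresh = [solved for _, published, solved in results if published > cutoff]
    if not fresh:
        return None  # no problem is newer than the cutoff
    return sum(fresh) / len(fresh)


# Score over all problems, including ones the model may have seen in training.
overall = sum(solved for _, _, solved in results) / len(results)

# Score restricted to problems newer than an assumed training cutoff.
fresh_score = score_after(results, date(2023, 12, 31))
```

With this toy data the overall score is higher than the score on post-cutoff problems, which is the kind of gap that date filtering is meant to expose.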

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!

clefourrier

posted an update about 1 year ago

Post

2247

🆕 Evaluate your RL agents - who's best at Atari?🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations🚶and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard

clefourrier

posted an update about 1 year ago

Post

2244

Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompts (all present in the literature, from "Prompt question?" to "Question: prompt question?\nChoices: enumeration of all choices\nAnswer: "), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

Prompt format is on the x axis; all these evals look at the logprob of either "choice A/choice B..." or "A/B...".

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...
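As a rough sketch of the mechanics described in this post, the Python snippet below builds two prompt formats for the same question and picks the answer whose continuation has the highest log-probability. Everything in it is an assumption for illustration: the helper names, the exact layout of the longer format, and above all the `toy_table` log-probabilities, which are invented and stand in for a real model.

```python
LETTERS = "ABCDEFGH"


def bare_prompt(question, choices):
    # Simplest format: the question alone.
    return question


def full_prompt(question, choices):
    # Question, lettered choices, then an answer cue (one possible layout).
    listed = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return f"Question: {question}\nChoices:\n{listed}\nAnswer: "


def pick(prompt, continuations, logprob_fn):
    """Index of the continuation with the highest log-probability."""
    scores = [logprob_fn(prompt, c) for c in continuations]
    return scores.index(max(scores))


question, choices, gold = "What is the capital of France?", ["Lyon", "Paris"], 1

# Hypothetical log-probabilities, made up for illustration only.
toy_table = {
    (bare_prompt(question, choices), "Lyon"): -0.9,
    (bare_prompt(question, choices), "Paris"): -1.2,
    (full_prompt(question, choices), "A"): -1.5,
    (full_prompt(question, choices), "B"): -0.3,
}


def toy_logprob(prompt, continuation):
    # Stand-in for a model call; a real eval would query the model here.
    return toy_table[(prompt, continuation)]


# Bare format: score the full choice texts ("choice A/choice B...").
bare_pick = pick(bare_prompt(question, choices), choices, toy_logprob)

# Full format: score only the letters ("A/B...").
letters = list(LETTERS[: len(choices)])
full_pick = pick(full_prompt(question, choices), letters, toy_logprob)
```

In this toy setup the same "model" answers incorrectly with the bare format and correctly with the lettered one, so its score depends on the prompt alone.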

Team members 6