
CGIAR (Enterprise, non-profit, Verified)
AI & ML interests: none defined yet.
CGIAR's recent activity:
Post
2065
Absolute Zero: LLMs can train without any external data 🤯
Has the "data wall" just been breached?
Recent RL paradigms often relied on a set of questions and answers that need to be manually curated. Researchers from Tsinghua University went like "why, though?"
🤔 Indeed, why learn from questions designed by a human teacher, when the model can start from its base knowledge and learn by experimenting in a code environment, proposing coding tasks itself and trying to solve them?
Thus they created "Absolute Zero Reasoning" (AZR), an approach that removes any need for human-curated data.
🎭 Dual roles:
‣ Proposer: Generates challenging but solvable coding tasks
‣ Solver: Attempts to solve those self-proposed tasks
🧪 Three task types (a toy sketch follows this list): every task is defined as a triplet of program, input, and output
‣ Deduction: given a program and an input, the model must deduce the output
‣ Abduction: given a program and an output, it must find an input that produces that output
‣ Induction: given input/output pairs, it must synthesize a matching program
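To make the three task types concrete, here is a toy illustration (my own sketch, not the paper's code) of how a single (program, input, output) triplet spawns all three tasks:

```python
# Toy illustration of AZR-style task triplets; hand-written example,
# not code from the paper.
program = "lambda x: sorted(x)[-2]"      # returns the second-largest element
task_input = [3, 1, 4, 1, 5]
task_output = eval(program)(task_input)  # -> 4

# Deduction: given program + input, predict the output.
assert task_output == 4

# Abduction: given program + output, find an input that reproduces the
# output (many inputs can work).
candidate_input = [4, 9]
assert eval(program)(candidate_input) == task_output

# Induction: given input/output pairs, synthesize a matching program.
pairs = [([3, 1, 4, 1, 5], 4), ([10, 2, 7], 7)]
synthesized = lambda x: sorted(x)[-2]
assert all(synthesized(i) == o for i, o in pairs)
```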
Btw this reminded me of my long-forgotten philosophy classes: Aristotle was more on the induction side, learning from real-world analogies, while Plato was more on the deduction side, trying to progress quite far with just one input and his reasoning.
📊 Results:
‣ AZR post-training creates a nice improvement on known models like Qwen2.5-7B
‣ Shows strong cross-domain transfer: coding ↔️ math reasoning
🧐 Other findings:
‣ A stronger base model (general or code-specific) amplifies the gains from Absolute Zero Reasoning
‣ The researchers warn about "uh-oh moments" (a wink to DeepSeek's "aha moments") where the model generates concerning goals like "make an extremely convoluted code to outsmart all these humans": so supervision is still needed!
Paper here: Absolute Zero: Reinforced Self-play Reasoning with Zero Data (2505.03335)
Has the "data wall" just been breached?
Recent RL paradigms often relied on a set of questions an answers that needs to be manually curated. Researchers from Tsinghua University went like "why though".
๐ค Indeed, why learn from question designed by a human teacher, when the model can start from their base knowledge and learn by experimenting in a code environment, proposing coding tasks themselves and trying to solve them?
Thus they created โAbsolute Zero Reasoningโ (AZR), an approach that removes any need for human curated data.
๐ญ ๐๐๐ฎ๐น ๐ฟ๐ผ๐น๐ฒ๐:
โฃ Proposer: Generates challenging but solvable coding tasks
โฃ Solver: Attempts to solve those self-proposed tasks
๐งช ๐ง๐ต๐ฟ๐ฒ๐ฒ ๐๐ฎ๐๐ธ ๐๐๐ฝ๐ฒ๐: all types are defined as triplets of program, input and output
โฃ Deduction: Give model an input and program, it must deduce the output
โฃ Abduction: Give model an program and output, it must find the input that gave said output
โฃ Induction: Synthesize a program from input/output pairs
Btw this reminded me of my long-forgotten philosophy classes: Aristotle was more on the induction side, learning from real-world analogies, while Plato was more on the deduction side, trying to progress quite far with just one input and his reasoning.
๐ ๐ฅ๐ฒ๐๐๐น๐๐:
โฃ AZR post-training creates a nice improvement on known models like Qwen2.5-7B
โฃ Shows strong cross-domain transfer: coding โ๏ธ math reasoning
๐ง ๐ข๐๐ต๐ฒ๐ฟ ๐ณ๐ถ๐ป๐ฑ๐ถ๐ป๐ด๐:
โฃ Having a better base performance (general or code specific) amplify the gains from Absolute Zero Reasoning
โฃ Researchers warn about "Uh-oh moments" (winking to the "aha moments" of DeepSeek) where the model generates concerning goals like "make an extremely convoluted code to outsmart all these humans": so supervision is still needed!
Paper here: Absolute Zero: Reinforced Self-play Reasoning with Zero Data (2505.03335)
Post
4213
I've made an open version of Google's NotebookLM, and it shows the superiority of the open-source tech stack! 💪
The app's workflow is simple. Given a source PDF or URL, it extracts the content, then tasks Meta's Llama 3.3-70B with writing the podcast script, with a good prompt crafted by @gabrielchua ("two hosts, with lively discussion, fun notes, insightful questions, etc.")
Then it hands off the text-to-speech conversion to Kokoro-82M, and there you go: you have two hosts discussing any article.
The generation is nearly instant, because:
> Llama 3.3 70B is running at 1,000 tokens/second with Cerebras inference
> The audio is generated in streaming mode by the tiny (yet powerful) Kokoro, generating voices faster than real-time.
And the audio generation runs for free on Zero GPUs, hosted by HF on H200s.
Overall, open source solutions rival the quality of closed-source solutions at close to no cost!
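For the curious, the pipeline looks roughly like this in code (a hand-written sketch: the prompt, helper names, and the TTS call are illustrative, not the Space's actual source):

```python
# Rough sketch of the app's two-stage pipeline; prompt and helpers are
# illustrative, not the Space's actual code.
from huggingface_hub import InferenceClient

llm = InferenceClient("meta-llama/Llama-3.3-70B-Instruct")

def write_podcast_script(article_text: str) -> str:
    # Stage 1: Llama 3.3-70B turns the extracted content into a script
    prompt = (
        "Write a lively two-host podcast script discussing this article, "
        f"with fun notes and insightful questions:\n\n{article_text}"
    )
    response = llm.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=2048
    )
    return response.choices[0].message.content

def synthesize_audio(script: str) -> bytes:
    # Stage 2: hand the script to Kokoro-82M for text-to-speech
    # (placeholder call; the real app streams audio chunk by chunk)
    return llm.text_to_speech(script, model="hexgrad/Kokoro-82M")

audio = synthesize_audio(write_podcast_script("<extracted PDF/URL content>"))
```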
Try it here 👉 m-ric/open-notebooklm
Post
2781
New king of open VLMs: InternVL3 takes Qwen 2.5's crown! 👑
InternVL has been a wildly successful series of models, and the latest iteration has just taken back the crown thanks to its superior, natively multimodal vision training pipeline.
➡️ Most of the vision language models (VLMs) these days are built like Frankenstein's monster: take a good text-only Large Language Model (LLM) backbone and stitch a vision transformer (ViT) on top of it. Training is then sequential 🐢: 1. Freeze the LLM weights while you train the ViT to work with the LLM part, then 2. Unfreeze everything and train all weights together.
The Shanghai AI Lab decided to challenge this paradigm with an approach they call "native". For each of their model sizes, they still start from a good LLM (mostly the Qwen-2.5 series; did I tell you I'm a huge fan of Qwen? ❤️) and stitch on the ViT, but they freeze nothing: they train all weights together on interleaved text and image understanding data in a single pre-training phase.
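Schematically, the contrast between the two recipes looks like this (toy stand-in modules, not InternVL's actual training code):

```python
# Schematic contrast of the two recipes; VLM and train() are toy stand-ins.
import torch.nn as nn

class VLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vit = nn.Linear(16, 8)  # stand-in for the vision transformer
        self.llm = nn.Linear(8, 8)   # stand-in for the LLM backbone

def train(model: nn.Module, data: str) -> None:
    pass  # placeholder for a training loop

# Conventional sequential recipe:
vlm = VLM()
for p in vlm.llm.parameters():
    p.requires_grad = False          # 1. freeze the LLM, align the ViT to it
train(vlm, "alignment data")
for p in vlm.parameters():
    p.requires_grad = True           # 2. unfreeze and train everything
train(vlm, "multimodal data")

# InternVL3's "native" recipe: one phase, nothing ever frozen
train(VLM(), "interleaved text + image understanding data")
```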
They claim it results in more seamless interactions between modalities. And the results prove them right: they took the crown of top VLMs, at nearly all sizes, from their Qwen-2.5 parents.
Post
2352
The DeepSeek-R1 moment has come for GUI agents: rule-based reinforcement learning gives better results than SFT with 500x smaller datasets!
Traditionally (by which I mean "in the last few months"), GUI agents have been trained with supervised fine-tuning (SFT). This meant collecting huge datasets of screen captures from people using computers, and using these to fine-tune your model.
But last week, a new paper introduced UI-R1, applying DeepSeek's R1-style rule-based reinforcement learning (RL) specifically to GUI action prediction tasks.
This is big news: with RL, maybe we could build good agents without the need for huge datasets.
UI-R1 uses a unified reward function that evaluates multiple responses from the model, optimizing the policy with algorithms like Group Relative Policy Optimization (GRPO).
Specifically, the reward function assesses (a rough sketch in code follows this list):
🎯 Action type accuracy: Does the predicted action match the ground truth?
📍 Coordinate accuracy (specifically for clicks): Is the predicted click within the correct bounding box?
📝 Output format: Does the model clearly articulate both its reasoning and final action?
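Here is a rough sketch of what such a rule-based reward could look like (the components follow the paper's description, but the weights and function signature are my own illustration):

```python
# Rule-based reward in the spirit of UI-R1; weights and parsing are
# illustrative, not the paper's exact implementation.
def ui_reward(pred_action, pred_xy, gt_action, gt_bbox, well_formatted):
    reward = 0.0
    if well_formatted:                 # reasoning + final action clearly stated
        reward += 1.0
    if pred_action == gt_action:       # correct action type
        reward += 1.0
        if pred_action == "click" and pred_xy is not None:
            x, y = pred_xy
            x1, y1, x2, y2 = gt_bbox
            if x1 <= x <= x2 and y1 <= y <= y2:  # click lands inside the box
                reward += 1.0
    return reward

# A well-formatted click inside the ground-truth button scores highest:
print(ui_reward("click", (120, 40), "click", (100, 20, 200, 60), True))  # 3.0
```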
Using just 136 carefully selected mobile tasks (compared to 76,000 tasks for larger models like OS-Atlas), UI-R1 shows significant efficiency and improved performance:
📈 Boosted action prediction accuracy from 76% to 89% on AndroidControl.
🏆 Outperformed larger, SFT-trained models (e.g., OS-Atlas-7B), demonstrating superior results with vastly fewer data points (136 tasks vs. 76K).
🌐 Enhanced adaptability and generalization, excelling even in out-of-domain scenarios.
The paper tests this RL-based method only in low-level GUI tasks. Could it generalize to more complex interactions? 🧐
Read the full paper here 👉 UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (2503.21620)
Post
4990
smolagents now supports vLLM! 🥳
As one of the most popular local inference solutions, vLLM had long been requested by the community: after a heavy refactoring of our LLM classes, we've just released smolagents 1.11.0, with a brand new VLLMModel class.
Go try it and tell us what you think!
https://github.com/huggingface/smolagents/blob/45b2c86857b7f7657daaa74e4d17d347e9e2c4a4/src/smolagents/models.py#L497
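A minimal usage sketch (assuming smolagents >= 1.11.0 with vllm installed and a GPU that fits the weights; the model id is just an example):

```python
# Minimal sketch of the new VLLMModel class; model id is an example.
from smolagents import CodeAgent, VLLMModel

model = VLLMModel(model_id="Qwen/Qwen2.5-Coder-7B-Instruct")
agent = CodeAgent(tools=[], model=model)
agent.run("What is the 20th Fibonacci number?")
```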
Post
1071
Our new Agentic leaderboard is now live! 🔥
If you've ever wondered which LLM is best for powering agents, we've just made a leaderboard that ranks them all! Built with @albertvillanova, it ranks LLMs powering a smolagents CodeAgent on subsets of various benchmarks.
🏆 GPT-4.5 comes out on top, even beating reasoning models like DeepSeek-R1 or o1. And Claude-3.7-Sonnet is a close second!
The leaderboard also allows you to show the scores of vanilla LLMs (without any agentic setup) on the same benchmarks: this shows the huge improvements brought by agentic setups. 💪
(Note that results will be added manually, so the leaderboard might not always have the latest LLMs)
Post
4850
We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones! 🔥
Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.
To make a research survey, you generally follow two steps: preparation (collecting and organizing papers) and writing (outline creation, writing, polishing). The researchers followed the same two steps and automated them.
🎯 For the preparation part, a key challenge is finding all the important references on the given subject.
Researchers first cast a wide net to collect all relevant papers. But then finding the really important ones is like distilling knowledge from a haystack of information. To solve this challenge, they built an "AttributeTree" object that structures key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!
For the writing part, the key was to get a synthesis that's both short and true. This is not easy to get from LLMs! So they used methods like LLM-based deduplication to shorten the overly verbose listings produced by LLMs, and RAG to grab original quotes instead of made-up ones.
As a result, their system outperforms previous approaches by far!
As assessed by LLM judges, the quality score of SurveyX even approaches that of human experts, with 4.59/5 vs. 4.75/5!
I advise you to read the paper; it's a great overview of the kind of assistants we'll get in the near future! 👉 SurveyX: Academic Survey Automation via Large Language Models (2502.14776)
Their website shows examples of generated surveys 👉 http://www.surveyx.cn/
Post
3107
Less is More for Reasoning (LIMO): a 32B model fine-tuned with 817 examples can beat o1-preview on math reasoning! 🤯
Do we really need o1's huge RL procedure to see reasoning emerge? It seems not.
Researchers from Shanghai Jiaotong University just demonstrated that carefully selected examples can boost math performance in large language models using SFT, no huge datasets or RL procedures needed.
Their procedure allows Qwen2.5-32B-Instruct to jump from 6.5% to 57% on AIME and from 59% to 95% on MATH, while using only 1% of the data used by previous approaches.
⚡ The Less-is-More Reasoning Hypothesis:
‣ Minimal but precise examples that showcase optimal reasoning patterns matter more than sheer quantity
‣ Pre-training knowledge plus sufficient computational resources at inference time levels up math skills
➡️ Core techniques:
‣ High-quality reasoning chains with self-verification steps
‣ 817 handpicked problems that encourage deeper reasoning
‣ Enough inference-time computation to allow extended reasoning
💪 Efficiency gains:
‣ Only 817 examples instead of 100k+
‣ 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data
This really challenges the notion that SFT leads to memorization rather than generalization! And it opens up reasoning to GPU-poor researchers!
Read the full paper here 👉 LIMO: Less is More for Reasoning (2502.03387)
Post
2958
Great feature alert: you can now share agents to the Hub! 🥳🥳
And any agent pushed to the Hub gets a cool Space interface to directly chat with it.
This was a real technical challenge: serializing tools for export meant getting all the source code for a tool, verifying that it was standalone (not relying on external variables), and gathering all the packages required to make it run.
Go try it out! 👉 https://github.com/huggingface/smolagents
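In practice, sharing looks roughly like this (a sketch; the repo id is illustrative):

```python
# Sketch of sharing and reloading an agent via the Hub; repo id illustrative.
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())
agent.push_to_hub("your-username/my-agent")  # uploads tools, prompts, config

# Anyone can then pull it back (and chat with it in its Space interface):
restored = CodeAgent.from_hub("your-username/my-agent", trust_remote_code=True)
```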
Post
2557
For those who haven't come across it yet, here's a handy trick to discuss an entire GitHub repo with an LLM:
=> Just replace "github" with "gitingest" in the URL, and you get the whole repo as a single string that you can then paste into your LLM
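For example (assuming the gitingest.com service), https://github.com/huggingface/smolagents becomes https://gitingest.com/huggingface/smolagents, which serves the whole repo flattened into one prompt-ready text blob.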
Post
4875
"๐ฎ๐ฌ๐ฎ๐ฑ ๐๐ถ๐น๐น ๐ฏ๐ฒ ๐๐ต๐ฒ ๐๐ฒ๐ฎ๐ฟ ๐ผ๐ณ ๐๐ ๐ฎ๐ด๐ฒ๐ป๐๐": this statement has often been made, here are numbers to support it.
I've plotted the progress of AI agents on the GAIA test set, and it seems they're on track to catch up with the human baseline in early 2026.
And that progress is still driven mostly by the improvement of base LLMs: progress would be even faster with fine-tuned agentic models.
Post
3768
Adyen's new data agents benchmark shows that DeepSeek-R1 struggles on data science tasks!
➡️ How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Research, OpenAI's o1 was by far the best model to power an agentic system.
So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand.
But they really missed the mark. DeepSeek-R1 got only 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers.
🧐 These results really surprised us. We checked them thoroughly; we even thought our APIs for DeepSeek were broken, and colleagues Leandro and Anton helped me start custom instances of R1 on our own H100s to make sure it worked well.
But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data.
It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark; looking forward to seeing people propose better agents!
Read more in the blog post 👉 https://huggingface.co/blog/dabstep
Post
9937
Introducing open Deep-Research by Hugging Face! 🔥
OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.
⏱️ So with a team of cracked colleagues, we set ourselves a 24-hour deadline to replicate and open-source Deep Research! ⏱️
➡️ We built open-Deep-Research, an entirely open agent that can navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculations on data...
We aimed for the best performance: are the agent's answers really rigorous?
On the GAIA benchmark, Deep Research had 67% accuracy on the validation set.
➡️ open Deep Research is at 55% (powered by o1), which makes it:
- the best pass@1 solution submitted
- the best open solution 💪💪
And it's only getting started! Please jump in, drop PRs, and let's bring it to the top!
Read the blog post 👉 https://huggingface.co/blog/open-deep-research
Post
3162
Now you can launch a code agent directly from your terminal!
✨ smolagent "Your task" directly launches a CodeAgent
▶️ This also works with web agents (replace smolagent with webagent), thanks to @merve!
💾 Another treat from the smolagents 1.7.0 release:
Now agents have a memory mechanism, enabling many possibilities like replaying the last run with agent.replay(), thank you @clefourrier!
Check the release notes here 👉 https://github.com/huggingface/smolagents/releases/tag/v1.7.0
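Here's what the memory mechanism looks like in use (a minimal sketch; the model class and task are illustrative):

```python
# Minimal sketch of the replay feature from smolagents 1.7.0.
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())
agent.run("How many seconds are there in a leap year?")
agent.replay()  # pretty-prints the steps of the last run from memory
```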
Post
4118
The Hub welcomes external inference providers! 🔥
Hosting our own inference was not enough: the Hub now welcomes 4 new inference providers (fal, Replicate, SambaNova Systems, and Together AI).
Check model cards on the Hub: you can now, in one click, use inference from various providers (cf. the video demo).
Their inference can also be used through our Inference API client. There, you can use either your own provider key or your HF token; with the HF token, billing is handled directly on your HF account, as a way to centralize all expenses.
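In code, that looks roughly like this (a sketch assuming a recent huggingface_hub; the provider and model id are examples, and the model must actually be served by that provider):

```python
# Sketch of calling a third-party provider through the HF client.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="together")  # billed via your HF account
response = client.chat_completion(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
```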
💸 Also, PRO users get $2 of inference credits per month!
Read more in the announcement 👉 https://huggingface.co/blog/inference-providers