data-agents (Data Agents)

Releasing the Jupyter Agent Dataset! 🚀

Built from 7 TB of real Kaggle datasets + 20k notebooks, creating real code exec traces using Qwen3-Coder and E2B.
Training on this data dramatically improves the ability to execute code and analyze data.

We ( @baptistecolle @hannayukhymenko @lvwerra ) have created a novel synthetic data generation pipeline with efficient scaffolding, which gives a big performance boost after training your coding agent🔥With the help of real Kaggle notebooks and datasets we generate synthetic notebooks which aim to analyze datasets and answer factual questions about them more efficiently. We simulate a real code execution environment by prompting LLMs or with the help of E2B sandboxes. We have built a dataset of 50k+ high-quality LLM-generated notebooks which can help your agent become better at performing data analysis and question answering.

Link: data-agents/jupyter-agent-dataset

3 replies

·

hannayukhymenko

published a dataset 6 days ago

data-agents/jupyter-agent-dataset

Viewer • Updated 4 days ago • 95.8k • 1.81k • 109

eggie5

authored a paper 21 days ago

DABstep: Data Agent Benchmark for Multi-step Reasoning

Paper • 2506.23719 • Published Jun 30

thomwolf

authored a paper 2 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 70

lvwerra

authored a paper 2 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 70

freddyaboulton

posted an update 3 months ago

Post

3832

The new multimodalart/self-forcing model and demo are truly impressive!

freddyaboulton

posted an update 3 months ago

Post

593

Time is running out! ⏰

Less than 24 hours to participate in the MCP Hackathon and win thousands of dollars in prizes! Don't miss this opportunity to showcase your skills.

Visit Agents-MCP-Hackathon/AI-Marketing-Content-Creator to register!

freddyaboulton

posted an update 3 months ago

Post

439

🚨 NotebookLM Dethroned?! 🚨

Meet Fluxions vui: The new open-source dialogue generation model.
🤯 100M Params, 40k hours audio!
🎙️ Multi-speaker audio
😂 Non-speech sounds (like [laughs]!)
📜 MIT License

Is this the future of content creation? Watch the video and decide for yourself!

https://huggingface.co/spaces/fluxions/vui-spacehttps://huggingface.co/fluxions/vui

1 reply

·

loubnabnl

authored a paper 3 months ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 46

thomwolf

authored a paper 3 months ago

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published Jun 2 • 131

loubnabnl

posted an update 4 months ago

Post

4138

SmolVLM is now available on PocketPal — you can run it offline on your smartphone to interpret the world around you. 🌍📱

And check out this real-time camera demo by @ngxson , powered by llama.cpp:
https://github.com/ngxson/smolvlm-realtime-webcam
https://x.com/pocketpal_ai

3 replies

·

lewtun

authored a paper 4 months ago

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Paper • 2504.11354 • Published Apr 15 • 6

hannayukhymenko

posted an update 5 months ago

Post

3639

🚀 We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM!

📈 MamayLM surpasses similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models.

📊 MamayLM is a 9B model that can run on a single GPU, enabling cost-efficient AI autonomy and adoption across sectors in Ukraine such as education, legal, healthcare, public services and others (e.g., by specializing it to particular use cases). MalayLM is also attractive for organizations wishing to preserve data privacy as it s efficiency allows it to run on a local machine.

🧠 MamayLM is trained on high-quality Ukrainian data and understands Ukrainian language, culture, and history. It is built on top of Google’s Gemma 2 9B model, but uses a number of new advances stemming from INSAIT’s experience in creating BgGPT, a Bulgarian LLM we released last year, now adopted nationwide and profiled several times by Google as a worldwide success case.

🤝 MamayLM is developed in a collaboration between researchers at INSAIT and ETH Zürich and is trained entirely via donations to INSAIT for AI compute resources.

📥 MamayLM is now freely available to download on INSAIT’s HuggingFace in both full and quantized versions. We also publicly release all Ukrainian benchmarks we evaluated on.

📝 Further, we release blog posts in both English and Ukrainian, sharing our approach to creating MamayLM, hoping to drive further improvements by the community.

🌎 The release of LLMs for various languages is part of INSAIT’s mission in ensuring countries can achieve AI autonomy in a cost-efficient, controlled, safe and predictable manner.

MamayLM model and benchmarks:

INSAIT-Institute
Blog (EN): https://huggingface.co/blog/INSAIT-Institute/mamaylm
Blog (UKR): https://huggingface.co/blog/INSAIT-Institute/mamaylm-ukr

1 reply

·

thomwolf

posted an update 5 months ago

Post

6559

If you've followed the progress of robotics in the past 18 months, you've likely noticed how robotics is increasingly becoming the next frontier that AI will unlock.

At Hugging Face—in robotics and across all AI fields—we believe in a future where AI and robots are open-source, transparent, and affordable; community-built and safe; hackable and fun. We've had so much mutual understanding and passion working with the Pollen Robotics team over the past year that we decided to join forces!

You can already find our open-source humanoid robot platform Reachy 2 on the Pollen website and the Pollen community and people here on the hub at

pollen-robotics

We're so excited to build and share more open-source robots with the world in the coming months!

1 reply

·

thomwolf

authored a paper 5 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 199

lewtun

authored a paper 5 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 199

Data Agents

AI & ML interests

Recent Activity

[bot] Conversion to Parquet

data-agents/jupyter-agent-dataset

I run but error

I run space, but error

data-agents/jupyter-agent-dataset

DABstep: Data Agent Benchmark for Multi-step Reasoning

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

SmolVLM: Redefining small and efficient multimodal models

SmolVLM: Redefining small and efficient multimodal models

AI & ML interests

Recent Activity

Team members 16

data-agents's activity

[bot] Conversion to Parquet

I run but error

I run space, but error