HuggingFaceFW-Dev

Enterprise

Activity Feed

AI & ML interests

None defined yet.

Recent Activity

thomwolf authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

lvwerra authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

mjaggi authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

View all activity

thomwolf

authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 10 days ago • 57

lvwerra

authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 10 days ago • 57

mjaggi

authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 10 days ago • 57

negar-foroutan

authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 10 days ago • 57

vsabolcec

authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 10 days ago • 57

hynky

authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 10 days ago • 57

guipenedo

authored a paper 9 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 10 days ago • 57

Ameeeee

posted an update 10 days ago

Post

1542

🤗 Here’s a fun educational video I made to show how Sheets and AI can upgrade your structured content.

Better tables and clearer messages with just a little help from open-source AI!

aisheets/sheets

Ameeeee

posted an update 25 days ago

Post

1754

With Sheets, try a new way to create structured content with the help of AI!

No installs. No login. Just open a link and 🤩

This app lets you create a dataset by importing a file or starting from a prompt.

What’s different about SHEETS?
🔎 Web search integration to ground answers in real-world data
📚 In-context learning from validated sources
🔗 Transparent sourcing — every result is linked
🧩 Runs on multiple open-source models

Fight hallucinations and start creating content you can rely on.

dvilasuero

posted an update 25 days ago

Post

2608

Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.

A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.

Today, I'm thrilled to introduce our first step in this direction.

In a nutshell:

📁 Effortlessly run prompts and models over your data.
🌐 Agentic search for accuracy and real-time information.
🖼️ Familiar, minimalistic interface for interacting with data.
🎯 Human feedback 2.0: Your input directly improves generated data.
💯 Access hundreds of open models and leading inference providers.

Go to this space to try it out!

aisheets/sheets

Leave your questions below, we're just getting started!

2 replies

davanstrien

posted an update 27 days ago

Post

2874

Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp

loubnabnl

authored a paper 28 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published about 1 month ago • 42

guipenedo

authored a paper 29 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published about 1 month ago • 42

eliebak

authored a paper 29 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published about 1 month ago • 42

thomwolf

authored a paper about 1 month ago

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published Jun 2 • 108

loubnabnl

posted an update about 2 months ago

Post

2969

SmolVLM is now available on PocketPal — you can run it offline on your smartphone to interpret the world around you. 🌍📱

And check out this real-time camera demo by @ngxson , powered by llama.cpp:
https://github.com/ngxson/smolvlm-realtime-webcam
https://x.com/pocketpal_ai

3 replies

lewtun

authored a paper about 2 months ago

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Paper • 2504.11354 • Published Apr 15 • 6

davanstrien

posted an update 2 months ago

Post

2261

Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets - Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)

thomwolf

posted an update 3 months ago

Post

5388

If you've followed the progress of robotics in the past 18 months, you've likely noticed how robotics is increasingly becoming the next frontier that AI will unlock.

At Hugging Face—in robotics and across all AI fields—we believe in a future where AI and robots are open-source, transparent, and affordable; community-built and safe; hackable and fun. We've had so much mutual understanding and passion working with the Pollen Robotics team over the past year that we decided to join forces!

You can already find our open-source humanoid robot platform Reachy 2 on the Pollen website and the Pollen community and people here on the hub at

pollen-robotics

We're so excited to build and share more open-source robots with the world in the coming months!

1 reply

davanstrien

posted an update 3 months ago

Post

1700

I've created a v1 dataset ( davanstrien/reasoning-required) and model ( davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding reasoning dataset domain coverage.

AI & ML interests

Recent Activity

Team members 17

HuggingFaceFW-Dev's activity