HuggingFaceBR4

company

Activity Feed

AI & ML interests

Removing dependencies, building everything from scratch

Recent Activity

hynky authored a paper 2 days ago

Towards Best Practices for Open Datasets for LLM Training

guipenedo authored a paper 2 days ago

Towards Best Practices for Open Datasets for LLM Training

thomwolf authored a paper 2 days ago

Towards Best Practices for Open Datasets for LLM Training

View all activity

HuggingFaceBR4's activity

hynky

authored a paper 2 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 4 days ago • 40

guipenedo

authored a paper 2 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 4 days ago • 40

thomwolf

authored a paper 2 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 4 days ago • 40

hlarcher

posted an update 2 days ago

Post

976

We are introducing multi-backend support in Hugging Face Text Generation Inference!
With new TGI architecture we are now able to plug new modeling backends to get best performances according to selected model and available hardware. This first step will very soon be followed by the integration of new backends (TRT-LLM, llama.cpp, vLLM, Neuron and TPU).

We are polishing the TensorRT-LLM backend which achieves impressive performances on NVIDIA GPUs, stay tuned 🤗 !

Check out the details: https://huggingface.co/blog/tgi-multi-backend

thomwolf

posted an update about 1 month ago

Post

4868

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi

2 replies

clefourrier

authored a paper about 1 month ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 17

thomwolf

posted an update about 1 month ago

Post

1262

Exponentially growing number of open-source AI models over the course of the past 30 months – from a few thousands to over 1 million and more

Interactive data viz: huggingface/open-source-ai-year-in-review-2024

thomwolf

posted an update about 2 months ago

Post

1440

Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: https://aiworld.eu/embed/model/model/treemap
Discussion: huggingface/open-source-ai-year-in-review-2024

loubnabnl

posted an update about 2 months ago

Post

1905

Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?

thomwolf

posted an update about 2 months ago

Post

1688

Interesting long read from @evanmiller-anthropic on having a better founded statistical approach to Language Model Evaluations:
https://www.anthropic.com/research/statistical-approach-to-model-evals

Worth a read if you're into LLM evaluations!

Cc @clefourrier

1 reply

thomwolf

posted an update about 2 months ago

Post

1438

Very exciting new mistralai/Pixtral-Large-Instruct-2411 model from Mistral-AI

Impressive performances, huge congrats @patrickvonplaten @sgvaze @pandora-s @devendrachaplot @sophiamyang and team!

Very nice to have SOTA Multilingual OCR and Chart understanding in an open-weights model

thomwolf

posted an update 3 months ago

Post

4158

Parents in the 1990: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨

2 replies

clefourrier

authored 2 papers 7 months ago

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

Paper • 2404.05904 • Published Apr 8, 2024 • 8

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 187

thomwolf

authored a paper 7 months ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 90

craffel

authored a paper 7 months ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 90

hynky

authored a paper 7 months ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 90

loubnabnl

authored a paper 7 months ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 90

guipenedo

authored a paper 7 months ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 90

eliebak

posted an update 7 months ago

Post

1197

Wow, impressive 340B model by nvidia with a nice permissive license! 🚀 The technical report is full of insights and seems to use a different learning rate schedule than cosine, probably a variant of WSD. Hope to get more info on that! 👀

nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911

AI & ML interests

Recent Activity

Team members 12

HuggingFaceBR4's activity