Loubna Ben Allal

loubnabnl

https://loubnabnl.github.io/

AI & ML interests

SmolLMs, ML for code, data

Recent Activity

liked a dataset 5 days ago

data-agents/jupyter-agent-dataset

Viewer • Updated 4 days ago • 95.8k • 1.81k • 108

updated a collection 6 days ago

SmolLM 🤏

Collection

SmolLM models, datasets and demos • 11 items • Updated 6 days ago

liked a model about 1 month ago

unsloth/gpt-oss-20b-GGUF

Text Generation • 21B • Updated 18 days ago • 431k • 369

upvoted 4 changelogs about 1 month ago

Changelog

Inference Providers now fully support OpenAI-compatible API

Jul 18

• 88

Changelog

JSON Support in the Dataset Viewer

Jul 23

• 47

Changelog

Introducing HF Jobs: Run scalable compute jobs on Hugging Face

Jul 30

• 175

Changelog

Trending Papers

Jul 28

• 87

liked a dataset about 1 month ago

HuggingFaceTB/smollm-corpus

Viewer • Updated Sep 6, 2024 • 237M • 17.1k • 372

New activity in HuggingFaceTB/SmolLM3-3B-Base about 1 month ago

Thai, Japanese, Korea, and Vietnamese in SmolLM3?

#8 opened about 1 month ago by

wannaphong

liked 3 models about 1 month ago

New activity in HuggingFaceTB/SmolLM3-3B-checkpoints about 1 month ago

will open the "short sft" model?

#5 opened about 1 month ago by

leo98xh

commented on SmolLM3: smol, multilingual, long-context reasoner about 1 month ago

1. We used the Olmo2 pes2o dataset in allenai/olmo-mix-1124.
2. GitHub issues are from HuggingFaceTB/issues-kaggle-notebooks. Pull requests and Jupyter notebooks are part of The Stack v2, for which you will need the Software Heritage agreement. Otherwise, for notebooks you could use The Stack v1 Jupyter scripts, which are fairly similar, except that we remove any special tokens and clean up some Jupytext leftovers (like the metadata or the # - and # + lines).
3. Yes, the Kaggle notebooks are from there.
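
The Jupytext cleanup mentioned in point 2 could look roughly like the sketch below. This is only an illustration, not the code used for SmolLM3: the function name `strip_jupytext_leftovers` is hypothetical, and it assumes scripts in Jupytext's "light" format, where a commented YAML header sits between two `# ---` lines and cells are delimited by `# +` and `# -` markers, optionally followed by JSON metadata.

```python
import re

# Assumed marker shape: "# +" or "# -", optionally followed by JSON-style
# cell metadata such as '# + {"tags": []}'. Ordinary comments like
# "# - item" are deliberately left alone.
CELL_MARKER = re.compile(r"^# [+-](\s*\{.*\})?\s*$")


def strip_jupytext_leftovers(script: str) -> str:
    """Remove the commented YAML header and cell markers from a script."""
    lines = script.splitlines()

    # Drop a leading metadata header delimited by two "# ---" lines.
    if lines and lines[0].strip() == "# ---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "# ---":
                lines = lines[i + 1:]
                break

    # Drop the "# +" / "# -" cell delimiter lines.
    kept = [line for line in lines if not CELL_MARKER.match(line)]
    return "\n".join(kept).strip("\n") + "\n"
```

Special-token removal, which the comment also mentions, is not covered here because the tokens involved are not listed.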

updated a dataset about 1 month ago

HuggingFaceTB/smollm3-configs

Updated Aug 4 • 142 • 3

commented on SmolLM3: smol, multilingual, long-context reasoner about 1 month ago

You can find the latest pretraining configs in this folder: https://github.com/huggingface/smollm/blob/main/text/pretraining/smollm3/

We didn't do further processing of the datasets (except for filtering on number of tokens for the context extension phase and formatting Q&A and instruct data for the decay, as explained here). You can find the datasets in this collection.
Some notes (highlighted in the configs on the GitHub repo):

We use FineWeb2-HQ for all the languages below except Hindi, Thai, and Korean, for which we use FineWeb2.
We don't use Stack-Edu in phase 1 for code (despite the s3 bucket name); that's StarCoder2Data (Stack v2).

Not sure we can share the s3 buckets but all the s3 datasets are already tokenized anyway.
But let us know if you have specific questions about some datasets, we're happy to elaborate.
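
As a rough illustration of the "filtering on number of tokens" step mentioned above, a minimal sketch follows. The direction of the filter (keeping documents at or above a minimum length), the threshold, and the names `filter_by_token_count`, `count_tokens` and `min_tokens` are all assumptions made for the example; the actual settings are in the linked pretraining configs.

```python
from typing import Callable, Iterable, Iterator


def filter_by_token_count(
    docs: Iterable[str],
    count_tokens: Callable[[str], int],
    min_tokens: int,
) -> Iterator[str]:
    """Yield only the documents with at least `min_tokens` tokens."""
    for doc in docs:
        if count_tokens(doc) >= min_tokens:
            yield doc


# Stand-in token counter for the example: whitespace splitting.
# A real pipeline would count tokens with the model's tokenizer.
def whitespace_count(text: str) -> int:
    return len(text.split())


long_docs = list(
    filter_by_token_count(["a b c", "a", "a b c d e"], whitespace_count, 3)
)
```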

published a dataset about 1 month ago

HuggingFaceTB/stackexchange_2025_md

Updated Mar 25 • 2.22k • 1

commented on SmolLM3: smol, multilingual, long-context reasoner about 1 month ago

We used the Llama 3.2 tokenizer as is (except for removing bos_token). Here are some details about how Meta built the tokenizer (from the Llama 3 paper):

We use a vocabulary with 128K tokens. Our token vocabulary combines 100K tokens from the tiktoken tokenizer with 28K additional tokens to better support non-English languages. Compared to the Llama 2 tokenizer, our new tokenizer improves compression rates on a sample of English data from 3.17 to 3.94 characters per token. This enables the model to “read” more text for the same amount of training compute. We also found that adding 28K tokens from select non-English languages improved both compression ratios and downstream performance, with no impact on English tokenization.
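
The compression rate quoted above is measured in characters per token. A minimal sketch of that metric is below; the helper name `chars_per_token` is hypothetical, and the toy tokenizers stand in for a real one so the example has no dependencies.

```python
from typing import Callable, Iterable, List


def chars_per_token(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Average number of characters covered by one token over a sample."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("the tokenizer produced no tokens for this sample")
    return total_chars / total_tokens


sample = ["hello world", "abc"]

# Whitespace tokenizer: 14 characters over 3 tokens.
word_level = chars_per_token(sample, str.split)

# Character-level tokenizer: exactly one character per token.
char_level = chars_per_token(sample, list)
```

A higher value means each token covers more text, which is why a better-compressing tokenizer lets the model see more text for the same number of training tokens.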

liked a model about 1 month ago

zai-org/GLM-4.5

Text Generation • 358B • Updated 28 days ago • 94.2k • 1.29k

commented on SmolLM3: smol, multilingual, long-context reasoner about 1 month ago

Yes the wandb logs are here: https://wandb.ai/huggingface/SmolLM3-training-logs . They were released along with the intermediate checkpoints https://x.com/eliebakouch/status/1947314495103160458
