HPLT

community

hplt_eu

AI & ML interests

Web as a corpus, Large Language Models, Machine Translation, Language Technologies, Natural Language Processing, Internet Archive, CommonCrawl

Recent Activity

bhavitvyamalik updated a dataset 11 days ago

HPLT/DocHPLT

pinzhenchen updated a dataset 13 days ago

HPLT/DocHPLT

Dayyyan authored a paper 20 days ago

DocHPLT: A Massively Multilingual Document-Level Translation Dataset

davanstrien

posted an update 5 days ago

Post

347

I fine-tuned a smol VLM to generate specialized art history metadata!

davanstrien/iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with TRL + HF Jobs - single UV script, no GPU needed!

Space to explore predictions on a test set: davanstrien/iconclass-predictions

Blog soon!

bhavitvyamalik

updated a dataset 11 days ago

HPLT/DocHPLT

Viewer • Updated 11 days ago • 124M • 178 • 8

pinzhenchen

updated a dataset 13 days ago

HPLT/DocHPLT

Viewer • Updated 11 days ago • 124M • 178 • 8

Dayyyan

authored a paper 20 days ago

DocHPLT: A Massively Multilingual Document-Level Translation Dataset

Paper • 2508.13079 • Published 21 days ago • 1

BramVanroy

posted an update 26 days ago

Post

578

By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get those from a more reliable source that provides better-parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
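
For readers who prefer to apply similar filters themselves, here is a minimal sketch of the two filtering steps described above, using plain Python dictionaries. The field names (`found_in_fineweb`, `license_disagreement`, `license`, `is_wikipedia`) are hypothetical and chosen only for illustration; check the dataset pages for the actual schema. Treating the C5r filter as applied on top of the C5f filter is also an assumption.

```python
from typing import Dict, Iterable, List


def keep_fine(doc: Dict) -> bool:
    """C5f-style check: keep documents also present in FineWeb or FineWeb-2.

    `found_in_fineweb` is a hypothetical field name.
    """
    return bool(doc.get("found_in_fineweb"))


def keep_recommended(doc: Dict) -> bool:
    """C5r-style check: stricter filtering (assumed to build on keep_fine)."""
    if not keep_fine(doc):
        return False
    if doc.get("license_disagreement"):  # hypothetical field
        return False
    # A license slug such as "cc-by-nc-4.0" contains the "nc" component.
    if "nc" in str(doc.get("license", "")).lower().split("-"):
        return False
    if doc.get("is_wikipedia"):  # hypothetical field
        return False
    return True


def filter_docs(docs: Iterable[Dict], strict: bool = False) -> List[Dict]:
    """Return the documents that pass the chosen filter."""
    check = keep_recommended if strict else keep_fine
    return [doc for doc in docs if check(doc)]
```

Calling `filter_docs(docs, strict=True)` on a list of such dictionaries keeps only the documents that pass every check.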

Villekom

authored 2 papers 2 months ago

Got Compute, but No Data: Lessons From Post-training a Finnish LLM

Paper • 2503.09407 • Published Mar 12 • 1

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Paper • 2503.10267 • Published Mar 13 • 1

davanstrien

posted an update 3 months ago

Post

3564

Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
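
As an illustration of the parameter-count idea in the list above, here is a minimal sketch that counts parameters from a safetensors header alone, without loading any tensor data. It relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header that maps tensor names to `dtype`, `shape` and `data_offsets`). The function name `count_parameters` is made up for this example, and this is not necessarily how the tool itself is implemented.

```python
import json
import struct
from math import prod


def count_parameters(blob: bytes) -> int:
    """Count parameters from the leading bytes of a safetensors file.

    Only the header is read, so `blob` may be just the start of the file
    as long as it contains the complete JSON header.
    """
    # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"
    )
```

Because only the header is needed, the count can be obtained by fetching the first few kilobytes of a file rather than the full weights.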

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp

1 reply

tiedeman

authored a paper 3 months ago

Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

Paper • 2506.00469 • Published May 31 • 2

jisx

authored 8 papers 3 months ago

Towards Interpretable Mental Health Analysis with Large Language Models

Paper • 2304.03347 • Published Apr 6, 2023

A Survey on Knowledge Graphs: Representation, Acquisition and Applications

Paper • 2002.00388 • Published Feb 2, 2020 • 1

Suicidal Ideation and Mental Disorder Detection with Attentive Relation Networks

Paper • 2004.07601 • Published Apr 16, 2020

A New Massive Multilingual Dataset for High-Performance Language Technologies

Paper • 2403.14009 • Published Mar 20, 2024 • 1

Lucky 52: How Many Languages Are Needed to Instruction Fine-Tune Large Language Models?

Paper • 2404.04850 • Published Apr 7, 2024

GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Paper • 2504.04155 • Published Apr 5 • 1

Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

Paper • 2504.04152 • Published Apr 5 • 1

Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

Paper • 2506.00469 • Published May 31 • 2

BramVanroy

posted an update 4 months ago

Post

3488

📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, to only documents that are Creative Commons-licensed such as cc-by-4.0 or public domain cc0. At this stage 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
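
To make the idea concrete, here is a minimal sketch of this kind of license detection using only Python's standard library. It is not the C5 implementation (see the software repository for that). The class and function names, the regular expression, and the way license names are normalized are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List

# Matches URLs such as creativecommons.org/licenses/by/4.0/
CC_URL = re.compile(
    r"creativecommons\.org/(?:licenses|publicdomain)/([a-z0-9\-]+)(?:/(\d\.\d))?",
    re.IGNORECASE,
)


class CCLicenseFinder(HTMLParser):
    """Collect CC license links from <a>, <link> and <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.in_head = False
        self.found: List[Dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        attributes = dict(attrs)
        if tag in ("a", "link"):
            candidate = attributes.get("href")
        elif tag == "meta":
            candidate = attributes.get("content")
        else:
            candidate = None
        match = CC_URL.search(candidate) if candidate else None
        if match:
            slug, version = match.group(1).lower(), match.group(2)
            name = "cc0" if slug == "zero" else f"cc-{slug}"
            if version:
                name = f"{name}-{version}"
            self.found.append({"license": name, "in_head": self.in_head})

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_cc_licenses(html: str) -> List[Dict]:
    """Return every CC license reference found in the page."""
    finder = CCLicenseFinder()
    finder.feed(html)
    return finder.found


def licenses_disagree(found: List[Dict]) -> bool:
    """True if the page references more than one distinct license."""
    return len({item["license"] for item in found}) > 1
```

The `in_head` flag and `licenses_disagree` correspond loosely to the two example data fields quoted above.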

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered for quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some indication of quality, a dataset field describes whether a document is included in the FineWeb(-2) datasets, which are of high quality.

🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!

1 reply

davanstrien

posted an update 5 months ago

Post

2326

Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)

davanstrien

posted an update 5 months ago

Post

1739

I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding reasoning dataset domain coverage.
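
The filter-then-generate workflow can be sketched roughly as follows. This is an illustrative outline only: `cheap_score` and `expensive_generate` are placeholder callables standing in for a small classifier and an LLM call, and the threshold of 3 on the 0-4 scale is an arbitrary assumption.

```python
from typing import Callable, Iterable, List, Tuple


def two_stage_curate(
    texts: Iterable[str],
    cheap_score: Callable[[str], int],
    expensive_generate: Callable[[str], str],
    min_score: int = 3,
) -> List[Tuple[str, int, str]]:
    """Score every text cheaply, then call the expensive step only on
    texts whose score reaches `min_score`."""
    results = []
    for text in texts:
        score = cheap_score(text)
        if score >= min_score:
            results.append((text, score, expensive_generate(text)))
    return results


def keyword_score(text: str) -> int:
    """Toy stand-in for a small scoring model: counts reasoning cue words
    and caps the result at 4 to mimic a 0-4 scale."""
    cues = ("because", "therefore", "however", "implies")
    return min(4, sum(text.lower().count(cue) for cue in cues))
```

In practice the scorer would be a small fine-tuned model and the generator an LLM; the point of the sketch is only that the costly call happens after the cheap filter.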

Team members 28