Andrea Soria

asoria

AI & ML interests

Maintainer of 🤗Datasets: Data processing

Posts 4

Post

2184

🚀 Exploring Topic Modeling with BERTopic 🤖

When you come across an interesting dataset, you often wonder:
Which topics frequently appear in these documents? 🤔
What is this data really about? 📊

Topic modeling helps answer these questions by identifying recurring themes within a collection of documents. This process enables quick and efficient exploratory data analysis.

I’ve been working on an app that leverages BERTopic, a flexible framework for topic modeling. BERTopic's strength is its modularity: you can swap any component for your preferred algorithm. It also scales to large datasets, because models trained on separate chunks of data can be combined with BERTopic.merge_models. 🔗
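As a rough illustration of the merging idea, here is a minimal sketch that fits one model per chunk of documents and then merges them. The helper names `chunked` and `fit_in_chunks`, the chunk size, and the `min_similarity` value are assumptions made for this example, not details taken from the app.

```python
from typing import Iterator, List, Sequence


def chunked(docs: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `docs` with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(docs), size):
        yield list(docs[start:start + size])


def fit_in_chunks(docs: Sequence[str], chunk_size: int = 10_000,
                  min_similarity: float = 0.7):
    """Fit one BERTopic model per chunk, then merge them into one model.

    chunk_size and min_similarity are illustrative values (assumptions).
    """
    # Imported lazily so the chunking helper works without bertopic installed.
    from bertopic import BERTopic

    models = [BERTopic().fit(chunk) for chunk in chunked(docs, chunk_size)]
    if len(models) == 1:
        return models[0]
    # Topics from later models are added to the first model only when they
    # are not similar enough to a topic the first model already has.
    return BERTopic.merge_models(models, min_similarity=min_similarity)
```

Each chunk needs enough documents for clustering to find topics, so very small chunk sizes are unlikely to give useful results.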

🔍 How do we make this work?
Here’s the stack we’re using:

📂 Data Source ➡️ Hugging Face datasets with DuckDB for retrieval
🧠 Text Embeddings ➡️ Sentence Transformers (all-MiniLM-L6-v2)
⚡ Dimensionality Reduction ➡️ RAPIDS cuML UMAP for GPU-accelerated performance
🔍 Clustering ➡️ RAPIDS cuML HDBSCAN for fast clustering
✂️ Tokenization ➡️ CountVectorizer
🔧 Representation Tuning ➡️ KeyBERTInspired + Hugging Face Inference Client with Meta-Llama-3-8B-Instruct
🌍 Visualization ➡️ Datamapplot library
Check out the space and see how you can quickly generate topics from your dataset: datasets-topics/topics-generator
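The stack above could be wired together roughly as follows. This is a sketch under stated assumptions, not the Space's actual code: the function names `parquet_query` and `build_topic_model` are hypothetical, the hyperparameters are placeholder values, and the `@~parquet/<config>/<split>` path layout assumes the dataset has an auto-converted Parquet branch on the Hub. The LLM-based representation step and the Datamapplot visualization are left out to keep it short.

```python
def parquet_query(repo_id: str, config: str, split: str,
                  column: str, limit: int) -> str:
    """Build a DuckDB query over a dataset's auto-converted Parquet files.

    Assumes the hf:// path layout <repo>@~parquet/<config>/<split>/*.parquet.
    """
    path = f"hf://datasets/{repo_id}@~parquet/{config}/{split}/*.parquet"
    return f'SELECT "{column}" FROM \'{path}\' LIMIT {int(limit)}'


def load_docs(query: str, column: str) -> list:
    """Run the query with DuckDB and return the text column as a list."""
    import duckdb  # lazy import: only needed when data is actually fetched

    frame = duckdb.sql(query).df()
    return frame[column].dropna().astype(str).tolist()


def build_topic_model(use_gpu: bool = False):
    """Assemble a BERTopic model from the components listed in the post."""
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer

    if use_gpu:
        # RAPIDS cuML implementations, as named in the post.
        from cuml.cluster import HDBSCAN
        from cuml.manifold import UMAP
    else:
        # CPU fallbacks (an assumption, so the sketch runs without a GPU).
        from hdbscan import HDBSCAN
        from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
        umap_model=UMAP(n_components=5, n_neighbors=15, min_dist=0.0),
        hdbscan_model=HDBSCAN(min_cluster_size=15, prediction_data=True),
        vectorizer_model=CountVectorizer(stop_words="english"),
        representation_model=KeyBERTInspired(),
    )
```

A typical use would be to build the query, load the documents with `load_docs`, and then call `build_topic_model().fit_transform(docs)` to get a topic assignment per document.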

Powered by @MaartenGr - BERTopic

Articles 7

Article

4

📄 PDF Support in the Hugging Face Dataset Viewer

Collections 3

Spaces 10

AlfredAgent

First Agent Template

Get current time in any timezone

Auto notebook creator

Generate a Jupyter notebook for a Hugging Face dataset

GenAI notebook creator

Dataset Insights Explorer

Analyze datasets and generate insights

Datasets text2sql

Generate SQL queries from text for Hugging Face datasets

Models 12

asoria/bertopic_github_dataset_viewer_issues

Text Classification • Updated Sep 26, 2024 • 1

asoria/transformers_issues_topics

Text Classification • Updated Sep 26, 2024 • 7

asoria/facebook-opt-350m-asoria-love-poems

Text Generation • 0.3B • Updated Sep 24, 2024 • 2

asoria/facebook-opt-350m-asoria-english-quotes-text

0.3B • Updated Sep 18, 2024 • 2

asoria/facebook-opt-350m-asoria-bolivian-recipes

Text Generation • 0.3B • Updated Sep 18, 2024 • 1

asoria/outputs

Text Generation • 0.3B • Updated Sep 18, 2024 • 1

asoria/bert-base-uncased-ag-news

Text Classification • 0.1B • Updated Sep 17, 2024 • 3

asoria/gpt2-tweet_sentiment_extraction

Text Classification • 0.1B • Updated Sep 17, 2024 • 3

asoria/google-gemma-7b-abirate-english_quotes

Text Generation • Updated Sep 16, 2024 • 5

asoria/facebook-opt-350m-imdb

Text Generation • 0.3B • Updated Sep 16, 2024 • 7

Datasets 64

asoria/big_pdf

Viewer • Updated May 30, 2025 • 1.94k • 16

asoria/pdf_folder_2

Viewer • Updated May 23, 2025 • 7 • 19

asoria/pdf1

Viewer • Updated May 23, 2025 • 25 • 19

asoria/image_folder

Viewer • Updated May 22, 2025 • 2 • 12

asoria/audio_test

Viewer • Updated May 21, 2025 • 3 • 13

asoria/dataset-notebook-creator-content

Updated Apr 30, 2025 • 7.35k • 1

asoria/abstracts_and_tweets

Viewer • Updated Jan 21, 2025 • 100 • 14

asoria/children-stories-dataset

Viewer • Updated Jan 7, 2025 • 10 • 14 • 1

asoria/webdataset_test

Viewer • Updated Dec 17, 2024 • 110 • 3

asoria/test_repo

Viewer • Updated Dec 13, 2024 • 22 • 24
