Datasets Overview

Datasets on the Hub

The Hugging Face Hub hosts a large number of community-curated datasets for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the dataset card, many datasets, such as GLUE, include a Dataset Viewer to showcase the data.

Each dataset is a Git repository that contains the data required to generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the Data files Configuration page. Following the supported repository structure ensures that the dataset page on the Hub has a Dataset Viewer.
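To illustrate how a repository's splits surface in code, here is a minimal sketch that assumes the third-party `datasets` library is installed and that GLUE is published under the repository ID `nyu-mll/glue` with an `mrpc` configuration. The helper names `summarize_splits` and `load_glue_mrpc` are hypothetical and exist only for this example.

```python
def summarize_splits(dataset_dict):
    """Map each split name to its number of rows.

    Works on any mapping of split name -> sized collection, which is the
    shape that `datasets.load_dataset` returns when no split is requested.
    """
    return {name: len(split) for name, split in dataset_dict.items()}


def load_glue_mrpc():
    """Download one GLUE configuration from the Hub (needs network access).

    Assumption: the repository ID and configuration name below are
    illustrative; replace them with the dataset you want to load.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("nyu-mll/glue", "mrpc")


# Example usage (downloads data, so it is left commented out):
# print(summarize_splits(load_glue_mrpc()))
```

The import sits inside `load_glue_mrpc` so that the summary helper can be used without the library installed.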

Search for datasets

As with models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the main datasets page. You can filter the results by language, task, and license to find a dataset that’s right for you.
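The same search can be run programmatically. The sketch below assumes the third-party `huggingface_hub` library and its `list_datasets` function, and it assumes that filters are expressed as Hub tag strings such as `language:en` or `task_categories:translation`. The helpers `build_filters` and `search_datasets` are hypothetical names introduced for this example.

```python
def build_filters(language=None, task=None, license=None):
    """Turn optional facets into Hub tag strings, skipping unset ones."""
    tags = []
    if language:
        tags.append(f"language:{language}")
    if task:
        tags.append(f"task_categories:{task}")
    if license:
        tags.append(f"license:{license}")
    return tags


def search_datasets(query, limit=5, **facets):
    """Return the IDs of datasets matching a text query and optional facets.

    Needs network access. Assumption: `list_datasets` accepts `search`,
    `filter`, and `limit` keyword arguments, as in recent releases.
    """
    from huggingface_hub import list_datasets  # third-party

    tags = build_filters(**facets)
    results = list_datasets(search=query, filter=tags or None, limit=limit)
    return [info.id for info in results]


# Example usage (queries the Hub, so it is left commented out):
# print(search_datasets("news", language="en", task="translation"))
```

Passing `None` instead of an empty list keeps an unfiltered search equivalent to calling `list_datasets` with no `filter` argument at all.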

Privacy

Since datasets are repositories, you can toggle their visibility between private and public through the Settings tab. If a dataset is owned by an organization, the privacy settings apply to all the members of the organization.
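Visibility can also be changed from a script instead of the Settings tab. The following is a sketch under assumptions: it relies on `HfApi.update_repo_settings` from the third-party `huggingface_hub` library accepting a `private` argument, which holds for recent releases (older releases exposed `update_repo_visibility` instead). The function `set_dataset_visibility` and the repository ID `my-org/my-dataset` are hypothetical.

```python
def set_dataset_visibility(api, repo_id, private):
    """Make a dataset repository private (True) or public (False).

    `api` is expected to behave like `huggingface_hub.HfApi`; it is passed
    in so the function can be exercised without network access.
    """
    api.update_repo_settings(
        repo_id=repo_id,
        private=private,
        repo_type="dataset",  # without this the Hub assumes a model repo
    )


# Example usage (requires a token with write access to the repository):
# from huggingface_hub import HfApi
# set_dataset_visibility(HfApi(), "my-org/my-dataset", private=True)
```

Setting `repo_type="dataset"` explicitly matters because the library treats a repository ID as a model repository by default.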