Daniel van Strien PRO
davanstrien
AI & ML interests
Machine Learning Librarian
Recent Activity
updated a dataset about 1 hour ago: data-is-better-together/fineweb-c-progress
updated a dataset about 1 hour ago: librarian-bots/model_cards_with_metadata
updated a dataset about 2 hours ago: librarian-bots/dataset_cards_with_metadata
Maths reasoning
Maths reasoning datasets found using https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search
sentence-transformers-from-synthetic-data
Example of using distilabel to generate synthetic triplet data for fine-tuning a Sentence Transformer model
haiku
🌸 A collection of synthetic datasets built to help open language models write better haikus through the use of DPO
Probably DPO datasets
A collection of datasets that probably support DPO
query-to-hub-datasets-viewer-project
hub-tldr
Creating a smol model for tl;dr-ing the hub
- davanstrien/Smol-Hub-tldr
  Text Generation • 0.4B • Updated • 85 • 9
- Semantic Hugging Face Hub Search
  Running • 79
  🔎 Search and find similar datasets
- davanstrien/hub-tldr-dataset-summaries-llama
  Viewer • Updated • 5k • 53 • 1
- davanstrien/hub-tldr-model-summaries-llama
  Viewer • Updated • 5k • 141 • 1
synthetic-data-generation-demos
A collection of demos for various approaches to synthetic data generation
Synthetic (text) Dataset Generation
Papers about synthetic dataset generation
- Better Synthetic Data by Retrieving and Transforming Existing Datasets
  Paper • 2404.14361 • Published • 2
- Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
  Paper • 2403.04190 • Published • 1
- Best Practices and Lessons Learned on Synthetic Data for Language Models
  Paper • 2404.07503 • Published • 32
- A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
  Paper • 2404.14445 • Published
Historic language modeling
This collection contains models, datasets and spaces related to historic language models, i.e. language models trained on historic data
Image Preference Optimization Datasets
Datasets suitable for Image Preference Optimization based on their column names
Reasoning Required?
- davanstrien/reasoning-required
  Viewer • Updated • 5k • 123 • 19
- davanstrien/ModernBERT-based-Reasoning-Required
  Text Classification • 0.1B • Updated • 29 • 7
- davanstrien/fineweb-with-reasoning-scores-and-topics
  Viewer • Updated • 10k • 16 • 1
- davanstrien/fine-reasoning-questions
  Viewer • Updated • 244 • 130 • 18