Several methods/models have recently been shared to generate synthetic data from minimal or no initial seeds, essentially creating data directly from raw text.
IMO, approaches like these, which rely on smaller models to generate the data, are quite valuable for scaling up synthetic data and for democratizing the creation of domain-specific synthetic datasets.
@davanstrien Thanks for the wonderful demos! Just wanted to highlight that we recently released Bonito, an open-source model that converts a user's raw text into an instruction tuning dataset. It would be awesome if you could add our model to the collection! Happy to help :)