---
license: cc-by-sa-4.0
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
---

# Dactory models

## Model description

This is a set of fastText-based models to evaluate the quality and domain of text in the 24 official languages of the European Union. The main usage of these models is to preprocess data from the Common Crawl project to obtain a training set for large language models. These models can be used as part of the dactory pipeline, released by Kyutai to process Common Crawl.

There is one model per language, and each model is a multilabel classifier with the following eight labels: random webpages (`rand`), Wikipedia articles (`wiki`), textbooks (`books`), scientific articles from pes2o (`science`), and Stack Exchange websites related to STEM (`stem`), humanities (`hum`), pop culture (`pop`), and life advice (`life`). The models were trained to distinguish lines sampled uniformly from these different sources. To get training data for the languages other than English, we translated the English training set with MADLAD, except for the `rand` and `wiki` labels, for which data is readily available in all languages.

* **Model name**: Dactory models
* **Languages**: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish
* **Developed by**: Kyutai
* **Model type**: Classification
* **License**: CC-BY-SA 4.0
* **Version**: 1.0
* **Released**: April 2025

## Use cases

These models can be used to evaluate the quality of text by estimating how similar it is to text from high-quality sources. In particular, one can take the score corresponding to the `rand` label as an estimate of the text quality. They can also be used to organize a collection of documents by similarity to the different data sources used to train the models. For example, a large language model trained mostly on documents labeled as `books` will perform well on multiple-choice Q&A benchmarks such as MMLU, while an LLM trained mostly on documents labeled as `wiki` will perform well on general-knowledge Q&A benchmarks such as TriviaQA.

## How to use

You can download the files locally by using the [huggingface-hub Python package](https://huggingface.co/docs/hub/en/models-downloading). For example:

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download the English model from the Hub and load it with fastText.
local_path = hf_hub_download(repo_id="kyutai/dactory-models", filename="filter_en.bin")
model = fasttext.load_model(local_path)

# Returns the top predicted label and its probability.
print(model.predict("A computer scientist is a scientist who specializes in the academic study of computer science."))
```
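
To inspect the scores of all eight labels rather than just the top one, you can ask fastText for every label and build a label-to-probability map. The sketch below extends the example above; the `__label__` prefix it strips is an assumption based on fastText's default labeling convention, and the use of the `rand` score as a quality estimate follows the Use cases section.

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download and load the English model, as in the example above.
local_path = hf_hub_download(repo_id="kyutai/dactory-models", filename="filter_en.bin")
model = fasttext.load_model(local_path)

text = "A computer scientist is a scientist who specializes in the academic study of computer science."

# k=-1 returns all labels; threshold=0.0 keeps even low-probability ones.
labels, probs = model.predict(text, k=-1, threshold=0.0)

# Map each label to its probability. The "__label__" prefix is fastText's
# default convention and is assumed to be used by these models.
scores = {label.removeprefix("__label__"): float(prob) for label, prob in zip(labels, probs)}
print(scores)

# Per the model card, the score of the `rand` label can serve as an
# estimate of the text quality.
print("rand score:", scores.get("rand"))
```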
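
Since there is one model per language, you will typically load the model matching the language of your text. The helper below is a minimal sketch: the `filter_{lang}.bin` filename pattern with two-letter ISO codes is an assumption extrapolated from the English example (`filter_en.bin`), so check the repository's file list to confirm it.

```python
import fasttext
from huggingface_hub import hf_hub_download

def load_dactory_model(lang):
    """Load the dactory classifier for a two-letter language code.

    The filter_{lang}.bin filename pattern is assumed from the English
    example (filter_en.bin); verify it against the repository file list.
    """
    local_path = hf_hub_download(
        repo_id="kyutai/dactory-models",
        filename=f"filter_{lang}.bin",
    )
    return fasttext.load_model(local_path)

# Example: classify a French sentence with the French model.
model_fr = load_dactory_model("fr")
print(model_fr.predict("Un informaticien est un scientifique spécialisé en informatique."))
```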