---
language:
- tr
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: jhu-clsp/mmBERT-small
tags:
- quality-classifier
- data-filtering
- pretraining
---


# mmBERT Turkish Quality Classifier

A text quality classifier for Turkish pretraining data, fine-tuned from [mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small).

This model implements the FineWeb2-HQ approach ([Messmer et al., 2025](https://arxiv.org/abs/2502.10361)) but uses mmBERT as the encoder for improved Turkish understanding.

## Usage

```python
from transformers import pipeline

# Load the classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="AdaMLLab/mmBERT-Turkish-Quality-Classifier")

# The input is Turkish text; this placeholder means "Turkish text here"
result = classifier("Türkçe metin burada")
```

## Citation

```bib
@misc{alrashed2025mixminmatch,
  title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
  author={Sultan Alrashed and Francesco Orabona},
  year={2025},
  eprint={2512.18834v2},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.18834v2},
}
```
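
## Example: filtering a batch

Since the classifier is meant for filtering pretraining data, a minimal sketch of a batch filter may help. It assumes the classifier is a callable that takes a list of strings and returns one `{"label": ..., "score": ...}` dict per text, which is what a `transformers` text-classification pipeline returns for list input. The helper `filter_by_quality`, the stand-in `fake_classifier`, the label names `LABEL_0`/`LABEL_1`, and the 0.5 threshold are all hypothetical and not taken from this model card.

```python
from typing import Callable, Dict, List, Sequence

Prediction = Dict[str, object]


def filter_by_quality(
    texts: Sequence[str],
    classify: Callable[[List[str]], List[Prediction]],
    keep_label: str,
    min_score: float = 0.5,
) -> List[str]:
    """Return the texts whose predicted label is keep_label with score >= min_score."""
    batch = list(texts)
    if not batch:
        return []
    predictions = classify(batch)
    return [
        text
        for text, pred in zip(batch, predictions)
        if pred["label"] == keep_label and float(pred["score"]) >= min_score
    ]


def fake_classifier(batch: List[str]) -> List[Prediction]:
    # Hypothetical stand-in so the sketch runs offline: it labels texts
    # longer than 20 characters as "LABEL_1". It is NOT the real model.
    return [
        {"label": "LABEL_1" if len(text) > 20 else "LABEL_0", "score": 0.9}
        for text in batch
    ]


docs = [
    "kısa metin",  # "short text"
    "Bu, daha uzun ve bilgilendirici bir Türkçe paragraftır.",  # a longer paragraph
]
kept = filter_by_quality(docs, fake_classifier, keep_label="LABEL_1")
```

To use the real model, pass the `classifier` pipeline from the Usage section in place of `fake_classifier`. The actual label names should be read from `classifier.model.config.id2label` rather than assumed, and the threshold should be tuned on held-out data.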