---
datasets:
- unicamp-dl/mmarco
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
widget: []
base_model:
- BAAI/bge-m3
---

# BGE-M3 ZH mMARCO/v2 Transliterated Queries tokenised with Anserini

This is a [BGE-M3](https://huggingface.co/BAAI/bge-m3) model post-trained on the Chinese dataset from mMARCO/v2. The queries were transliterated from Chinese to English using [uroman](https://github.com/isi-nlp/uroman) and additionally tokenised with [pyterrier_anserini](https://github.com/seanmacavaney/pyterrier-anserini/tree/main/pyterrier_anserini).

The model was used for the SIGIR 2025 short paper: *Lost in Transliteration: Bridging the Script Gap in Neural IR*.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity

## Training Details

### Framework Versions

- Python: 3.10.13
- Sentence Transformers: 3.1.1
- Transformers: 4.45.1
- PyTorch: 2.4.1
- Accelerate: 0.34.2
- Datasets: 3.0.1
- Tokenizers: 0.20.3