# Model Card for BiMaTE-8B

⚠️ This is a temporary repository for our EMNLP 2025 demo paper submission.
The project is currently hosted here for review and demonstration purposes.
It will be migrated to the official organization repository once that becomes available.
Until then, all code, models, and documentation are maintained here.

GitHub: LMT
## Model Details

### Model Description
BiMaTE (Bi-Centric Machine Translation Expert) is a large-scale, LLM-based, Chinese-English-Centric multilingual translation model designed to facilitate high-quality translation between Chinese, English, and numerous other global languages.
- Model type: Causal language model for machine translation
- Languages: 60
- Translation directions: 234
- Base model: Qwen3-8B-Base
- Training strategy:
  - Monolingual continual pretraining (CPT): 30B tokens
  - Mixed continual pretraining (CPT): 60B tokens (monolingual + bilingual)
  - Supervised finetuning (SFT): post-training on smaller-scale, high-quality translation data
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "luoyingfeng/BiMaTE-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Translate the following text from English into Chinese.\nEnglish: The concept came from China where plum blossoms were the flower of choice.\nChinese: "
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512, num_beams=5, do_sample=False)
# Keep only the newly generated tokens by stripping the prompt prefix.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)

print("response:", response)
```
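When translating a batch of sentences, each row of `generated_ids` still begins with its (left-padded) prompt tokens, so the slicing step above is applied per row. A minimal sketch of that trimming logic, using plain token-id lists so it runs without the model (`trim_prompt_tokens` is a hypothetical helper, not part of the released code):

```python
def trim_prompt_tokens(generated_ids, input_ids):
    """Drop each sequence's prompt prefix, keeping only newly generated tokens.

    generated_ids: list of full output token-id sequences (prompt + continuation).
    input_ids: list of the corresponding (padded) prompt token-id sequences.
    """
    return [full[len(prompt):] for full, prompt in zip(generated_ids, input_ids)]

# Toy example with dummy token ids (no model needed):
prompts = [[1, 2, 3], [4, 5]]
outputs = [[1, 2, 3, 10, 11], [4, 5, 20, 21, 22]]
print(trim_prompt_tokens(outputs, prompts))  # [[10, 11], [20, 21, 22]]
```

With `padding_side='left'`, all rows of `model_inputs.input_ids` share one padded length, so the same slice offset works for every row; the trimmed id lists can then be passed to `tokenizer.batch_decode`.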
## Supported Languages
| Resource Tier | Languages |
|---|---|
| High-resource languages (13) | Arabic(ar), English(en), Spanish(es), German(de), French(fr), Italian(it), Japanese(ja), Dutch(nl), Polish(pl), Portuguese(pt), Russian(ru), Turkish(tr), Chinese(zh) |
| Medium-resource languages (18) | Bulgarian(bg), Bengali(bn), Czech(cs), Danish(da), Modern Greek(el), Persian(fa), Finnish(fi), Hindi(hi), Hungarian(hu), Indonesian(id), Korean(ko), Norwegian(no), Romanian(ro), Slovak(sk), Swedish(sv), Thai(th), Ukrainian(uk), Vietnamese(vi) |
| Low-resource languages (29) | Amharic(am), Azerbaijani(az), Tibetan(bo), Modern Hebrew(he), Croatian(hr), Armenian(hy), Icelandic(is), Javanese(jv), Georgian(ka), Kazakh(kk), Central Khmer(km), Kirghiz(ky), Lao(lo), Mongolian(mn), Marathi(mr), Malay(ms), Burmese(my), Nepali(ne), Pashto(ps), Sinhala(si), Swahili(sw), Tamil(ta), Telugu(te), Tajik(tg), Tagalog(tl), Uighur(ug), Urdu(ur), Uzbek(uz), Yue Chinese(yue) |
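Assuming the Quickstart's prompt template generalizes to all supported directions, a prompt for any language pair can be built from the code-to-name mapping in the table above. A minimal sketch (`build_prompt` and `LANG_NAMES` are hypothetical helpers; the mapping here is abbreviated to a few entries and should be extended with the full table):

```python
# Abbreviated code -> name mapping; extend with the full table above.
LANG_NAMES = {
    "en": "English", "zh": "Chinese", "de": "German",
    "ja": "Japanese", "sw": "Swahili", "yue": "Yue Chinese",
}

def build_prompt(src: str, tgt: str, text: str) -> str:
    """Build a translation prompt in the Quickstart's template for one direction."""
    src_name, tgt_name = LANG_NAMES[src], LANG_NAMES[tgt]
    return (f"Translate the following text from {src_name} into {tgt_name}.\n"
            f"{src_name}: {text}\n{tgt_name}: ")

print(build_prompt("de", "zh", "Guten Morgen!"))
```

The resulting string is then wrapped in a `{"role": "user", "content": ...}` message and passed through `apply_chat_template` exactly as in the Quickstart.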