Abstract
We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.
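As a rough illustration of the idea behind XLDA (the abstract names the mechanism but does not spell out its details), the sketch below assumes it amounts to relaxing the usual per-document attention mask during packed pretraining, so that tokens in a target-language document can also attend to a paired English document packed into the same sequence. The function name `xlda_style_mask`, the `doc_ids`/`pair_ids` layout, and the pairing convention are illustrative assumptions, not the paper's implementation.

```python
import torch

def xlda_style_mask(doc_ids: torch.Tensor, pair_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean causal attention mask over one packed sequence.

    doc_ids[i]  -- index of the packed document that token i belongs to
    pair_ids[i] -- id shared by a cross-lingual document pair (e.g. an English
                   document and its Korean counterpart); -1 if unpaired

    Standard document packing only lets a token attend within its own document.
    An XLDA-style mask additionally lets tokens attend to the paired document,
    so target-language tokens can condition on the English source text.
    """
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(1) == doc_ids.unsqueeze(0)
    # Rows are queries, columns are keys; a query with pair_id == -1 has no pair.
    same_pair = (pair_ids.unsqueeze(1) == pair_ids.unsqueeze(0)) & (pair_ids.unsqueeze(1) >= 0)
    return causal & (same_doc | same_pair)

# Example: documents 0 and 1 form one cross-lingual pair; document 2 is unpaired.
doc_ids = torch.tensor([0, 0, 1, 1, 2, 2])
pair_ids = torch.tensor([7, 7, 7, 7, -1, -1])
print(xlda_style_mask(doc_ids, pair_ids).int())
```

Under this assumed formulation, tokens of document 1 can attend back to document 0 (its paired English text), while document 2 remains isolated as in ordinary packed pretraining.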
Community
Technical report for Trillion-7B, Trillion Labs' latest large language model designed to push the boundaries of multilingual scalability and performance.