DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection • Paper • 2502.11546 • Published Feb 17, 2025
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora • Paper • 2505.14045 • Published May 20, 2025