Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs
Abstract
We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLMs has seen unprecedented advances in scale and capability in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes when training deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves results competitive with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available for our commercial customers.
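The abstract credits depth-scaled sandwich normalization with removing loss spikes in deep models. Below is a minimal, hedged sketch of what a sandwich-normalized Transformer block could look like: a normalization layer both before and after each sublayer, with the post-sublayer gain initialized as a function of the total depth. The class name `SandwichBlock` and the specific scaling rule `(2 * n_layers) ** -0.5` are illustrative assumptions, not the exact scheme defined in the paper.

```python
# Hypothetical sketch of a sandwich-normalized Transformer block with a
# depth-dependent gain initialization. The exact depth-scaling rule used in
# Pangu Ultra is specified in the paper; the constant below is an assumption.
import torch
import torch.nn as nn


class SandwichBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Sandwich normalization: a norm both before and after each sublayer.
        self.pre_attn_norm = nn.LayerNorm(d_model)
        self.post_attn_norm = nn.LayerNorm(d_model)
        self.pre_ffn_norm = nn.LayerNorm(d_model)
        self.post_ffn_norm = nn.LayerNorm(d_model)
        # Assumed depth-scaled initialization: shrink the post-norm gain as
        # total depth grows so residual updates stay small early in training.
        scale = (2.0 * n_layers) ** -0.5
        for norm in (self.post_attn_norm, self.post_ffn_norm):
            nn.init.constant_(norm.weight, scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pre_attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.post_attn_norm(attn_out)
        x = x + self.post_ffn_norm(self.ffn(self.pre_ffn_norm(x)))
        return x


if __name__ == "__main__":
    block = SandwichBlock(d_model=512, n_heads=8, n_layers=94)
    out = block(torch.randn(2, 16, 512))
    print(out.shape)  # torch.Size([2, 16, 512])
```

The intuition behind the extra post-sublayer norm is that it bounds the magnitude of each residual branch, which is one common route to suppressing loss spikes in very deep stacks; the depth-dependent gain shown here is only one plausible way to realize the "depth-scaled" part.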