Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
Abstract
Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and used with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, bridges the knowledge discrepancy between the pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a 70-billion-parameter model, LoRAM enables training on a GPU with only 20GB of HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM, which combines structured pruning with 4-bit quantization, reduces the parameter storage cost that dominates memory usage during low-rank matrix training for LLaMA-3.1-70B (LLaMA-2-70B) by 15.81× (16.95×), while outperforming both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B).
Community
Dear AK and HF Team,
Buckle up for a wild ride into the world of large language models! 🚀 Ever wished you could fine-tune massive LLMs without needing a full-blown data center? Well, dream no more! Our new approach, LoRAM, is here to train small and infer large—bringing you memory-efficient LoRA training without sacrificing performance.
Imagine turning a 70-billion-parameter beast into a nimble, memory-efficient marvel—like transforming an elephant into a sleek race car! 🐘➡️🏎️ We take the classic LoRA method, give it a trendy haircut by pruning away those underutilized neurons 💇‍♂️, and then recover the pruned low-rank matrices to supercharge the full model during inference.
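To make the "train small, infer large" recipe a bit more concrete, here is a minimal PyTorch sketch of the recovery idea, assuming structured pruning that drops output neurons by index and that only the LoRA B matrix needs re-expansion; all names and shapes are illustrative, not the paper's actual implementation.

```python
import torch

def recover_lora_matrix(pruned_B, kept_rows, full_out_dim):
    """Place rows learned on the pruned layer back at their original
    indices and zero-fill the positions that were pruned away."""
    full_B = torch.zeros(full_out_dim, pruned_B.shape[1], dtype=pruned_B.dtype)
    full_B[kept_rows] = pruned_B
    return full_B

# Toy layer: 8 output neurons in the original model, 5 survive structured pruning.
kept_rows = torch.tensor([0, 2, 3, 5, 7])     # indices kept by the pruning mask
rank = 2
pruned_B = torch.randn(len(kept_rows), rank)  # LoRA B trained on the small model
full_B = recover_lora_matrix(pruned_B, kept_rows, full_out_dim=8)

# At inference, the recovered low-rank update is merged into the *original* weight.
A = torch.randn(rank, 8)                      # LoRA A over the unpruned input dim
W_original = torch.randn(8, 8)                # frozen full-size weight
W_merged = W_original + full_B @ A
```

If pruning also removes input dimensions, the A matrix would presumably need an analogous recovery along its columns.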
The Challenge 🤯
While LoRA offers a cost-effective fine-tuning solution, the memory footprint remains dominated by the original model parameters. Even LoRA training of a 70B model traditionally demands an A100-80G GPU, and full fine-tuning calls for a fleet of 15 of them. Yikes!
The LoRAM Magic 🪄
LoRAM turns this challenge on its head by:
Tiny Yet Mighty: Training on a pruned (small) model with just 20G HBM—no need for heavyweight GPUs! 🎉
Wallet-Friendly Wizardry: Using structured pruning combined with 4-bit quantization (QLoRAM) slashes storage costs by up to 16.95×, proving that efficiency and performance can indeed dance together! 💃💸 (a rough setup sketch follows this list)
Seamless Sync: Minimal-cost continual pre-training aligns the knowledge between the pruned and original models, ensuring no magic is lost in translation. 🔗✨
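For the "train small" stage itself, the setup would look roughly like standard QLoRA training on a pruned checkpoint. Below is a minimal sketch using Hugging Face transformers, peft, and bitsandbytes; the model path is a placeholder and the hyperparameters are illustrative, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the (pruned) base weights, QLoRA-style.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder path: a structurally pruned, alignment-trained checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/pruned-aligned-llama-70b",
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only lightweight low-rank adapters on top of the frozen, quantized base.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of total params
```

After training, the adapters are recovered to the original model's dimensions (as in the earlier sketch) and applied to the full-size model for inference.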
The Results 🤯🚀
With LoRAM, we not only achieve dominant performance gains over both the original 70B model and smaller LoRA-trained models but also make massive model training accessible—running on a single 20G GPU!
Curious to see the magic in action? Check out our paper and code:
Paper: Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
GitHub: https://github.com/junzhang-zj/LoRAM
We can’t wait for you to join us on this exhilarating journey where smart engineering meets a splash of neural magic! 😄🌟
Cheers,
The LoRAM Team
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Gradient Weight-normalized Low-rank Projection for Efficient LLM Training (2024)
- SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model (2025)
- DiffoRA: Enabling Parameter-Efficient LLM Fine-Tuning via Differential Low-Rank Matrix Adaptation (2025)
- LoRS: Efficient Low-Rank Adaptation for Sparse Large Language Model (2025)
- In-Context Meta LoRA Generation (2025)
- SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation (2025)
- RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation (2025)