Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
Abstract
Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and used with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, bridges the knowledge discrepancy between the pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a 70-billion-parameter model, LoRAM enables training on a GPU with only 20GB of HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM, which combines structured pruning with 4-bit quantization, reduces the parameter storage cost that dominates memory usage during low-rank matrix training for LLaMA-3.1-70B (LLaMA-2-70B) by 15.81× (16.95×), while outperforming both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B).
Community
Dear AK and HF Team,
Buckle up for a wild ride into the world of large language models! 🚀 Ever wished you could fine-tune massive LLMs without needing a full-blown data center? Well, dream no more! Our new approach, LoRAM, is here to train small and infer large—bringing you memory-efficient LoRA training without sacrificing performance.
Imagine turning a 70-billion-parameter beast into a nimble, memory-efficient marvel—like transforming an elephant into a sleek race car! 🐘➡️🏎️ We take the classic LoRA method, give it a trendy haircut by pruning away those underutilized neurons 💇‍♂️, and then recover the pruned low-rank matrices to supercharge the full model during inference.
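To make the "train small, infer large" recipe a bit more concrete, here is a minimal PyTorch sketch of the recovery idea, assuming structured pruning that drops output neurons by index and that only the LoRA B matrix needs re-expansion; all names and shapes are illustrative, not the paper's actual implementation.

```python
import torch

def recover_lora_matrix(pruned_B, kept_rows, full_out_dim):
    """Place rows learned on the pruned layer back at their original
    indices and zero-fill the positions that were pruned away."""
    full_B = torch.zeros(full_out_dim, pruned_B.shape[1], dtype=pruned_B.dtype)
    full_B[kept_rows] = pruned_B
    return full_B

# Toy layer: 8 output neurons in the original model, 5 survive structured pruning.
kept_rows = torch.tensor([0, 2, 3, 5, 7])     # indices kept by the pruning mask
rank = 2
pruned_B = torch.randn(len(kept_rows), rank)  # LoRA B trained on the small model
full_B = recover_lora_matrix(pruned_B, kept_rows, full_out_dim=8)

# At inference, the recovered low-rank update is merged into the *original* weight.
A = torch.randn(rank, 8)                      # LoRA A over the unpruned input dim
W_original = torch.randn(8, 8)                # frozen full-size weight
W_merged = W_original + full_B @ A
```

If pruning also removes input dimensions, the A matrix would presumably need an analogous recovery along its columns.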
The Challenge 🤯
While LoRA offers a cost-effective fine-tuning solution, the memory footprint remains dominated by the original model parameters. Even LoRA training of a 70B model traditionally demands an A100-80G GPU, and full fine-tuning calls for a fleet of 15 of them. Yikes!
The LoRAM Magic 🪄
LoRAM turns this challenge on its head by:
Tiny Yet Mighty: Training on a pruned (small) model with just 20G HBM—no need for heavyweight GPUs! 🎉
Wallet-Friendly Wizardry: Using structured pruning combined with 4-bit quantization (QLoRAM) slashes storage costs by up to 16.95×, proving that efficiency and performance can indeed dance together! 💃💸 (a rough setup sketch follows this list)
Seamless Sync: Minimal-cost continual pre-training aligns the knowledge between the pruned and original models, ensuring no magic is lost in translation. 🔗✨
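For the "train small" stage itself, the setup would look roughly like standard QLoRA training on a pruned checkpoint. Below is a minimal sketch using Hugging Face transformers, peft, and bitsandbytes; the model path is a placeholder and the hyperparameters are illustrative, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the (pruned) base weights, QLoRA-style.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder path: a structurally pruned, alignment-trained checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/pruned-aligned-llama-70b",
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only lightweight low-rank adapters on top of the frozen, quantized base.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of total params
```

After training, the adapters are recovered to the original model's dimensions (as in the earlier sketch) and applied to the full-size model for inference.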
The Results 🤯🚀
With LoRAM, we not only achieve dominant performance gains over both the original 70B model and smaller LoRA-trained models but also make massive model training accessible—running on a single 20G GPU!
Curious to see the magic in action? Check out our paper and code:
Paper: Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
GitHub: https://github.com/junzhang-zj/LoRAM
We can’t wait for you to join us on this exhilarating journey where smart engineering meets a splash of neural magic! 😄🌟
Cheers,
The LoRAM Team
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Gradient Weight-normalized Low-rank Projection for Efficient LLM Training (2024)
- SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model (2025)
- DiffoRA: Enabling Parameter-Efficient LLM Fine-Tuning via Differential Low-Rank Matrix Adaptation (2025)
- LoRS: Efficient Low-Rank Adaptation for Sparse Large Language Model (2025)
- In-Context Meta LoRA Generation (2025)
- SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation (2025)
- RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation (2025)