# ReplaceMe: Pruning with a Training-Free Approach
ReplaceMe is a novel method for transformer model compression that enables training-free block/layer pruning while maintaining model performance: a contiguous span of transformer blocks is removed and replaced by a single linear transformation, which is estimated from calibration activations and merged into the remaining weights, so no retraining is required.
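A minimal sketch of the core idea (the function name, shapes, and merge step here are illustrative assumptions, not the repo's actual API):

```python
import torch

def estimate_replacement(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Closed-form least-squares fit (the 'lstsq' variant): min_T ||X T - Y||_F.

    X: calibration activations entering the pruned span, shape (n_tokens, d)
    Y: activations leaving the pruned span, shape (n_tokens, d)
    """
    return torch.linalg.lstsq(X, Y).solution  # T has shape (d, d)

# Because T is linear, it can be folded into a neighboring weight matrix
# (the exact algebra depends on the layer layout), so the compressed model
# contains no extra layers and incurs no extra inference cost.
```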
| Method | Train-Free? | C3 | CMNLI | CHID (test) | WSC | HellaSwag | PIQA | Race-M | Race-H | MMLU | CMMLU | AVG | RP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 2 7B (baseline) | n/a | 43.8 | 33.0 | 41.6 | 37.5 | 71.3 | 78.1 | 33.1 | 35.5 | 46.8 | 31.8 | 45.3 | 100.0% |
| LLM-Streamline* | ❌ | **43.3** | 33.0 | 24.1 | 36.5 | **61.1** | **71.5** | 34.8 | 37.0 | 45.5 | 29.4 | 41.6 | 92.0% |
| LLMPruner* | ❌ | 29.7 | 33.4 | 28.4 | 40.4 | 54.6 | 72.0 | 22.9 | 22.0 | 25.3 | 25.0 | 35.4 | 78.2% |
| SliceGPT* | ❌ | 31.5 | 31.6 | 18.5 | 43.3 | 47.5 | 68.3 | 27.0 | 29.4 | 28.8 | 24.8 | 35.1 | 77.5% |
| LaCo* | ❌ | 39.7 | **34.4** | **36.1** | 40.4 | 55.7 | 69.8 | 23.6 | 22.6 | 26.5 | 25.2 | 37.4 | 82.7% |
| UIDL* | ❌ | 40.2 | **34.4** | 21.5 | 40.4 | 59.7 | 69.0 | 35.2 | 34.7 | 44.6 | 28.9 | 40.9 | 90.3% |
| ReplaceMe (this model) | ✅ | 42.5 | 33.0 | 25.2 | 38.5 | 59.4 | 71.1 | 35.4 | **36.7** | **46.4** | **30.4** | **41.9** | **92.5%** |
**Key:** ✅ = training-free; ❌ = requires recovery training after pruning; **bold** highlights top scores among the compressed models.

**Metrics Explained:** all benchmark columns report accuracy (%). AVG is the mean over the ten benchmarks; RP (Relative Performance) is AVG expressed as a percentage of the unpruned Llama 2 7B baseline.
🔥 Our training-free method achieves 92.5% of baseline performance, while other approaches require expensive retraining!
## Installation

```bash
pip install replaceme
# or install from source
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## Running the Pipeline

```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
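For intuition, here is a hedged sketch of the objective the cosine-similarity variant optimizes; the function name, optimizer, and hyperparameters are illustrative assumptions, not the pipeline's actual code:

```python
import torch
import torch.nn.functional as F

def fit_cosine_transform(X, Y, steps=500, lr=1e-3):
    """Optimize T so cosine similarity between X @ T and Y is maximized."""
    d = X.shape[1]
    T = torch.eye(d, requires_grad=True)  # start from the identity map
    opt = torch.optim.Adam([T], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Negative mean cosine similarity between transformed and target activations
        loss = -F.cosine_similarity(X @ T, Y, dim=-1).mean()
        loss.backward()
        opt.step()
    return T.detach()
```

Unlike the closed-form least-squares fit, this variant matches activation directions rather than raw values, which can matter when activations are later renormalized.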
There are many parameters you can play with; visit our repo to discover them 🔥🔥
As noted above, the estimated linear transformations (LTs) are merged into the original transformer weights, so you use the pruned model exactly like any other Hugging Face model:
## EXAMPLE
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama2-5B-ReplaceMe"

# Load the pruned model; it behaves like any standard causal LM
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]

# Format the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)
response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(response)
```
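Note that `batch_decode` on the full `output` includes the echoed prompt. If you only want the newly generated text, slice off the prompt tokens before decoding (standard Transformers usage):

```python
# Keep only the tokens generated after the prompt, then decode them
generated = output[0][model_inputs.input_ids.shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
```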
## Citation

If you use ReplaceMe in your research, please cite our paper:
```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```