# ReplaceMe: Pruning with a Training-Free Approach
ReplaceMe is a novel method for transformer model compression that enables training-free block/layer pruning while maintaining model performance: a contiguous span of transformer blocks is removed and replaced by a single linear transformation, which is estimated from calibration activations and merged into the remaining weights, so no retraining is required.
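A minimal sketch of the core idea (the function name, shapes, and merge step here are illustrative assumptions, not the repo's actual API):

```python
import torch

def estimate_replacement(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Closed-form least-squares fit (the 'lstsq' variant): min_T ||X T - Y||_F.

    X: calibration activations entering the pruned span, shape (n_tokens, d)
    Y: activations leaving the pruned span, shape (n_tokens, d)
    """
    return torch.linalg.lstsq(X, Y).solution  # T has shape (d, d)

# Because T is linear, it can be folded into a neighboring weight matrix
# (the exact algebra depends on the layer layout), so the compressed model
# contains no extra layers and incurs no extra inference cost.
```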
| Method | Train-Free? | C3 | CMNLI | CHID (test) | WSC | HellaSwag | PIQA | Race-M | Race-H | MMLU | CMMLU | AVG | RP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 2 7B (baseline) | n/a | 43.8 | 33.0 | 41.6 | 37.5 | 71.3 | 78.1 | 33.1 | 35.5 | 46.8 | 31.8 | 45.3 | 100.0% |
| LLM-Streamline* | ❌ | **43.3** | 33.0 | 24.1 | 36.5 | **61.1** | **71.5** | 34.8 | 37.0 | 45.5 | 29.4 | 41.6 | 92.0% |
| LLMPruner* | ❌ | 29.7 | 33.4 | 28.4 | 40.4 | 54.6 | 72.0 | 22.9 | 22.0 | 25.3 | 25.0 | 35.4 | 78.2% |
| SliceGPT* | ❌ | 31.5 | 31.6 | 18.5 | 43.3 | 47.5 | 68.3 | 27.0 | 29.4 | 28.8 | 24.8 | 35.1 | 77.5% |
| LaCo* | ❌ | 39.7 | **34.4** | **36.1** | 40.4 | 55.7 | 69.8 | 23.6 | 22.6 | 26.5 | 25.2 | 37.4 | 82.7% |
| UIDL* | ❌ | 40.2 | **34.4** | 21.5 | 40.4 | 59.7 | 69.0 | 35.2 | 34.7 | 44.6 | 28.9 | 40.9 | 90.3% |
| ReplaceMe (this model) | ✅ | 42.5 | 33.0 | 25.2 | 38.5 | 59.4 | 71.1 | 35.4 | **36.7** | **46.4** | **30.4** | **41.9** | **92.5%** |
**Key:** ✅ = training-free; ❌ = requires recovery training after pruning; **bold** highlights top scores among the compressed models.

**Metrics Explained:** all benchmark columns report accuracy (%). AVG is the mean over the ten benchmarks; RP (Relative Performance) is AVG expressed as a percentage of the unpruned Llama 2 7B baseline.
🔥 Our training-free method achieves 92.5% of baseline performance, while other approaches require expensive retraining!
## Installation

```bash
pip install replaceme
# or install from source
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## Running the Pipeline

```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
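For intuition, here is a hedged sketch of the objective the cosine-similarity variant optimizes; the function name, optimizer, and hyperparameters are illustrative assumptions, not the pipeline's actual code:

```python
import torch
import torch.nn.functional as F

def fit_cosine_transform(X, Y, steps=500, lr=1e-3):
    """Optimize T so cosine similarity between X @ T and Y is maximized."""
    d = X.shape[1]
    T = torch.eye(d, requires_grad=True)  # start from the identity map
    opt = torch.optim.Adam([T], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Negative mean cosine similarity between transformed and target activations
        loss = -F.cosine_similarity(X @ T, Y, dim=-1).mean()
        loss.backward()
        opt.step()
    return T.detach()
```

Unlike the closed-form least-squares fit, this variant matches activation directions rather than raw values, which can matter when activations are later renormalized.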
There are many parameters you can play with; visit our repo to discover them 🔥🔥
As noted above, the estimated linear transformations (LTs) are merged into the original transformer weights, so you use the pruned model exactly like any other Hugging Face model:
## EXAMPLE
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama2-5B-ReplaceMe"

# Load the pruned model; it behaves like any standard causal LM
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]

# Format the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)
response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(response)
```
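Note that `batch_decode` on the full `output` includes the echoed prompt. If you only want the newly generated text, slice off the prompt tokens before decoding (standard Transformers usage):

```python
# Keep only the tokens generated after the prompt, then decode them
generated = output[0][model_inputs.input_ids.shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
```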
## Citation

If you use ReplaceMe in your research, please cite our paper:
```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```