ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations

Model Description

ReplaceMe is a novel method for transformer model compression that enables training-free block/layer pruning while maintaining model performance through linear transformations. The approach:

  • Identifies and removes a block of consecutive layers
  • Applies a mathematically derived linear transformation to preserve information flow (see the sketch below)
  • Requires no fine-tuning or retraining
  • Works with standard transformer architectures (the linear transformations, LTs, are merged into the original model weights)
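
How the linear transformation is obtained, in a minimal sketch: collect hidden states entering and leaving the block you plan to remove on a small calibration set, then solve a least-squares problem for a single matrix mapping one to the other. The tensor names, shapes, and the use of torch.linalg.lstsq below are illustrative assumptions, not the exact implementation from the repository.

import torch

# Hypothetical calibration activations: N tokens with hidden size d.
# X: hidden states entering the pruned block, Y: hidden states leaving it.
N, d = 4096, 4096
X = torch.randn(N, d)
Y = torch.randn(N, d)

# Closed-form least-squares estimate of a linear transform T with X @ T ~= Y
# (the "LSTSQ" variant referenced in Basic Usage below).
T = torch.linalg.lstsq(X, Y).solution  # shape (d, d)

# T can then be folded into the weights of the layer preceding the removed
# block (e.g. its output projection), so inference adds no extra modules.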

Key Features

  • 🚀 Zero-Training Pruning: Remove layers without any fine-tuning
  • 🧠 Performance Preservation: <8% accuracy drop in most cases
  • ⚡ Instant Speedup: Fewer blocks mean faster inference and lower memory use
  • 🔌 Plug-and-Play: Works with existing HuggingFace models

🔥 Performance Comparison of Pruning Methods (Llama 2 7B, 25% Compression)

| Method | Train-Free? | C3 | CMNLI | CHID (test) | WSC | HellaSwag | PIQA | Race-M | Race-H | MMLU | CMMLU | AVG | RP |
| Llama 2 7B (baseline) | – | 43.8 | 33.0 | 41.6 | 37.5 | 71.3 | 78.1 | 33.1 | 35.5 | 46.8 | 31.8 | 45.3 | 100.0% |
| LLM-Streamline* | ❌ | 🏆 43.3 | 33.0 | 24.1 | 36.5 | 🏆 61.1 | 🏆 71.5 | 34.8 | 37.0 | 45.5 | 29.4 | 41.6 | 92.0% |
| LLMPruner* | ❌ | 29.7 | 33.4 | 28.4 | 40.4 | 54.6 | 72.0 | 22.9 | 22.0 | 25.3 | 25.0 | 35.4 | 78.2% |
| SliceGPT* | ❌ | 31.5 | 31.6 | 18.5 | 43.3 | 47.5 | 68.3 | 27.0 | 29.4 | 28.8 | 24.8 | 35.1 | 77.5% |
| LaCo* | ❌ | 39.7 | 🏆 34.4 | 🏆 36.1 | 40.4 | 55.7 | 69.8 | 23.6 | 22.6 | 26.5 | 25.2 | 37.4 | 82.7% |
| UIDL* | ❌ | 40.2 | 🏆 34.4 | 21.5 | 40.4 | 59.7 | 69.0 | 35.2 | 34.7 | 44.6 | 28.9 | 40.9 | 90.3% |
| ReplaceMe (this model) | ✅ | 42.5 | 33.0 | 25.2 | 38.5 | 59.4 | 71.1 | 35.4 | 🏆 36.7 | 🏆 46.4 | 🏆 30.4 | 🏆 41.9 | 🏆 92.5% |

Key:

  • πŸ† Best performance in column
  • βœ… Training-free (our methods)
  • ❌ Requires training
  • *Numbers taken from Streamline paper

Metrics Explained:

  • RP: Relative Performance (% of baseline)
  • Bold: Best training-free results
  • All numbers are accuracy scores

🔥 Our training-free methods achieve 92.5% of baseline performance while other approaches require expensive retraining!

Installation

pip install replaceme
# or
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .

Basic Usage

# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml

There are many parameters you can experiment with; visit our repo to discover them 🔥🔥
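
As a rough illustration of how the cosine variant differs from the closed-form LSTSQ solve sketched above, the transform can instead be fitted numerically by minimizing a cosine-distance loss between transformed and target activations. The optimizer, step count, and tensors below are illustrative assumptions, not the repository's exact objective or settings.

import torch
import torch.nn.functional as F

N, d = 4096, 4096
X = torch.randn(N, d)  # activations entering the removed block
Y = torch.randn(N, d)  # activations leaving the removed block

T = torch.eye(d, requires_grad=True)   # start from the identity transform
opt = torch.optim.Adam([T], lr=1e-3)

for step in range(200):                # illustrative number of steps
    opt.zero_grad()
    loss = (1.0 - F.cosine_similarity(X @ T, Y, dim=-1)).mean()
    loss.backward()
    opt.step()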

Load Model

As noted above, the LTs are merged into the original transformer weights, so you load the pruned model exactly like any other HuggingFace model:

## EXAMPLE
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama2-5B-ReplaceMe"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is ReplaceME pruning method?!"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)
response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
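
The decoded output also contains the prompt text, since generate returns the full sequence; print it to inspect the answer:

print(response)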

Citation

If you use ReplaceMe in your research, please cite our paper:

@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}