ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
Abstract
ARMOR, a novel post-training pruning algorithm, compresses large language models by factorizing each weight matrix into a 2:4 sparse core and two block diagonal wrapper matrices, achieving better accuracy than conventional 2:4 pruning while retaining its memory and speed benefits.
Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality than conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove that this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to that of state-of-the-art pruning algorithms. Experiments on the Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
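To make the factorization concrete, here is a minimal PyTorch sketch of the structure the abstract describes: a 2:4 projection of the weight matrix as the sparse core, identity-initialized block diagonal wrappers, and a layer-wise proxy reconstruction error measured on calibration activations. The block size, the identity initialization, the matrix sizes, and the placement of the wrappers on either side of the core are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the ARMOR-style factorization W ≈ A @ S @ B, where S is 2:4 sparse and
# A, B are block diagonal "wrapper" matrices. All sizes and the identity init are
# hypothetical; this only illustrates the shape of the factorization.
import torch

def project_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every group of 4 along the last dim."""
    rows, cols = w.shape
    groups = w.reshape(rows, cols // 4, 4)
    idx = groups.abs().topk(2, dim=-1).indices            # indices of the 2 survivors
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(rows, cols)

def block_diagonal(blocks: list[torch.Tensor]) -> torch.Tensor:
    """Assemble a dense block diagonal matrix from square blocks (illustrative only)."""
    return torch.block_diag(*blocks)

torch.manual_seed(0)
d_out, d_in, b = 64, 128, 8                                # hypothetical sizes, block size b
W = torch.randn(d_out, d_in)

# Wrappers start as identity blocks; the core starts as the 2:4 projection of W.
A = block_diagonal([torch.eye(b) for _ in range(d_out // b)])   # d_out x d_out
B = block_diagonal([torch.eye(b) for _ in range(d_in // b)])    # d_in x d_in
S = project_2_to_4(W)

X = torch.randn(256, d_in)                                 # calibration activations
proxy_loss = ((X @ W.T - X @ (A @ S @ B).T) ** 2).mean()   # layer-wise reconstruction error
print(f"initial proxy loss: {proxy_loss.item():.4f}")
```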
Community
ARMOR is a new semi-structured pruning method built on a matrix factorization consisting of a 2:4 sparse core surrounded by two lightweight block diagonal matrices. We introduce a one-shot compression algorithm based on this factorization and show that it significantly and consistently outperforms existing one-shot semi-structured pruning algorithms on an industry-standard suite of benchmarks.
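For a feel of the alternating optimization, the toy loop below applies a block-coordinate pattern to the proxy loss ||X W^T - X (A S B)^T||_F^2: each factor is updated in turn while the others are held fixed, and the core is re-projected onto the 2:4 pattern after every update. The paper's exact coordinate updates and convergence guarantee are not reproduced here; the simple projected gradient steps, block size, learning rate, and step count are placeholders for illustration.

```python
# Toy block-coordinate loop for W ≈ A @ S @ B under a layer-wise proxy loss.
# NOT the paper's update rules: each coordinate block gets a plain gradient step,
# and S is re-projected onto 2:4 sparsity afterwards.
import torch

def project_2_to_4(w):
    rows, cols = w.shape
    g = w.reshape(rows, cols // 4, 4)
    idx = g.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(g).scatter_(-1, idx, 1.0)
    return (g * mask).reshape(rows, cols)

torch.manual_seed(0)
d_out, d_in, b, lr = 64, 128, 8, 1e-3                      # hypothetical sizes and step size
W = torch.randn(d_out, d_in)
X = torch.randn(256, d_in)                                 # calibration activations
Y = X @ W.T                                                # dense layer output to match

# Wrappers stored as stacks of b x b blocks; the core is kept 2:4 sparse.
A_blocks = torch.eye(b).repeat(d_out // b, 1, 1).requires_grad_()
B_blocks = torch.eye(b).repeat(d_in // b, 1, 1).requires_grad_()
S = project_2_to_4(W).requires_grad_()

def assemble(blocks):
    """Build the dense block diagonal wrapper from its stack of blocks."""
    return torch.block_diag(*blocks)

for step in range(100):
    for param in (A_blocks, S, B_blocks):                  # one coordinate block at a time
        W_hat = assemble(A_blocks) @ S @ assemble(B_blocks)
        loss = ((Y - X @ W_hat.T) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, [param])
        with torch.no_grad():
            param -= lr * grad
            if param is S:                                 # keep the core on the 2:4 pattern
                param.copy_(project_2_to_4(param))

print(f"final proxy loss: {loss.item():.6f}")
```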
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PATCH: Learnable Tile-level Hybrid Sparsity for LLMs (2025)
- NIRVANA: Structured pruning reimagined for large language models compression (2025)
- UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs (2025)
- CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models (2025)
- PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models (2025)
- Lightweight error mitigation strategies for post-training N:M activation sparsity in LLMs (2025)
- COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning (2025)