EfficientLLM: Efficiency in Large Language Models
Abstract
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48 × GH200, 8 × H200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency–throughput balance, and carbon cost. Evaluating over 100 model–technique pairs (0.5B–72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9× at a 3–5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers the best memory–latency trade-off for constrained devices, MLA achieves the lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA in efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.
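To make the attention variants concrete, below is a minimal PyTorch sketch of grouped-query attention (GQA); multi-query attention (MQA) is the special case with a single KV head, and standard multi-head attention uses one KV head per query head. The module name, dimensions, and hyperparameters are illustrative assumptions, not the EfficientLLM codebase.

```python
# Minimal grouped-query attention (GQA) sketch in PyTorch.
# MQA is the special case num_kv_heads == 1; standard MHA is num_kv_heads == num_heads.
# Illustrative only; names and sizes are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        # Fewer KV heads -> smaller KV cache at inference time.
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one KV head.
        group = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

# MQA: GroupedQueryAttention(d_model=512, num_heads=8, num_kv_heads=1)
# GQA: GroupedQueryAttention(d_model=512, num_heads=8, num_kv_heads=2)
```

Reducing the number of KV heads shrinks the KV cache, which is the main source of the memory and latency savings that the MQA/GQA results above refer to.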
Community
Large Language Models (LLMs) have catalyzed dramatic progress, yet their ballooning parameter counts (e.g., DeepSeek-R1 at 671B) and context windows impose prohibitive compute (∼3,640 petaflop/s-days for GPT-3 training), energy, and monetary footprints (an estimated >$4.6M for GPT-3). We introduce EfficientLLM, presenting a novel benchmark definition and the results of the first end-to-end empirical study of efficiency techniques for LLMs, spanning over one hundred model–technique combinations. Executed on a production-class cluster (48 × GH200, 8 × H200 GPUs), a setting essential for accurately measuring real-world performance and energy trade-offs, our study is grounded in a unified three-axis taxonomy: architecture pretraining, fine-tuning, and inference.

We focus on these three axes because of their direct practical implications for different stakeholders in the LLM lifecycle: (1) architecture pretraining provides actionable insights for researchers and engineers designing new model architectures, enabling accurate budgeting of computational resources and energy costs; (2) fine-tuning benchmarks guide practitioners who adapt pretrained base models to specific downstream tasks or domains, helping them select efficient parameter-efficient fine-tuning (PEFT) methods; and (3) bit-width quantization evaluations inform deployment engineers on how to reduce serving costs and latency through quantization techniques that can be deployed directly without retraining.

For architecture pretraining, we extensively evaluate efficient attention variants (MQA, GQA, MLA, NSA) and sparse Mixture-of-Experts (MoE). For fine-tuning, we benchmark diverse PEFT methods (LoRA, RSLoRA, DoRA). For inference, we evaluate model compression methods, including post-training quantization down to int4 and float16. We use six orthogonal, fine-grained metrics (Average-Memory-Utilization, Peak-Compute-Utilization, Average-Latency, Average-Throughput, Average-Energy-Consumption, Model-Compression-Rate) to jointly capture hardware saturation, latency–throughput balance, and carbon cost.

Our benchmark evaluates over 100 model–technique pairs, covering 0.5B–72B parameter LLMs, and yields three core insights: (i) Efficiency Involves Quantifiable Trade-offs: no single method is universally optimal; every technique improves at least one metric while regressing another. For instance, MoE trims FLOPs and lifts accuracy but inflates VRAM by 40%, whereas int4 quantization cuts memory/energy by up to 3.9× at a measured 3–5% average task-score drop. (ii) Optima Are Task- and Scale-Dependent: the best choice is highly context-dependent; MQA offers the best memory–latency frontier for constrained devices, MLA yields the lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA's efficiency only beyond 14B parameters, highlighting complex interactions between task, scale, and hardware. (iii) Broad Applicability Across Modalities: we extend our evaluation framework to Large Vision Models (LVMs) and Vision-Language Models (VLMs), applying the same efficiency techniques to models such as Stable Diffusion 3.5, Wan 2.1, and Qwen2.5-VL. Techniques validated on LLMs transfer effectively, with MQA/GQA improving LVM generation quality (FID scores) and PEFT methods achieving strong performance-efficiency trade-offs.
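As a concrete reference point for the PEFT comparison above, the sketch below shows a minimal LoRA adapter layer and the single change that distinguishes rank-stabilized LoRA (RSLoRA): the scaling factor alpha/r becomes alpha/sqrt(r), which keeps the update magnitude stable as the rank grows. The class name, rank, and initialization are illustrative assumptions rather than the paper's implementation.

```python
# Minimal LoRA / rank-stabilized LoRA (RSLoRA) adapter sketch in PyTorch.
# The only difference shown is the scaling factor: LoRA uses alpha / r,
# RSLoRA uses alpha / sqrt(r). Names and defaults are illustrative assumptions.
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0,
                 rank_stabilized: bool = False):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        # RSLoRA rescales by sqrt(r) so the low-rank update does not shrink as r grows.
        self.scaling = alpha / math.sqrt(r) if rank_stabilized else alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: wrap a projection layer of a pretrained model with an RSLoRA adapter.
layer = LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32.0, rank_stabilized=True)
```

Because only the low-rank matrices are trainable while the base weight stays frozen, both variants fine-tune a small fraction of the parameters; the scaling difference is what lets RSLoRA remain effective at the larger ranks typically used for bigger models.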
Our study provides comprehensive insights into these selected aspects, while other important efficiency-related topics, such as training infrastructure optimization, reinforcement learning for post-training alignment, and test-time scaling strategies, are beyond the scope of this paper. We briefly review these additional directions in the related work section and highlight them as promising avenues for future exploration. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides a crucial compass for academics and engineers navigating the efficiency–performance landscape of next-generation foundation models.
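For readers who want to reproduce the flavor of the utilization and energy metrics referenced above, the following sketch samples per-GPU memory utilization, compute utilization, and power via NVML and integrates power into an energy estimate. It is a simplified illustration under assumed sampling intervals, not the released EfficientLLM evaluation pipeline.

```python
# Illustrative sketch of sampling per-GPU utilization, memory, and power with NVML,
# in the spirit of the memory/compute/energy metrics; NOT the released pipeline.
# Requires the NVIDIA management library bindings: pip install nvidia-ml-py
import time
import pynvml

def sample_gpu_metrics(device_index: int = 0, duration_s: float = 10.0, interval_s: float = 0.5):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    mem_util, compute_util, energy_j = [], [], 0.0
    t_end = time.time() + duration_s
    while time.time() < t_end:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        mem_util.append(mem.used / mem.total)
        compute_util.append(util.gpu / 100.0)
        energy_j += power_w * interval_s                           # crude power-over-time integral
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return {
        "avg_memory_utilization": sum(mem_util) / len(mem_util),
        "peak_compute_utilization": max(compute_util),
        "energy_joules": energy_j,
    }

# Example: run alongside a training or inference job on GPU 0.
# print(sample_gpu_metrics(0, duration_s=30.0))
```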