LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
Abstract
LUT-LLM, an FPGA accelerator, improves LLM inference efficiency by shifting computation to memory-based operations, achieving lower latency and higher energy efficiency compared to GPUs.
The rapid progress of large language models (LLMs) has advanced numerous applications, yet efficient single-batch inference remains vital for on-device intelligence. While FPGAs offer fine-grained data control and high energy efficiency, recent GPU optimizations have narrowed the FPGA advantage, especially for arithmetic-based computation. To overcome this, we leverage the abundant on-chip memory of FPGAs to shift LLM inference from arithmetic-based to memory-based computation through table lookups. We present LUT-LLM, the first FPGA accelerator to enable inference of LLMs with more than 1B parameters via vector-quantized memory operations. Our analysis identifies activation-weight co-quantization as the most effective scheme, supported by (1) bandwidth-aware parallel centroid search, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design that minimizes data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B model, LUT-LLM achieves 1.66x lower latency than an AMD MI210 GPU and 1.72x higher energy efficiency than an NVIDIA A100, and it scales to 32B models with a 2.16x efficiency gain over the A100.
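To make the table-lookup idea concrete, below is a minimal NumPy sketch of the general approach the abstract describes: weight and activation sub-vectors are each mapped to codebook centroids, all centroid-pair dot products are precomputed into a 2D table per group, and the GEMV reduces to a nearest-centroid search followed by index lookups and accumulation. This is an illustrative assumption of how such a scheme can work, not the paper's implementation; all shapes, codebook sizes, and names (`d`, `G`, `Kw`, `Ka`, `gemv_lut`) are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation) of GEMV via
# activation-weight co-quantization and 2D table lookups.
# Assumption: the input dimension is split into G sub-vectors of length d,
# each with its own weight and activation codebook.

rng = np.random.default_rng(0)

d  = 4     # sub-vector length
G  = 64    # number of sub-vector groups (input dim = G * d)
N  = 128   # output dimension
Kw = 256   # weight codebook size per group
Ka = 16    # activation codebook size per group

# Per-group codebooks (assumed learned offline, e.g. by k-means).
w_cb = rng.standard_normal((G, Kw, d)).astype(np.float32)
a_cb = rng.standard_normal((G, Ka, d)).astype(np.float32)

# Vector-quantized weight: for each output row and group, an index into w_cb.
w_idx = rng.integers(0, Kw, size=(N, G))

# 2D lookup table: dot product of every (activation centroid, weight centroid)
# pair, one table per group. Shape: (G, Ka, Kw).
lut = np.einsum('gad,gkd->gak', a_cb, w_cb)

def gemv_lut(x):
    """Approximate y = W @ x using centroid search plus table lookups."""
    x_groups = x.reshape(G, d)
    # Nearest-centroid search per group (the "centroid search" step).
    dists = ((x_groups[:, None, :] - a_cb) ** 2).sum(-1)           # (G, Ka)
    a_idx = dists.argmin(axis=1)                                   # (G,)
    # Gather precomputed partial dot products and accumulate over groups.
    partial = lut[np.arange(G)[None, :], a_idx[None, :], w_idx]    # (N, G)
    return partial.sum(axis=1)

x = rng.standard_normal(G * d).astype(np.float32)
y_approx = gemv_lut(x)

# Reference: dequantize weights, quantize activations the same way, then do a
# normal matmul; both paths should agree up to floating-point error.
W_deq = w_cb[np.arange(G)[None, :], w_idx]                         # (N, G, d)
a_idx = ((x.reshape(G, d)[:, None, :] - a_cb) ** 2).sum(-1).argmin(1)
x_q   = a_cb[np.arange(G), a_idx]                                  # (G, d)
y_ref = np.einsum('ngd,gd->n', W_deq, x_q)
assert np.allclose(y_approx, y_ref, atol=1e-4)
```

The appeal for an FPGA is that the inner loop becomes memory reads (centroid search plus `lut` gathers) rather than multiply-accumulates, which maps naturally onto abundant on-chip SRAM; the specific dataflow, table layout, and bandwidth-aware search in LUT-LLM are described in the paper itself.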
Community
Code will be released soon.
The following similar papers were recommended by the Semantic Scholar API:
- TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs (2025)
- TENET: An Efficient Sparsity-Aware LUT-Centric Architecture for Ternary LLM Inference On Edge (2025)
- SAIL: SRAM-Accelerated LLM Inference System with Lookup-Table-based GEMV (2025)
- SpecMamba: Accelerating Mamba Inference on FPGA with Speculative Decoding (2025)
- HPIM: Heterogeneous Processing-In-Memory-based Accelerator for Large Language Models Inference (2025)
- AxLLM: accelerator architecture for large language models with computation reuse capability (2025)
- Scaling LLM Test-Time Compute with Mobile NPU on Smartphones (2025)