FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
Abstract
Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) while preserving model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. Supporting FP6 quantization on GPUs is challenging due to (1) unfriendly memory access patterns for model weights with irregular bit-widths and (2) the high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for floating-point weights of various quantization bit-widths. We integrate the TC-FPx kernel into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved. Experiments show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. The source code will be publicly available soon.
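For a concrete picture of what FP6 de-quantization involves, below is a minimal host-side NumPy sketch that decodes 6-bit floating-point weight codes into FP16 values, assuming a 1-bit sign / 3-bit exponent / 2-bit mantissa (e3m2) layout with an exponent bias of 3 and no Inf/NaN encodings. The exact bit layout, per-channel scaling, and the bit-parallel GPU de-quantization inside the TC-FPx kernel differ from this simple reference.

```python
import numpy as np

def decode_fp6_e3m2(code: int) -> float:
    """Decode one 6-bit float code, assumed to be 1 sign / 3 exponent / 2 mantissa bits.

    Assumes an exponent bias of 3 and no Inf/NaN encodings; this is an
    illustrative reference, not the bit layout guaranteed by FP6-LLM.
    """
    sign = -1.0 if (code >> 5) & 0x1 else 1.0
    exp = (code >> 2) & 0x7      # 3 exponent bits
    man = code & 0x3             # 2 mantissa bits
    if exp == 0:                 # subnormal: (-1)^s * 2^(1-bias) * (man / 4)
        return sign * (man / 4.0) * 2.0 ** (1 - 3)
    return sign * (1.0 + man / 4.0) * 2.0 ** (exp - 3)

# De-quantize a small batch of 6-bit codes to FP16, the input precision of the GEMM.
codes = np.array([0b000000, 0b000101, 0b100111, 0b011111], dtype=np.uint8)
weights_fp16 = np.array([decode_fp6_e3m2(int(c)) for c in codes], dtype=np.float16)
print(weights_fp16)  # approx. [0.0, 0.3125, -0.4375, 28.0] under the assumed layout
```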
Community
How does the speed compare to FP8?
When Microsoft got involved, I figured we'd have a nice .msi or .exe one-click installer for Windows. Instead, they seem to have forgotten or given up on Windows, and it's all Python pip this, NumPy wheels that, Ubuntu blah blah blah. Earth calling Microsoft: you seem to have forgotten your identity. It's time to come home.
How does this compare to 4-bit and 5-bit quantization methods? Why 6?
According to the paper, 4-bit is fine for short contexts but starts to lose accuracy as the context grows. They've determined 6-bit to be the optimal precision for the majority of tasks, which is kind of annoying because it would make a 33B model 24.75 GB (a quick size estimate is sketched below).
I'll stick to 5-bit for medium-size models, and use 6-bit for smaller models.
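For reference, the 24.75 GB figure follows directly from the weight-only storage arithmetic. A rough sketch, which ignores quantization metadata (scales/zero points), activations, and the KV cache, all of which add more:

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for weight-only quantization.

    Ignores quantization metadata, activations, and KV cache, so real
    memory use is somewhat higher.
    """
    return n_params * bits_per_weight / 8 / 1e9

for bits in (4, 5, 6, 8, 16):
    print(f"33B model at {bits}-bit: {quantized_weight_gb(33e9, bits):.2f} GB")
# 6-bit: 33e9 * 6 / 8 bytes = 24.75 GB, matching the figure above.
```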
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Enabling Fast 2-bit LLM on GPUs: Memory Alignment and Asynchronous Dequantization (2023)
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks (2023)
- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs (2024)
- Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge (2023)
- SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
How does the speed compare to FP8?
FP6-LLM is weight-only quantization (i.e., W6A16) and uses FP16 Tensor Cores for computation. If the FP8 quantization is also weight-only (i.e., W8A16), it will require more memory traffic than W6A16 and is expected to have higher memory-access overhead. If the FP8 is W8A8 and utilizes FP8 Tensor Cores, that is another story.
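To make the memory-traffic argument concrete, the rough sketch below estimates the bytes of weights streamed per decoded token, under the simplifying assumption that every weight is read from GPU memory once per token; the ~70B parameter count is approximate, and activations, KV cache, and quantization metadata are ignored.

```python
# Rough weight-memory traffic per decoded token for a memory-bound GEMV,
# assuming every weight is streamed from HBM exactly once per token.
LLAMA_70B_PARAMS = 70e9  # approximate parameter count (assumption)

def weight_traffic_gb_per_token(bits_per_weight: int,
                                n_params: float = LLAMA_70B_PARAMS) -> float:
    return n_params * bits_per_weight / 8 / 1e9

w6a16 = weight_traffic_gb_per_token(6)  # FP6-LLM style (W6A16)
w8a16 = weight_traffic_gb_per_token(8)  # hypothetical weight-only FP8 (W8A16)
print(f"W6A16: {w6a16:.1f} GB/token, W8A16: {w8a16:.1f} GB/token, "
      f"ratio: {w8a16 / w6a16:.2f}x more traffic for W8A16")
```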
How does this compare to 4-bit and 5-bit quantization methods? Why 6?
From a system perspective, the methods in FP6-LLM are applicable to other bit-widths (e.g., 5-bit).
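To illustrate why irregular bit-widths make weight memory access unfriendly, the generic sketch below packs x-bit codes contiguously into 32-bit words and unpacks them again; note how a 6-bit code can straddle a word boundary. This is a plain host-side reference, not the actual TC-FPx memory layout, which pre-arranges the bits ahead of time for GPU-friendly access.

```python
import numpy as np

def pack_codes(codes: np.ndarray, bits: int) -> np.ndarray:
    """Pack unsigned integer codes of width `bits` contiguously into uint32 words."""
    total_bits = bits * len(codes)
    words = np.zeros((total_bits + 31) // 32, dtype=np.uint32)
    for i, c in enumerate(codes):
        bit = i * bits
        word, off = divmod(bit, 32)
        words[word] |= np.uint32((int(c) << off) & 0xFFFFFFFF)
        if off + bits > 32:  # code straddles a 32-bit word boundary
            words[word + 1] |= np.uint32(int(c) >> (32 - off))
    return words

def unpack_codes(words: np.ndarray, bits: int, n: int) -> np.ndarray:
    """Inverse of pack_codes: recover n codes of width `bits`."""
    out = np.empty(n, dtype=np.uint32)
    mask = (1 << bits) - 1
    for i in range(n):
        bit = i * bits
        word, off = divmod(bit, 32)
        val = int(words[word]) >> off
        if off + bits > 32:
            val |= int(words[word + 1]) << (32 - off)
        out[i] = val & mask
    return out

codes = np.arange(16, dtype=np.uint32) % (1 << 6)  # sixteen 6-bit codes
packed = pack_codes(codes, bits=6)                 # 96 bits -> three uint32 words
assert (unpack_codes(packed, bits=6, n=16) == codes).all()
```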