DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (arXiv:2410.10819, published Oct 14, 2024)
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (arXiv:2306.00978, published Jun 1, 2023)
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference (arXiv:2406.10774, published Jun 16, 2024)
OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization (arXiv:2304.07493, published Apr 15, 2023)