LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Paper • 2502.14866 • Published Feb 20, 2025
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads Paper • 2410.10819 • Published Oct 14, 2024
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference Paper • 2406.10774 • Published Jun 16, 2024
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Paper • 2405.04532 • Published May 7, 2024
Retrieval Head Mechanistically Explains Long-Context Factuality Paper • 2404.15574 • Published Apr 24, 2024
InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory Paper • 2402.04617 • Published Feb 7, 2024
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models Paper • 2211.10438 • Published Nov 18, 2022
Efficient Streaming Language Models with Attention Sinks Paper • 2309.17453 • Published Sep 29, 2023
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention Paper • 2305.10431 • Published May 17, 2023