Star Attention: Efficient LLM Inference over Long Sequences
Abstract
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
- Publication:
-
arXiv e-prints
- Pub Date:
- November 2024
- DOI:
- 10.48550/arXiv.2411.17116
- arXiv:
- arXiv:2411.17116
- Bibcode:
- 2024arXiv241117116A
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Artificial Intelligence;
- Computer Science - Machine Learning
- E-Print:
- Code: https://github.com/NVIDIA/Star-Attention