Star Attention: Efficient LLM Inference over Long Sequences

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
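The two phases described above can be pictured as a block-sparse attention mask. The following is a minimal sketch in plain Python, assuming a single sequence laid out as context tokens followed by query tokens; the function name `star_attention_mask` and its parameters are hypothetical and are not taken from the released code. It illustrates only the attention pattern named in the abstract (blockwise-local for context, sequence-global for query tokens), not the multi-host sharding or any other detail of the actual implementation.

```python
def star_attention_mask(context_len, query_len, block_size):
    """Return mask[i][j] == True when token i may attend to token j.

    Hypothetical helper for illustration only.
    Tokens 0..context_len-1 are context; the rest are query/response tokens.
    """
    total = context_len + query_len
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i  # never attend to future tokens
            if i >= context_len:
                # Phase 2: query tokens attend to all prior tokens.
                allowed = causal
            else:
                # Phase 1: context tokens attend only within their own block.
                allowed = causal and (i // block_size == j // block_size)
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two context blocks of 4 tokens each, followed by 2 query tokens.
    for row in star_attention_mask(context_len=8, query_len=2, block_size=4):
        print("".join("x" if allowed else "." for allowed in row))
```

In this toy layout, each context block could be processed independently of the others, which is what would allow the first phase to run in parallel; only the query rows span the whole sequence.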

Publication: arXiv e-prints

Pub Date: November 2024

DOI: 10.48550/arXiv.2411.17116

arXiv: arXiv:2411.17116

Bibcode: 2024arXiv241117116A

Keywords: Computer Science - Computation and Language; Computer Science - Artificial Intelligence; Computer Science - Machine Learning

E-Print: Code: https://github.com/NVIDIA/Star-Attention
