Cached Transformers: Improving Transformers with Differentiable Memory Cache
Abstract
This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, enlarging the receptive field of attention and enabling the exploration of long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOps, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and demonstrates applicability to a broader range of scenarios.
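The mechanism described above can be pictured with a small sketch: attention is computed over both the current tokens and a learned cache of past tokens, and the cache is updated recurrently through a sigmoid gate. The code below is a minimal illustrative sketch, not the authors' implementation; the module name `GRCAttentionSketch`, the cache length, the mean-pooled token summary, and the scalar mixing weight are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class GRCAttentionSketch(nn.Module):
    """Illustrative sketch of cache-augmented self-attention with a gated,
    recurrently updated token cache (assumed form, not the paper's exact one)."""

    def __init__(self, dim, num_heads=8, cache_len=64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate deciding how much of the new-token summary is written into the cache.
        self.update_gate = nn.Linear(2 * dim, dim)
        # Learnable mixing weight between cached and current attention outputs
        # (assumed combination; the paper may combine them differently).
        self.mix = nn.Parameter(torch.tensor(0.5))
        self.cache_len = cache_len

    def forward(self, x, cache):
        # x: (batch, seq_len, dim); cache: (batch, cache_len, dim)
        attn_cur, _ = self.self_attn(x, x, x)            # attend to current tokens
        attn_mem, _ = self.cache_attn(x, cache, cache)   # attend to cached past tokens
        out = self.mix * attn_mem + (1 - self.mix) * attn_cur

        # Recurrent, differentiable cache update: interpolate the old cache with a
        # summary of the new tokens via a sigmoid gate (GRU-like; assumed form).
        summary = x.mean(dim=1, keepdim=True).expand_as(cache)
        gate = torch.sigmoid(self.update_gate(torch.cat([cache, summary], dim=-1)))
        new_cache = (1 - gate) * cache + gate * summary
        return out, new_cache


# Example usage with hypothetical shapes:
layer = GRCAttentionSketch(dim=256)
x = torch.randn(2, 128, 256)          # a batch of current tokens
cache = torch.zeros(2, 64, 256)       # initial (empty) memory cache
y, cache = layer(x, cache)            # output plus the updated cache
```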
- Publication:
- arXiv e-prints
- Pub Date:
- December 2023
- DOI:
- 10.48550/arXiv.2312.12742
- arXiv:
- arXiv:2312.12742
- Bibcode:
- 2023arXiv231212742Z
- Keywords:
- Computer Science - Computer Vision and Pattern Recognition
- E-Print:
- AAAI 2024