Simulating Hard Attention Using Soft Attention

doi:10.48550/arXiv.2412.09925

We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several variants of linear temporal logic, whose formulas have previously been shown to be computable using hard attention transformers. We demonstrate how soft attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate a large subclass of average-hard attention transformers, those that have what we call the uniform-tieless property.
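As a rough illustration of the temperature-scaling idea mentioned in the abstract, the sketch below compares softmax weights at decreasing temperatures with average-hard attention, taken here to mean uniform weight over the maximal-score positions. It is a toy numerical example only and does not reproduce the paper's constructions; the function names, the score vector, and the temperature values are all assumptions made for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_hard(scores):
    # Average-hard attention (as assumed here): uniform weight
    # over the positions that attain the maximum score.
    m = max(scores)
    winners = [1.0 if s == m else 0.0 for s in scores]
    k = sum(winners)
    return [w / k for w in winners]

# Hypothetical attention scores with a tie between positions 1 and 3.
scores = [1.0, 3.0, 2.0, 3.0]

for t in (1.0, 0.1, 0.01):
    print(t, [round(w, 4) for w in softmax(scores, t)])
print("hard", average_hard(scores))
```

In this toy case, lowering the temperature moves the softmax weights toward the average-hard weights. How low the temperature must be depends on the gap between the best and second-best scores, which is why a fixed temperature is not enough in general.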

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2412.09925

arXiv:

arXiv:2412.09925

Bibcode:

2024arXiv241209925Y

Keywords:

Computer Science - Machine Learning;
Computer Science - Computation and Language;
Computer Science - Formal Languages and Automata Theory
