Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding

doi:10.48550/arXiv.2310.05424


To tackle the high inference latency of autoregressive language models, previous studies have proposed an early-exiting framework that allocates an adaptive computation path to each token based on the complexity of generating the subsequent token. However, we observed several shortcomings, including performance degradation caused by a state copying mechanism or by numerous exit paths, and sensitivity to exit confidence thresholds. Consequently, we propose a Fast and Robust Early-Exiting (FREE) framework, which incorporates a shallow-deep module and synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. Furthermore, as parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator that exploits a Beta mixture model to determine suitable confidence thresholds. We empirically demonstrated the superiority of our proposed framework on extensive generation tasks.
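The exit decision and the stacking of early-exited tokens described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`decode_step`, `schedule`), the use of the maximum softmax probability as the confidence score, and the fixed threshold are all assumptions made for illustration, and the Beta mixture threshold estimator is omitted. Each step is assumed to supply the shallow model's logits; a token exits early when its confidence meets the threshold, and early-exited steps are stacked until a low-confidence step triggers a deep pass that would process the whole stack together.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(shallow_logits, threshold):
    """Return (token_id, confidence, exited_early) for one token.

    Confidence is the top softmax probability of the shallow model
    (an assumed confidence measure, chosen for simplicity).
    """
    probs = softmax(shallow_logits)
    confidence = max(probs)
    token_id = probs.index(confidence)
    return token_id, confidence, confidence >= threshold


def schedule(shallow_logits_per_step, threshold):
    """Group decoding steps into batches for the deep model.

    Early-exited steps are stacked in `pending`. When a step fails the
    threshold, it and the stacked steps form one batch that a single
    deep pass would handle in parallel. Returns (batches, pending),
    where `pending` holds early-exited steps not yet synchronized.
    """
    batches = []
    pending = []
    for step, logits in enumerate(shallow_logits_per_step):
        _, _, exited = decode_step(logits, threshold)
        pending.append(step)
        if not exited:
            batches.append(pending)
            pending = []
    return batches, pending


if __name__ == "__main__":
    # Hypothetical shallow-model logits for four decoding steps.
    steps = [[5.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.1, 0.0, 0.0], [6.0, 0.0, 0.0]]
    print(schedule(steps, threshold=0.9))
```

In this toy run, steps 0 and 1 exit early, step 2 falls below the threshold and triggers one deep pass covering steps 0 to 2, and step 3 remains stacked for a later pass.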

Publication: arXiv e-prints
Pub Date: October 2023
DOI: 10.48550/arXiv.2310.05424
arXiv: arXiv:2310.05424
Bibcode: 2023arXiv231005424B
Keywords: Computer Science - Computation and Language
E-Print: EMNLP 2023 (Long)
