Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO

doi:10.48550/arXiv.2311.04951

Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation and compare it with standard autoregressive sampling. This can be used together with model-based optimizations (e.g. quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
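The difference between the two sampling loops named in the abstract can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the notebook's code: `target_next` and `draft_next` are hypothetical stand-ins for a large and a small model, the verification step uses greedy token matching rather than the probabilistic accept/reject rule of full speculative sampling, and no KV cache or OpenVINO API is involved. In a real system each verification round would be a single batched forward pass of the large model; here it is only counted.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic toy rule over token ids."""
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except at every 4th position."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 11


def autoregressive(prompt: List[int], n_new: int, model: NextToken) -> List[int]:
    """Standard loop: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_greedy(
    prompt: List[int], n_new: int, draft: NextToken, target: NextToken, k: int = 3
) -> Tuple[List[int], int]:
    """Draft proposes up to k tokens; target keeps the matching prefix and fixes the first miss."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    verify_rounds = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks the proposal (assumed to be one batched pass in practice).
        verify_rounds += 1
        for tok in proposal:
            expected = target(tokens)
            tokens.append(expected)  # always keep the target's own choice
            if expected != tok:
                break  # discard the rest of the draft after the first mismatch
    return tokens, verify_rounds


baseline = autoregressive([1, 2], 12, target_next)
speculative, rounds = speculative_greedy([1, 2], 12, draft_next, target_next)
print(speculative == baseline, rounds)
```

With greedy verification the speculative loop reproduces the target model's output exactly, while the number of verification rounds stays below the number of generated tokens whenever the draft model is right often enough.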

Publication: arXiv e-prints

Pub Date: November 2023

DOI: 10.48550/arXiv.2311.04951

arXiv: arXiv:2311.04951

Bibcode: 2023arXiv231104951B

Keywords: Computer Science - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Performance

E-Print: Code available at https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/speculative-sampling
