Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO
Abstract
Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling, which reduces the overall latency of text generation, and compare it with standard autoregressive sampling. Speculative sampling can be combined with model-based optimizations (e.g., quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
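The contrast between the two sampling methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: `target_model` and `draft_model` are hypothetical toy functions standing in for a large and a small language model, verification uses greedy exact matching rather than the probabilistic accept/reject rule of full speculative sampling, and KV caching is not modeled.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(tokens: List[int]) -> int:
    # Hypothetical stand-in for the large, accurate model.
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_model(tokens: List[int]) -> int:
    # Hypothetical stand-in for the small, fast model: it agrees with the
    # target except when the prefix length is a multiple of 4.
    nxt = target_model(tokens)
    return nxt if len(tokens) % 4 else (nxt + 1) % 50


def autoregressive(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Standard decoding: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative(target: Model, draft: Model, prompt: List[int],
                n_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1) The draft model cheaply proposes k tokens.
        proposal = list(tokens)
        for _ in range(k):
            proposal.append(draft(proposal))
        # 2) The target checks every proposed position. A real system does
        #    this in one batched forward pass, so it is counted once here.
        target_passes += 1
        for i in range(len(tokens), len(proposal)):
            expected = target(proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the accepted prefix plus the
                # target's own token, and discard the rest of the draft.
                tokens = proposal[:i] + [expected]
                break
        else:
            # Every draft token was accepted; the target adds one more.
            tokens = proposal + [target(proposal)]
    return tokens[:goal], target_passes


prompt = [1, 2, 3]
baseline = autoregressive(target_model, prompt, 12)
spec_out, passes = speculative(target_model, draft_model, prompt, 12)
print(spec_out == baseline, passes)
```

In this greedy variant the speculative output matches the target model's autoregressive output token for token, while the number of target passes is smaller than the number of generated tokens whenever the draft model guesses correctly often enough. The latency benefit in practice depends on how cheap the draft model is relative to the target and on the acceptance rate.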
- Publication:
- arXiv e-prints
- Pub Date:
- November 2023
- DOI:
- 10.48550/arXiv.2311.04951
- arXiv:
- arXiv:2311.04951
- Bibcode:
- 2023arXiv231104951B
- Keywords:
- Computer Science - Machine Learning;
- Computer Science - Artificial Intelligence;
- Computer Science - Performance
- E-Print:
- Code available at https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/speculative-sampling