Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
Abstract
This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large language models. Traditional serverless approaches struggle with high latency due to the size of LLM checkpoints and the overhead of initializing GPU resources. ServerlessLLM introduces a multitier checkpoint loading system, leveraging underutilized GPU memory and storage to reduce startup times by 6--8x compared to existing methods. It also proposes live inference migration and a startup-time-optimized model scheduler, ensuring efficient resource allocation and minimizing delays. This system significantly improves performance and scalability in serverless environments for LLM workloads. Besides ServerlessLLM, several other methods from recent research literature, including Rainbowcake, are reviewed in this paper. Further discussions explore how FaaS providers tackle cold starts and the possible future scopes.
- Publication:
-
arXiv e-prints
- Pub Date:
- November 2024
- arXiv:
- arXiv:2411.15664
- Bibcode:
- 2024arXiv241115664G
- Keywords:
-
- Computer Science - Distributed, Parallel, and Cluster Computing;
- Computer Science - Machine Learning
- E-Print:
- 12 pages, 7 figures, TUM Cloud Computing Seminar