Teola: Towards End-to-End Optimization of LLM-based Applications
Abstract
Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.
- Publication:
-
arXiv e-prints
- Pub Date:
- June 2024
- DOI:
- 10.48550/arXiv.2407.00326
- arXiv:
- arXiv:2407.00326
- Bibcode:
- 2024arXiv240700326T
- Keywords:
-
- Computer Science - Distributed;
- Parallel;
- and Cluster Computing;
- Computer Science - Artificial Intelligence;
- Computer Science - Networking and Internet Architecture