Teola: Towards End-to-End Optimization of LLM-based Applications

doi:10.48550/arXiv.2407.00326

Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.
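The benefit of primitive-level orchestration can be illustrated with a small sketch. It is not Teola's API or implementation: the primitive names, the dependency edges, and the latency figures below are all hypothetical, chosen only to show how a primitive-level dataflow graph exposes parallelism that module-by-module execution hides. The sketch compares running every primitive back to back with the critical-path latency of the graph, using Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical primitives with made-up latencies in milliseconds.
# These numbers are illustrative assumptions, not measurements from the paper.
LATENCY = {
    "embed_query": 10,
    "vector_search": 20,
    "llm_prefill_instruction": 30,
    "llm_prefill_context": 40,
    "llm_decode": 100,
}

# Hypothetical primitive-level dataflow graph: node -> set of predecessors.
# The instruction part of the prompt does not depend on retrieval results,
# so its prefill can overlap with embedding and search.
DEPS = {
    "embed_query": set(),
    "vector_search": {"embed_query"},
    "llm_prefill_instruction": set(),
    "llm_prefill_context": {"vector_search", "llm_prefill_instruction"},
    "llm_decode": {"llm_prefill_context"},
}


def sequential_latency(latency):
    """Latency when primitives run strictly one after another."""
    return sum(latency.values())


def critical_path_latency(deps, latency):
    """Latency when every primitive starts as soon as its inputs are ready.

    Assumes unlimited parallel resources, which is a simplification.
    """
    finish = {}
    for node in TopologicalSorter(deps).static_order():
        start = max((finish[p] for p in deps[node]), default=0)
        finish[node] = start + latency[node]
    return max(finish.values())


if __name__ == "__main__":
    print("sequential:", sequential_latency(LATENCY))              # 200
    print("critical path:", critical_path_latency(DEPS, LATENCY))  # 170
```

With these assumed numbers, overlapping the instruction prefill with retrieval removes 30 ms from the end-to-end path. A real scheduler must also account for resource contention and batching, which this sketch ignores.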

Publication:

arXiv e-prints

Pub Date:

June 2024

DOI:

10.48550/arXiv.2407.00326

arXiv:

arXiv:2407.00326

Bibcode:

2024arXiv240700326T

Keywords:

Computer Science - Distributed, Parallel, and Cluster Computing;
Computer Science - Artificial Intelligence;
Computer Science - Networking and Internet Architecture
