ChaosEater: Fully Automating Chaos Engineering with Large Language Models
Abstract
Chaos Engineering (CE) is an engineering technique aimed at improving the resiliency of distributed systems. It involves artificially injecting specific failures into a distributed system and observing its behavior in response. Based on the observation, the system can be proactively improved to handle those failures. Recent CE tools realize the automated execution of predefined CE experiments. However, defining these experiments and reconfiguring the system after the experiments still remain manual. To reduce the costs of the manual operations, we propose \textsc{ChaosEater}, a \textit{system} for automating the entire CE operations with Large Language Models (LLMs). It pre-defines the general flow according to the systematic CE cycle and assigns subdivided operations within the flow to LLMs. We assume systems based on Infrastructure as Code (IaC), wherein the system configurations and artificial failures are managed through code. Hence, the LLMs' operations in our \textit{system} correspond to software engineering tasks, including requirement definition, code generation and debugging, and testing. We validate our \textit{system} through case studies on both small and large systems. The results demonstrate that our \textit{system} significantly reduces both time and monetary costs while completing reasonable single CE cycles.
- Publication:
-
arXiv e-prints
- Pub Date:
- January 2025
- arXiv:
- arXiv:2501.11107
- Bibcode:
- 2025arXiv250111107K
- Keywords:
-
- Computer Science - Software Engineering;
- Computer Science - Artificial Intelligence;
- Computer Science - Computation and Language;
- Computer Science - Distributed, Parallel, and Cluster Computing;
- Computer Science - Networking and Internet Architecture
- E-Print:
- 138 pages (12 main), 10 figures. Project page: https://ntt-dkiku.github.io/chaos-eater