Rethinking cloud abstractions for tenant-provider cooperative optimization of AI workloads
Abstract
AI workloads, often hosted in multi-tenant cloud environments, require vast computational resources but suffer inefficiencies due to limited tenant-provider coordination. Tenants lack infrastructure insights, while providers lack workload details to optimize tasks like partitioning, scheduling, and fault tolerance. We propose the HarmonAIze project to redefine cloud abstractions, enabling cooperative optimization for improved performance, efficiency, resiliency, and sustainability. This paper outlines key opportunities, challenges, and a research agenda to realize this vision.
- Publication: arXiv e-prints
- Pub Date: January 2025
- arXiv: arXiv:2501.09562
- Bibcode: 2025arXiv250109562C
- Keywords: Computer Science - Distributed, Parallel, and Cluster Computing