TraceMesh: Scalable and Streaming Sampling for Distributed Traces
Abstract
Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy, which inevitably captures overlapping and redundant information. More advanced methods employ learning-based approaches to bias the sampling toward more informative traces. However, existing methods fall short of considering the high-dimensional and dynamic nature of trace data, which is essential for the production deployment of trace sampling. To address these practical challenges, in this paper we present TraceMesh, a scalable and streaming sampler for distributed traces. TraceMesh employs Locality-Sensitivity Hashing (LSH) to improve sampling efficiency by projecting traces into a low-dimensional space while preserving their similarity. In this process, TraceMesh accommodates previously unseen trace features in a unified and streamlined way. Subsequently, TraceMesh samples traces through evolving clustering, which dynamically adjusts the sampling decision to avoid over-sampling of recurring traces. The proposed method is evaluated with trace data collected from both open-source microservice benchmarks and production service systems. Experimental results demonstrate that TraceMesh outperforms state-of-the-art methods by a significant margin in both sampling accuracy and efficiency.
- Publication:
-
arXiv e-prints
- Pub Date:
- June 2024
- DOI:
- 10.48550/arXiv.2406.06975
- arXiv:
- arXiv:2406.06975
- Bibcode:
- 2024arXiv240606975C
- Keywords:
-
- Computer Science - Distributed;
- Parallel;
- and Cluster Computing;
- Computer Science - Software Engineering
- E-Print:
- Accepted by The 2024 IEEE 17th International Conference on Cloud Computing (CLOUD)