A scalable architecture for online anomaly detection of WLCG batch jobs
Abstract
For data centres it is increasingly important to monitor network usage and to learn from network usage patterns. In particular, configuration issues or misbehaving batch jobs that prevent smooth operation need to be detected as early as possible. At the GridKa data and computing centre we therefore operate a tool, BPNetMon, for monitoring traffic data and characteristics of WLCG batch jobs and pilots locally on different worker nodes. On the one hand, local information alone is not sufficient to detect anomalies for several reasons, e.g. the underlying job distribution on a single worker node might change or there might be a local misconfiguration. On the other hand, a centralised anomaly detection approach does not scale with respect to network communication or computational cost. We therefore propose a scalable architecture based on concepts of a super-peer network.
- Publication: Journal of Physics Conference Series
- Pub Date: October 2016
- DOI: 10.1088/1742-6596/762/1/012002
- Bibcode: 2016JPhCS.762a2002K