Description
Building resilient data streams for large-scale experiments is a critical problem now that advanced computing techniques are more tightly integrated with data collection. Resilience-aware application solutions must include 1) policy management, in which science-level goals are presented to the system; 2) data movement telemetry, which captures system responses; and 3) systems-level predictive and adaptive anomaly mitigation, which integrates telemetry streams and provides actionable predictions to the policy manager. In this effort, we consider a previously developed workflow application in which data produced by a detector at the Advanced Photon Source (APS) is automatically picked up by the APS Data Management system, which uploads it to APS central storage, transfers it to an HPC storage system at the Argonne Leadership Computing Facility, and triggers a job on the Polaris supercomputer that reconstructs the scan. The goal is to apply models to simple anomaly prediction problems, for example by applying machine learning (ML) methods to event streams to predict network degradation or reductions in available compute resources, and to feed these predictions back to the policy component for application-level decisions. We can then link multiple such models (network, compute, applications, observation) into virtual infrastructure twins, i.e., predictive networks that approximate some aspect(s) of the real infrastructure and its applications. At the policy level, the system will then be able to generate alerts about imminent system congestion dynamics and fault-triggering conditions. As a tool, the system should provide interfaces that allow a simple declarative notation to specify the stream(s) to be monitored and the model method(s) to be applied.
This presentation will cover preliminary work in this effort: the raw data sources that can be integrated into telemetry streams for prediction, the selection and refinement of ML models for the task, and the quality of preliminary predictions from the system.
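
As an illustration only, a declarative notation for naming the stream(s) to monitor and the method(s) to apply might look roughly like the Python sketch below. Everything in it is an assumption made for the example: the stream names, the `SPEC` layout, and the `zscore` and `floor` methods are hypothetical, and simple statistical checks stand in for the ML models described in the abstract.

```python
from collections import deque
from statistics import mean, pstdev

# Hypothetical declarative spec: which stream to watch and which method to apply.
SPEC = [
    {"stream": "network.throughput_mbps", "method": "zscore", "window": 5, "limit": 3.0},
    {"stream": "compute.free_nodes", "method": "floor", "minimum": 4},
]


def make_zscore(window, limit):
    """Flag a value more than `limit` standard deviations from the rolling mean."""
    history = deque(maxlen=window)

    def check(value):
        flagged = False
        if len(history) == window:  # only judge once the window is full
            mu, sigma = mean(history), pstdev(history)
            flagged = sigma > 0 and abs(value - mu) / sigma > limit
        history.append(value)
        return flagged

    return check


def make_floor(minimum):
    """Flag a value that drops below a fixed minimum."""
    return lambda value: value < minimum


# Registry of method names that the spec may refer to (stand-ins for real models).
METHODS = {"zscore": make_zscore, "floor": make_floor}


def build_monitors(spec):
    """Turn the declarative spec into one checker per stream."""
    monitors = {}
    for entry in spec:
        params = {k: v for k, v in entry.items() if k not in ("stream", "method")}
        monitors[entry["stream"]] = METHODS[entry["method"]](**params)
    return monitors


def alerts(spec, events):
    """Return the (stream, value) events that a monitor flags as anomalous."""
    monitors = build_monitors(spec)
    return [(s, v) for s, v in events if s in monitors and monitors[s](v)]
```

In this sketch, the list returned by `alerts` plays the role of the predictions that would be fed back to the policy component. Events on streams that the spec does not name are ignored.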