Sep 23 – 27, 2024
ESRF Auditorium
Europe/Paris timezone

Tools for predicting and responding to anomalies in experiment data streams

Sep 26, 2024, 4:55 PM
15m
Hybrid event (ESRF Auditorium)

EPN Campus ESRF - ILL 71 Av. des Martyrs, 38000 Grenoble
Talk Workflow engines

Speaker

Justin Wozniak (Argonne National Laboratory)

Description

Building resilient data streams for large-scale experiments is a critical problem in modern settings, in which advanced computing techniques are more tightly integrated with data collection activities. Resilience-aware application solutions must include 1) policy management, in which science-level goals are presented to the system; 2) data movement telemetry, which captures system responses; and 3) systems-level predictive and adaptive anomaly mitigation, which integrates telemetry streams and provides actionable predictions to the policy manager. In this effort, we consider a previously developed workflow application in which Advanced Photon Source (APS) data is produced by the detector and automatically picked up by the APS Data Management system, which uploads it to APS central storage, transfers it to an HPC storage system at the Argonne Leadership Computing Facility, and triggers a job on the Polaris supercomputer that reconstructs the scan. The goal is to apply models to simple anomaly prediction problems: for example, applying machine learning (ML) methods to event streams to predict network degradation or reductions in available compute resources, and feeding these predictions back to the policy component for application-level decisions. We can then link multiple such models (network, compute, applications, observation) into virtual infrastructure twins, i.e., predictive networks that approximate some aspect(s) of the real infrastructure and its applications. At the policy level, the system will then be able to generate alerts about imminent system congestion dynamics and fault-triggering conditions. As a tool, the system should integrate interfaces that allow a simple declarative notation to specify the stream(s) to be monitored and the model method(s) to be applied.
This presentation will cover preliminary work in this effort: the raw data sources that can be integrated into telemetry streams for prediction, the selection and refinement of ML models for the task, and the quality of preliminary predictions from the system.
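
As a rough illustration of the declarative idea described above, the following Python sketch pairs a small specification (which stream to monitor, which method to apply) with a detector looked up by name. Everything in it is an assumption made for illustration: the spec keys, the stream name, the `monitor` helper, and the synthetic throughput values are hypothetical and are not the project's notation or data. A simple rolling z-score stands in for the ML models, which the abstract does not specify.

```python
from collections import deque
from statistics import mean, stdev

# Hypothetical declarative spec: which stream to watch and which method to apply.
# The keys and the method name are illustrative assumptions, not the project's notation.
SPEC = {
    "stream": "transfer_throughput_mbps",
    "method": "rolling_zscore",
    "window": 5,
    "threshold": 3.0,
}


def rolling_zscore_alerts(values, window, threshold):
    """Return indices of samples that deviate from the preceding window
    by more than `threshold` sample standard deviations."""
    history = deque(maxlen=window)
    alerts = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A flat window (sigma == 0) gives no basis for a z-score; skip it.
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                alerts.append(i)
        history.append(v)
    return alerts


# Registry mapping method names used in a spec to detector functions.
METHODS = {"rolling_zscore": rolling_zscore_alerts}


def monitor(spec, streams):
    """Look up the stream and method named in the spec and run the detector."""
    detect = METHODS[spec["method"]]
    return detect(streams[spec["stream"]], spec["window"], spec["threshold"])


# Synthetic samples, made up for illustration; the drop at index 6 mimics degradation.
streams = {"transfer_throughput_mbps": [100, 102, 98, 101, 99, 100, 40, 100]}
alerts = monitor(SPEC, streams)
print(alerts)
```

In a setup like this, the returned indices would be what gets passed back to a policy component; swapping the detector for a trained model would only require registering another entry in `METHODS`.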

Primary authors

Justin Wozniak (Argonne National Laboratory)
Nicholas Schwarz (Argonne National Laboratory)
Michael Prince (Argonne National Laboratory)
Dr Tong Shu (University of North Texas)
Dr Bogdan Nicolae (Argonne National Laboratory)