NOBUGS 2024

Name: NOBUGS 2024
Start: 2024-09-23T08:30:00+02:00
End: 2024-09-27T18:00:00+02:00
Location: ESRF Auditorium

Sep 23 – 27, 2024

Europe/Paris timezone

Contact

nobugs2024@esrf.fr

Tools for predicting and responding to anomalies in experiment data streams

Sep 26, 2024, 4:55 PM

15m

Hybrid event (ESRF Auditorium)

EPN Campus ESRF - ILL 71 Av. des Martyrs, 38000 Grenoble

Talk: Workflow engines

Justin Wozniak (Argonne National Laboratory)

Building resilient data streams for large-scale experiments is a critical problem in modern settings, where advanced computing techniques are increasingly tightly integrated with data collection activities. Resilience-aware application solutions must include 1) policy management, in which science-level goals are presented to the system; 2) data movement telemetry, which captures system responses; and 3) systems-level predictive and adaptive anomaly mitigation, which integrates telemetry streams and provides actionable predictions to the policy manager. In this effort, we consider a previously developed workflow application in which data produced by a detector at the Advanced Photon Source (APS) is automatically picked up by the APS Data Management system, which uploads it to APS central storage, transfers it to an HPC storage system at the Argonne Leadership Computing Facility, and triggers a job on the Polaris supercomputer that reconstructs the scan. The goal is to apply models to simple anomaly prediction problems, for example by applying machine learning (ML) methods to event streams to predict network degradation or reductions in available compute resources, and to feed these predictions back to the policy component for application-level decisions. We can then link multiple such models (network, compute, applications, observation) into virtual infrastructure twins, i.e., predictive networks that approximate some aspect(s) of the real infrastructure and its applications. At the policy level, the system will then be able to generate alerts about imminent system congestion dynamics and fault-triggering conditions. As a tool, we want the system to offer interfaces that allow a simple declarative notation to specify the stream(s) to be monitored and the model method(s) to be applied.
This presentation will cover preliminary work in this effort and describe the raw data sources that can be integrated into telemetry streams for prediction, the selection and refinement of ML models for the task, and the quality of preliminary predictions from the system.
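
To make the idea of a declarative notation concrete, the following is a minimal sketch, not the authors' system. It assumes a hypothetical specification format (`SPEC`), a hypothetical stream name (`transfer_throughput_mbps`), and a rolling z-score detector used as a simple stand-in for the ML models mentioned in the abstract. Alerts are passed to a callback that plays the role of the policy component.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, Deque, Dict, Iterable, Tuple

# Hypothetical declarative spec: stream name -> model name and parameters.
# Neither the format nor the stream name comes from the abstract.
SPEC = {
    "transfer_throughput_mbps": {
        "model": "rolling_zscore",
        "window": 5,
        "threshold": 3.0,
    },
}


@dataclass
class Alert:
    stream: str
    index: int
    value: float
    score: float


class RollingZScore:
    """Scores each value against the recent window (stand-in for an ML model)."""

    def __init__(self, window: int, threshold: float) -> None:
        self.window = window
        self.threshold = threshold
        self.history: Deque[float] = deque(maxlen=window)

    def score(self, value: float) -> float:
        # Score against the window as it was before this value arrived.
        if len(self.history) < self.window:
            z = 0.0  # not enough history yet
        else:
            sigma = stdev(self.history)
            z = 0.0 if sigma == 0 else abs(value - mean(self.history)) / sigma
        self.history.append(value)
        return z


# Registry mapping the model names used in the spec to implementations.
MODELS = {"rolling_zscore": RollingZScore}


def build_monitors(spec: Dict[str, dict]) -> Dict[str, RollingZScore]:
    """Instantiate one model per monitored stream from the declarative spec."""
    return {
        stream: MODELS[cfg["model"]](cfg["window"], cfg["threshold"])
        for stream, cfg in spec.items()
    }


def monitor(
    spec: Dict[str, dict],
    events: Iterable[Tuple[str, float]],
    on_alert: Callable[[Alert], None],
) -> None:
    """Feed (stream, value) telemetry events to the models; report anomalies."""
    monitors = build_monitors(spec)
    for index, (stream, value) in enumerate(events):
        model = monitors.get(stream)
        if model is None:
            continue  # stream is not listed in the spec
        z = model.score(value)
        if z > model.threshold:
            on_alert(Alert(stream, index, value, z))
```

In this sketch, a policy component would supply `on_alert`; for instance, passing `alerts.append` collects alerts in a list for later application-level decisions. Swapping the detector for a trained model would only require registering another entry in `MODELS`.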

Justin Wozniak (Argonne National Laboratory), Nicholas Schwarz (Argonne National Laboratory), Michael Prince (Argonne National Laboratory), Dr Tong Shu (University of North Texas), Dr Bogdan Nicolae (Argonne National Laboratory)

diaspora.pptx

Video of Talk - 55 - Workflow Engines Justin Wozniak
