Sep 23 – 27, 2024
ESRF Auditorium
Europe/Paris timezone

Streamlining Scientific Discovery with Data Pipelines at the Advanced Photon Source

Sep 26, 2024, 4:40 PM
15m
Hybrid event (ESRF Auditorium)

EPN Campus ESRF - ILL 71 Av. des Martyrs, 38000 Grenoble
Talk Workflow engines

Speaker

Hannah Parraga

Description

Data are essential to the scientific discoveries enabled by experiments performed at the Advanced Photon Source (APS). When the facility resumes operation this year, it will generate an estimated 100 PB of raw experimental data per year from its seventy-two operating beamlines, which house over 100 unique instruments. These data are generated by more than 6,000 experiments performed by over 5,500 facility users each year. The volume of data generated at the APS will increase because of the newly upgraded storage ring and beamline advances, such as new measurement techniques, technological advances in detectors and instrumentation, multi-modal instruments that can acquire several measurements in a single experiment, and advanced data processing algorithms. This trend is expected to continue.
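To put the estimate above in perspective, the short calculation below converts 100 PB per year into a sustained average rate and an even per-beamline share. It is only a back-of-the-envelope sketch: it assumes decimal (SI) petabytes, a 365-day year, and continuous operation with data spread evenly across the seventy-two beamlines, none of which the abstract states. Real rates will be burstier and will differ widely between beamlines.

```python
# Back-of-the-envelope conversion of the estimated annual raw data volume.
# Assumptions (not from the abstract): SI petabytes, 365-day year,
# continuous operation, data spread evenly across beamlines.
PB = 10**15                       # bytes in one decimal petabyte
GB = 10**9                        # bytes in one decimal gigabyte
SECONDS_PER_YEAR = 365 * 24 * 3600

annual_bytes = 100 * PB           # estimated raw data per year
beamlines = 72                    # operating beamlines

avg_rate_gb_s = annual_bytes / SECONDS_PER_YEAR / GB
per_beamline_pb = annual_bytes / beamlines / PB

print(f"{avg_rate_gb_s:.2f} GB/s sustained average")        # 3.17 GB/s
print(f"{per_beamline_pb:.2f} PB per beamline per year")    # 1.39 PB
```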

As a scientific user facility, the APS presents several unique data management challenges. Each beamline is different and can have multiple experiment techniques, detector types, data rates, data formats, operating systems, and processing workflows. The users also vary. They come from different companies, universities, and research institutions, but all must be able to access their data after leaving the lab. They may want their data immediately or several years after it is created. They may be conducting experiments independently and remotely, or in person with close involvement by beamline staff. Beamline staff have different levels of technical experience. Some desire a hands-off approach to data management, while others want the flexibility to program their own custom tools. They use computers with a variety of operating systems. The APS must have a data management solution that works for each of these unique beamlines.

To address these challenges, workflows that automate the data lifecycle from detector to data portal have been deployed at several beamlines. These workflows integrate the Bluesky controls software, the APS Data Management System, and Globus services to provide infrastructure that is agnostic to analysis technique, compute resource, and storage location. The reusability and flexibility of the configuration-driven approach allow new analysis techniques to be supported with minimal development effort. Common analysis packages used across multiple beamlines are supported for X-ray techniques such as ptychography, crystallography, X-ray fluorescence microscopy, tomography, X-ray photon correlation spectroscopy, far-field high-energy diffraction microscopy, and Bragg coherent diffractive imaging. Access to raw data, metadata, and analysis results is restricted to the users of a given experiment. Data portals let users reprocess datasets with different parameters.
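The configuration-driven idea described above can be illustrated with a minimal sketch in which one generic runner executes whatever steps a per-technique configuration lists. Everything here is hypothetical and for illustration only: the step names, the configuration keys, and the values such as "cluster-a" and "portal-a" are invented, and none of the functions correspond to the Bluesky, APS Data Management System, or Globus APIs. The real steps would call those services instead of returning placeholder values.

```python
from typing import Any, Callable, Dict

Context = Dict[str, Any]
Step = Callable[[Context], Context]

# Hypothetical registry mapping step names (as used in configs) to functions.
STEP_REGISTRY: Dict[str, Step] = {}


def register(name: str) -> Callable[[Step], Step]:
    """Decorator that makes a step available to configurations by name."""
    def decorator(func: Step) -> Step:
        STEP_REGISTRY[name] = func
        return func
    return decorator


@register("transfer")
def transfer(ctx: Context) -> Context:
    # Placeholder: a real step would move files to the compute resource.
    return {**ctx, "location": ctx["config"]["compute_resource"]}


@register("analyze")
def analyze(ctx: Context) -> Context:
    # Placeholder: a real step would launch the technique's analysis package.
    return {**ctx, "result": f"{ctx['config']['technique']}:{ctx['dataset']}"}


@register("publish")
def publish(ctx: Context) -> Context:
    # Placeholder: a real step would ingest results into a data portal.
    return {**ctx, "published_to": ctx["config"]["portal"]}


def run_workflow(config: Dict[str, Any], dataset: str) -> Context:
    """Run the steps named in the config, in order, on one dataset."""
    ctx: Context = {"config": config, "dataset": dataset}
    for name in config["steps"]:
        ctx = STEP_REGISTRY[name](ctx)
    return ctx


# Hypothetical configuration for one technique.
xpcs_config = {
    "technique": "xpcs",
    "compute_resource": "cluster-a",
    "portal": "portal-a",
    "steps": ["transfer", "analyze", "publish"],
}

out = run_workflow(xpcs_config, "scan_0001")
print(out["result"])  # xpcs:scan_0001

# Supporting another technique needs only a new config, not a new runner.
tomo_config = {**xpcs_config, "technique": "tomography"}
tomo_out = run_workflow(tomo_config, "scan_0002")
```

In this sketch the runner never inspects the technique, compute resource, or portal itself; it only passes the configuration along, which is one simple way to keep the infrastructure agnostic to those choices.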

Although these data pipelines are meeting many of the challenges, additional features are under development to provide users with an exceptional data management experience. Additional analysis packages are being implemented for coherent surface scattering imaging, crystallography, and combined ptychography plus X-ray fluorescence microscopy. Furthermore, additional beamlines are being onboarded to make use of the pipelines that have already been developed. Looking to the future, the existing data portal reprocessing features will be expanded to include data management tasks from any stage in the data lifecycle.

*Work supported by U.S. Department of Energy, Office of Science, under Contract No. DE-AC02-06CH11357.
