Authors and References
The AI-based Scalable Analytics for Enhancing Performance, Resilience, and Security of HPC Systems is a collaborative project in partnership with Sandia National Laboratories, led by Prof. Ayse K. Coskun of Boston University. This initiative aims to use machine learning techniques to diagnose anomalies in High-Performance Computing (HPC) systems, thereby improving their performance, resilience, and security.
- Burak Aksar, Efe Sencan, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Brian Kulis, Manuel Egele, and Ayse K. Coskun. Prodigy: Towards Unsupervised Anomaly Detection in Production HPC Systems. To appear in The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC 2023), Nov. 2023. PDF
- Ozan Tuncer, Emre Ates, Yijia Zhang, Ata Turk, Jim Brandt, Vitus J. Leung, Manuel Egele, and Ayse K. Coskun. Diagnosing Performance Variations in HPC Applications Using Machine Learning. In International Supercomputing Conference (ISC-HPC), pp. 355-373, Jun. 2017. PDF
- Ozan Tuncer, Emre Ates, Yijia Zhang, Ata Turk, Jim Brandt, Vitus J. Leung, Manuel Egele, and Ayse K. Coskun. Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning. In IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 30, no. 4, pp. 883-896, April 2019. PDF
- Burak Aksar, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Manuel Egele, and Ayse K. Coskun. E2EWatch: An End-to-end Anomaly Diagnosis Framework for Production HPC Systems. In International European Conference on Parallel and Distributed Computing (Euro-Par), August 2021. PDF Github
- Emre Ates, Yijia Zhang, Burak Aksar, Jim Brandt, Vitus J. Leung, Manuel Egele, and Ayse K. Coskun. HPAS: An HPC Performance Anomaly Suite for Reproducing Performance Variations. In International Conference on Parallel Processing (ICPP 2019), pp. 1-10, Aug. 2019. PDF Github
- Burak Aksar, Yijia Zhang, Emre Ates, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Manuel Egele, and Ayse K. Coskun. Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems. In International Supercomputing Conference (ISC-HPC), June 2021. PDF Github
- Burak Aksar, Efe Sencan, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Brian Kulis, and Ayse K. Coskun. ALBADross: Active Learning Based Anomaly Diagnosis for Production HPC Systems. In IEEE International Conference on Cluster Computing (Cluster), July 2022. PDF Github
The web-based framework for anomaly detection was developed by:
Yin-Ching (William) Lee, Burak Aksar, Efe Sencan, and Professor Ayse K. Coskun
For questions and feedback, contact Efe Sencan (esencan@bu.edu).
Project Team From Boston University
Professor Ayse Coskun
Professor Manuel Egele
Professor Brian Kulis
Burak Aksar
Efe Sencan
Yin-Ching (William) Lee
Project Team From Sandia National Laboratories
Jim Brandt
Vitus J. Leung
Benjamin Schwaller