Authors and References

AI-based Scalable Analytics for Enhancing Performance, Resilience, and Security of HPC Systems is a collaborative project with Sandia National Laboratories, led by Prof. Ayse K. Coskun of Boston University. The project applies machine learning techniques to diagnose anomalies in High-Performance Computing (HPC) systems, with the goal of improving their performance, resilience, and security.

  1. Burak Aksar, Efe Sencan, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Brian Kulis, Manuel Egele, and Ayse K. Coskun. Prodigy: Towards Unsupervised Anomaly Detection in Production HPC Systems. To appear in the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC 2023), Nov. 2023.
  2. Ozan Tuncer, Emre Ates, Yijia Zhang, Ata Turk, Jim Brandt, Vitus Leung, Manuel Egele, and Ayse K. Coskun. Diagnosing Performance Variations in HPC Applications using Machine Learning. In International Supercomputing Conference (ISC-HPC), pp. 355-373, Jun. 2017.
  3. Ozan Tuncer, Emre Ates, Yijia Zhang, Ata Turk, Jim Brandt, Vitus J. Leung, Manuel Egele, and Ayse K. Coskun. Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning. In IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 30, no. 4, pp. 883-896, April 2019.
  4. Burak Aksar, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Manuel Egele, and Ayse K. Coskun. E2EWatch: An End-to-end Anomaly Diagnosis Framework for Production HPC Systems. In International European Conference on Parallel and Distributed Computing (Euro-Par), August 2021.
  5. Emre Ates, Yijia Zhang, Burak Aksar, Jim Brandt, Vitus J. Leung, Manuel Egele, and Ayse K. Coskun. HPAS: An HPC Performance Anomaly Suite for Reproducing Performance Variations. In International Conference on Parallel Processing (ICPP 2019), pp. 1-10, Aug. 2019.
  6. Burak Aksar, Yijia Zhang, Emre Ates, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Manuel Egele, and Ayse K. Coskun. Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems. In International Supercomputing Conference (ISC-HPC), June 2021.
  7. Burak Aksar, Efe Sencan, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Brian Kulis, and Ayse K. Coskun. ALBADross: Active Learning Based Anomaly Diagnosis for Production HPC Systems. In IEEE International Conference on Cluster Computing (Cluster), July 2022.
The web-based framework for anomaly detection was developed by:
Yin-Ching (William) Lee, Burak Aksar, Efe Sencan, and Professor Ayse K. Coskun
For questions and feedback, contact Efe Sencan (esencan@bu.edu).

Project Team From Boston University

Professor Ayse Coskun

Professor Manuel Egele

Professor Brian Kulis

Burak Aksar

Efe Sencan

Yin-Ching (William) Lee

Project Team From Sandia National Laboratories

Jim Brandt

Vitus J. Leung

Benjamin Schwaller
