Instructions:
A user has two options for interacting with the web application:
- Sample telemetry data:
This option is suggested if you want to see the framework in action quickly. The sample data includes
healthy and anomalous telemetry data collected from different compute nodes, allowing you to investigate
the anomaly diagnosis results.
- The sample dataset was collected from Volta, a Sandia National Laboratories-based Cray XC30m testbed
supercomputer. Volta comprises 52 computing nodes organized into 13 connected switches, with each
switch containing four nodes. Each node features 64GB of memory and two sockets, each equipped with
an Intel Xeon E5-2695 v2 CPU featuring 12 two-way hyper-threaded cores.
- There are 10 unique job ids and 20 unique compute node ids in the sample dataset. Each job id is
  associated with 2 unique compute node ids.
- The sample dataset contains 6 healthy application runs and 14 anomalous application runs.
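The layout of the sample dataset can be sketched with a small synthetic table. This is only an illustration of the counts described above; the column and label names (`job_id`, `node_id`, `label`) are assumptions, not the actual sample files.

```python
import pandas as pd

# Illustrative per-run summary mirroring the sample dataset's shape:
# 10 unique job ids, each associated with 2 unique node ids (20 node
# ids total), and one label per (job_id, node_id) application run.
runs = pd.DataFrame({
    "job_id": [j for j in range(10) for _ in range(2)],
    "node_id": [f"node{n:02d}" for n in range(20)],
    "label": ["healthy"] * 6 + ["anomalous"] * 14,
})

# sanity-check the counts stated above
assert runs["job_id"].nunique() == 10
assert runs["node_id"].nunique() == 20
assert runs.groupby("job_id")["node_id"].nunique().eq(2).all()
print(runs["label"].value_counts().to_dict())  # {'anomalous': 14, 'healthy': 6}
```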
- Upload your own data:
Based on the uploaded data, the framework performs the necessary data transformation steps and
generates the anomaly diagnosis results. Currently, we support telemetry data collected via the Lightweight Distributed
Metric Service (LDMS).
Input Data Format:
We provide an example CSV file for the input data format. The first three columns are metadata columns: job_id,
node_id, and timestamp. Generally, a scheduler assigns a unique job_id to each submitted job, while node_ids are
reused across jobs. Since LDMS collects telemetry data from each compute node, and our framework provides anomaly
diagnosis results for each compute node, we use the combination of job_id and node_id as a unique identifier.
The remaining column names are metrics collected via LDMS, such as MemTotal::meminfo, processes::procstat, and
compact_pagemigrate_failed::vmstat. The first part of the column name is the metric name, and the second part
indicates the subsystem from which it is collected.
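The input format above can be sketched in a few lines. The metric column names are taken from the examples in this section; the values and ids are made up for illustration.

```python
import io
import pandas as pd

# Minimal sketch of the expected input layout: three metadata columns
# followed by metric columns named metric::subsystem (values invented).
csv_text = """job_id,node_id,timestamp,MemTotal::meminfo,processes::procstat
101,n01,2020-01-01 00:00:00,65842708,512
101,n02,2020-01-01 00:00:00,65842708,498
102,n01,2020-01-01 00:05:00,65842708,505
"""
df = pd.read_csv(io.StringIO(csv_text))

# node_ids repeat across jobs (n01 appears under jobs 101 and 102),
# so a run is identified by the (job_id, node_id) combination
df["run_id"] = df["job_id"].astype(str) + "-" + df["node_id"]
print(df["run_id"].tolist())  # ['101-n01', '101-n02', '102-n01']

# a column name splits into the metric name and its subsystem
metric, subsystem = "MemTotal::meminfo".split("::")
print(metric, subsystem)  # MemTotal meminfo
```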
Output:
The first section provides a comprehensive overview of the results. It includes the number of unique job_ids and
the percentage breakdown of each detected anomaly type in the uploaded telemetry dataset. Additionally, we
present the top 5 most influential features identified by the trained model.
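The overview statistics described above can be computed along these lines. The anomaly type names and feature-importance values below are hypothetical placeholders, not output from the actual trained model.

```python
import pandas as pd

# Hypothetical diagnosis output: one predicted label per run,
# plus a feature-importance vector from a trained model.
diagnosis = pd.Series(
    ["healthy"] * 6 + ["memleak"] * 8 + ["cachecopy"] * 6,
    name="anomaly_type",
)
# percentage breakdown of each detected anomaly type
breakdown = diagnosis.value_counts(normalize=True).mul(100).round(1)

importances = pd.Series({
    "MemTotal::meminfo": 0.31, "processes::procstat": 0.22,
    "nr_dirty::vmstat": 0.18, "MemFree::meminfo": 0.12,
    "idle::procstat": 0.09, "user::procstat": 0.05,
})
# top 5 most influential features by importance
top5 = importances.nlargest(5)
print(breakdown.to_dict())
print(list(top5.index))
```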
Following the overview, users can drill into specific details by selecting a combination of job_id and node_id.
This allows them to explore the results in greater depth, such as examining the ratio of each anomaly type within
the entire dataset. Users can also view the top 5 most significant features as determined by the trained model.
Moreover, a side-by-side comparison of a metric is available. On the left, the diagram displays the metric for the
selected node id, providing a focused view. On the right, the diagram displays the same metric for healthy node
data from the selected application type, enabling a direct comparison.
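The data behind such a comparison can be sketched as follows: extract the selected run's time series for one metric, and aggregate the same metric over healthy runs of the same application. All run ids, values, and the choice of a mean aggregate are illustrative assumptions.

```python
import pandas as pd

# Toy time series: one selected run "a" and two healthy runs "h1", "h2",
# each observed at four timestamps for a single metric.
ts = pd.DataFrame({
    "run_id": ["a"] * 4 + ["h1"] * 4 + ["h2"] * 4,
    "t": list(range(4)) * 3,
    "MemFree::meminfo": [50, 40, 30, 20, 55, 54, 56, 55, 53, 52, 54, 53],
})

# left panel: the selected run's metric
selected = ts[ts["run_id"] == "a"].set_index("t")["MemFree::meminfo"]
# right panel: the healthy baseline (here, the mean across healthy runs)
healthy = (ts[ts["run_id"].isin(["h1", "h2"])]
           .groupby("t")["MemFree::meminfo"].mean())

print(selected.tolist())  # [50, 40, 30, 20]
print(healthy.tolist())   # [54.0, 53.0, 55.0, 54.0]
```

A steadily dropping metric in the selected run against a flat healthy baseline, as in this toy data, is the kind of contrast the side-by-side view is meant to surface.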