Instructions:
A user has two options for interacting with the web application:
- Sample telemetry data:
This option is suggested if you want to see the framework in action quickly. The sample data includes
healthy and anomalous telemetry data collected from different compute nodes, allowing you to investigate
the anomaly diagnosis results.
- The sample dataset was collected from Volta, a Sandia National Laboratories-based Cray XC30m testbed
supercomputer. Volta comprises 52 computing nodes organized into 13 connected switches, with each
switch containing four nodes. Each node features 64GB of memory and two sockets, each equipped with
an Intel Xeon E5-2695 v2 CPU featuring 12 two-way hyper-threaded cores.
- There are 10 unique job ids and 20 unique compute node ids in the sample dataset. Each job id is
  associated with 2 unique compute node ids.
- The sample dataset contains 6 healthy application runs and 14 anomalous application runs.
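The layout of the sample dataset can be sketched with a small synthetic table. This is only an illustration of the counts described above; the column and label names (`job_id`, `node_id`, `label`) are assumptions, not the actual sample files.

```python
import pandas as pd

# Illustrative per-run summary mirroring the sample dataset's shape:
# 10 unique job ids, each associated with 2 unique node ids (20 node
# ids total), and one label per (job_id, node_id) application run.
runs = pd.DataFrame({
    "job_id": [j for j in range(10) for _ in range(2)],
    "node_id": [f"node{n:02d}" for n in range(20)],
    "label": ["healthy"] * 6 + ["anomalous"] * 14,
})

# sanity-check the counts stated above
assert runs["job_id"].nunique() == 10
assert runs["node_id"].nunique() == 20
assert runs.groupby("job_id")["node_id"].nunique().eq(2).all()
print(runs["label"].value_counts().to_dict())  # {'anomalous': 14, 'healthy': 6}
```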
- Upload your own data:
Based on the uploaded data, the framework performs the necessary data transformation steps and
generates the anomaly diagnosis results. Currently, we support telemetry data collected via the Lightweight Distributed
Metric Service (LDMS).
Input Data Format:
We provide an example CSV file for the input data format. The first three columns are metadata columns: job_id,
node_id, and timestamp. Generally, a scheduler assigns a unique job_id to each submitted job, while node_ids are
reused across jobs. Since LDMS collects telemetry data from each compute node, and our framework provides anomaly
diagnosis results for each compute node, we use the combination of job_id and node_id as a unique identifier.
The remaining column names are metrics collected via LDMS, such as MemTotal::meminfo, processes::procstat, and
compact_pagemigrate_failed::vmstat. The first part of the column name is the metric name, and the second part
indicates the subsystem from which it is collected.
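The input format above can be sketched in a few lines. The metric column names are taken from the examples in this section; the values and ids are made up for illustration.

```python
import io
import pandas as pd

# Minimal sketch of the expected input layout: three metadata columns
# followed by metric columns named metric::subsystem (values invented).
csv_text = """job_id,node_id,timestamp,MemTotal::meminfo,processes::procstat
101,n01,2020-01-01 00:00:00,65842708,512
101,n02,2020-01-01 00:00:00,65842708,498
102,n01,2020-01-01 00:05:00,65842708,505
"""
df = pd.read_csv(io.StringIO(csv_text))

# node_ids repeat across jobs (n01 appears under jobs 101 and 102),
# so a run is identified by the (job_id, node_id) combination
df["run_id"] = df["job_id"].astype(str) + "-" + df["node_id"]
print(df["run_id"].tolist())  # ['101-n01', '101-n02', '102-n01']

# a column name splits into the metric name and its subsystem
metric, subsystem = "MemTotal::meminfo".split("::")
print(metric, subsystem)  # MemTotal meminfo
```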
Output:
The first section provides a comprehensive overview of the results. It includes the number of unique job_ids and
the percentage breakdown of each detected anomaly type in the uploaded telemetry dataset. Additionally, we
present the top 5 most influential features identified by the trained model.
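The overview statistics described above can be computed along these lines. The anomaly type names and feature-importance values below are hypothetical placeholders, not output from the actual trained model.

```python
import pandas as pd

# Hypothetical diagnosis output: one predicted label per run,
# plus a feature-importance vector from a trained model.
diagnosis = pd.Series(
    ["healthy"] * 6 + ["memleak"] * 8 + ["cachecopy"] * 6,
    name="anomaly_type",
)
# percentage breakdown of each detected anomaly type
breakdown = diagnosis.value_counts(normalize=True).mul(100).round(1)

importances = pd.Series({
    "MemTotal::meminfo": 0.31, "processes::procstat": 0.22,
    "nr_dirty::vmstat": 0.18, "MemFree::meminfo": 0.12,
    "idle::procstat": 0.09, "user::procstat": 0.05,
})
# top 5 most influential features by importance
top5 = importances.nlargest(5)
print(breakdown.to_dict())
print(list(top5.index))
```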
Following the overview, users can drill into specific details by selecting a combination of job_id and node_id.
This allows them to explore the results in greater depth, such as examining the ratio of each anomaly type within
the entire dataset. Users can also view the top 5 most significant features as determined by the trained model.
Moreover, a side-by-side comparison of a metric is available. On the left, the diagram displays the metric for the
selected node id, providing a focused view. On the right, the diagram displays the same metric for healthy node
data from the selected application type, enabling a direct comparison.
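The data behind such a comparison can be sketched as follows: extract the selected run's time series for one metric, and aggregate the same metric over healthy runs of the same application. All run ids, values, and the choice of a mean aggregate are illustrative assumptions.

```python
import pandas as pd

# Toy time series: one selected run "a" and two healthy runs "h1", "h2",
# each observed at four timestamps for a single metric.
ts = pd.DataFrame({
    "run_id": ["a"] * 4 + ["h1"] * 4 + ["h2"] * 4,
    "t": list(range(4)) * 3,
    "MemFree::meminfo": [50, 40, 30, 20, 55, 54, 56, 55, 53, 52, 54, 53],
})

# left panel: the selected run's metric
selected = ts[ts["run_id"] == "a"].set_index("t")["MemFree::meminfo"]
# right panel: the healthy baseline (here, the mean across healthy runs)
healthy = (ts[ts["run_id"].isin(["h1", "h2"])]
           .groupby("t")["MemFree::meminfo"].mean())

print(selected.tolist())  # [50, 40, 30, 20]
print(healthy.tolist())   # [54.0, 53.0, 55.0, 54.0]
```

A steadily dropping metric in the selected run against a flat healthy baseline, as in this toy data, is the kind of contrast the side-by-side view is meant to surface.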