key: cord-0187745-uhc9uz0u authors: Versluis, Laurens; Cetin, Mehmet; Greeven, Caspar; Laursen, Kristian; Podareanu, Damian; Codreanu, Valeriu; Uta, Alexandru; Iosup, Alexandru title: A Holistic Analysis of Datacenter Operations: Resource Usage, Energy, and Workload Characterization -- Extended Technical Report date: 2021-07-25 journal: nan DOI: nan sha: efe021cac7a1af301357793364c5f5ce83891e27 doc_id: 187745 cord_uid: uhc9uz0u Improving datacenter operations is vital for the digital society. We posit that doing so requires our community to shift, from operational aspects taken in isolation to holistic analysis of datacenter resources, energy, and workloads. In turn, this shift will require new analysis methods, and open-access, FAIR datasets with fine temporal and spatial granularity. We leverage in this work one of the (rare) public datasets providing fine-grained information on datacenter operations. Using it, we show strong evidence that fine-grained information reveals new operational aspects. We then propose a method for holistic analysis of datacenter operations, providing statistical characterization of node, energy, and workload aspects. We demonstrate the benefits of our holistic analysis method by applying it to the operations of a datacenter infrastructure with over 300 nodes. Our analysis reveals both generic and ML-specific aspects, and further details how the operational behavior of the datacenter changed during the 2020 COVID-19 pandemic. We make over 30 main observations, providing holistic insight into the long-term operation of a large-scale, public scientific infrastructure. We suggest such observations can help immediately with performance engineering tasks such as predicting future datacenter load, and also long-term with the design of datacenter infrastructure. Datacenters have become the main computing infrastructure for the digital society [14] . Because datacenters are supporting increasingly more users and more sophisticated demands, their workloads are changing rapidly. Although our community has much data and knowledge about HPC and supercomputing workloads [18, 2] , we have relatively much less information on emerging workloads such as machine-learning, which seem to differ significantly from past workloads [35, 28] . We posit in this work that, to design the efficient datacenters of tomorrow, we need comprehensive yet low-level machine metrics from datacenters. Such metrics could be key to optimization [38] , performance analysis [39, 61] , and uncovering important phenomena [24] . Yet, comprehensive datasets of lowlevel datacenter metrics are rare [64, 58, 10, 49] . Commercial providers are reluctant to publish such datasets, for reasons that include the need for commercial secrecy, adherence to privacy legislation, and lack of strong incentives to compensate for the additional effort. Often, published datasets are collected over short periods of time, with coarse time-granularity, not including low-level machine metrics. Some datasets have hardware specifications and rack topologies omitted and values obfuscated through normalization or other processes; only coarse, narrowly focused analysis can result from them. In contrast, in this work we propose a method for holistic analysis of datacenter operations, and apply it to the only long-term, fine-grained, open-access dataset [33, 60] that is currently available in the community. 
This dataset contains long-term and fine-grained operational server metrics gathered from a scientific computing infrastructure over a period of nearly 8 months at 15-second intervals. We show that such data are key in understanding datacenter behavior, and encourage all datacenter providers to release such data and join our effort. We focus on addressing three key challenges related to holistic understanding of datacenter operations.

First, the lack of work on diverse operational metrics. For decades, the community has successfully been optimizing computer systems only for the metrics we measured, e.g., throughput [57], job completion time [63], latency [12], and fairness [20], and biased toward the workloads and behavior that have been open-sourced [18, 2, 64, 58, 10, 49, 65]. In Figure 1, we show evidence motivating the need to capture diverse sets of datacenter metrics. Using the uniquely fine-grained open-source dataset, we perform an all-to-all correlation analysis on 300+ low-level metrics. To get a reasonable idea of the number of correlating pairs per day, we investigate 50 separate days. This number is sufficient to highlight that the number of correlations varies greatly, likely depending on the daily workload of the datacenter. This suggests that capturing only a few metrics is insufficient to get a comprehensive view of datacenter operation, as most metrics cannot be reliably derived from one another. Instead, to capture possibly vital information, we should aim to include as much data as possible, from hardware sensors to operating systems, and further to the application level. (In Section 7.2, we assess the additional storage costs for fine-grained data sampling at 15-second intervals and, together with the datacenter operators that published the open-access dataset, we interpret the results as indicative that these costs are worthwhile for the additional insight they enable.)

We identify as the second main challenge the lack of holistic analysis methods, able to combine and then work on diverse operational metrics such as workload and machine metrics. Previous research already points out that large bodies of modern research might be biased toward the available datasets [7, 2], and that the effort to measure "one level deeper" is still missing [42]. Besides operational bias, this also results in understudying other metrics and limits our ability to fully understand large-scale computer systems. For example, only since the 2000s, and more intensively only after the mid-2010s, has energy consumption become a focal point [11, 21]. In pioneering work in operational data analytics in the late 2010s, Bourassa et al. [6] propose to conduct extensive data collection and feed the results back into running datacenters to improve operations. Pioneering software infrastructures such as GUIDE [62] and DCDB Wintermute [40] take first steps in this direction. However, much more research is needed to understand the kinds of data and analysis feasible (and necessary) in this field. Similarly, many studies and available datasets focus only on computational aspects, e.g., [2, 65, 53, 18], but details on the operation of machine-learning workloads on infrastructure equipped with GPUs (and, further, TPUs, FPGAs, and ASICs) are still scarce.

As the third main challenge, we consider the relative lack of relevant, fine-grained, and public datasets.
In practice, collecting holistic data has been feasible at the scale of datacenters for nearly a decade, with distributed monitoring [34], tracing [52], and profiling [71] tools already being used in large-scale datacenters. Unfortunately, such data rarely leaves the premises of the datacenter operator. Of the relatively few traces that are shared publicly, many focus on important but specific kinds of workloads, such as tightly-coupled parallel jobs [18], bags of tasks [26], and workflows [65]. Other datasets include only a limited subset of metrics, such as power consumption [44] or high-level job information [43]. Only a handful of datasets include low-level server metrics, such as the Microsoft Azure serverless traces [49] or the Solvinity business-critical traces [50]. Recently, in 2020, the largest public infrastructure for scientific computing in the Netherlands released, as Findable, Accessible, Interoperable, Reusable (FAIR) [66] open-access data, a long-term, fine-grained dataset about its operations [33, 60]. In this work, we take on the challenge of conducting the first holistic analysis of the datacenter operations captured by this dataset.

Addressing these challenges, we advocate for a holistic view of datacenter operations, with a four-fold contribution:
1. We analyze whether diverse operational metrics are actually needed (Section 2). We conduct a pair-wise correlation study across hundreds of server metrics, and analyze whether correlations are enough to capture datacenter behavior. Our results show strong evidence of the need for a more diverse set of metrics to capture existing operational aspects.
2. Motivated by the need for diverse operational metrics, we propose a holistic method for the analysis of datacenter operations (Section 3). Our method considers information about machine usage, energy consumption, and incoming workload, and provides comprehensive statistical results.
3. We show the benefits of our method in understanding the long-term operations of a large public provider of scientific infrastructure (Sections 4-6). We provide the first holistic insights into a large-scale, fine-grained and, most importantly, public dataset [33, 60]: over 60 billion datapoints collected over 7.5 months at the high frequency of 15 seconds. Unique features include a comparison of generic and machine-learning workloads and nodes, per-node analysis of power consumption and temperature, and glimpses at the COVID-19 period.
4. We explore ways to leverage holistic-analysis results to improve datacenter operation (Section 7). We first investigate short-term use, quantifying how higher-frequency information leads to better online load prediction. We derive actionable insights, assessing the overheads of collecting more data and the correlations among metrics. We also exemplify long-term use for design and tuning.

In this section we show that more metrics are needed when analyzing datacenter behavior, and thus also that more metrics should be recorded and shared. We reach this conclusion by analyzing correlations for a rich set of over 300 low-level metrics collected from a scientific and engineering datacenter (details in Section 3.1). Although the dataset includes low-level metrics collected by servers, OS, and applications, we focus in this section on metrics mostly as context-agnostic information, that is, without a structure or ontology that attaches them to specific datacenter components or processes.
This allows us to understand whether more metrics can provide new information. (We address metrics in their context in our analysis method, in the next section.)

Method overview: Correlations can lead to improvements in system monitoring and reveal interesting relationships for, e.g., predictions [67]. In particular, we are interested in whether all metrics in the dataset we consider are necessary or can be obtained from others through correlation, and in whether these correlations are persistent or workload-dependent. First, we compute all valid correlation pairs during a day and inspect whether pairs considered "very strong" in the literature are persistent, as these pairs are the most likely candidates for reducing the size of the dataset through derivation, and likely the most robust. Second, we analyze visually the distribution and correlations of several commonly used high-level node metrics.

Conclusion: A small set of metrics cannot capture the information provided by diverse metrics. We urge datacenter practitioners to collect as much fine-grained data as possible to enable valuable analyses, and to open-source such data for the benefit of all.

To observe whether metrics are workload-dependent, we compute the Pearson, Spearman, and Kendall correlations for all metric pairs, for 50 different days. This results in over 14,000 valid correlation pairs per day. Next, we compute per day the number of pairs with "very strong" correlation, i.e., with Spearman coefficient ≥0.9 [48]. To verify that all coefficients are significant, we check that the probability of an uncorrelated system producing a correlation as extreme as in our dataset is negligible, i.e., all p-values of the pairs depicted in the figure are effectively zero.
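To make the per-day computation concrete, the following is a minimal sketch of counting very strong correlations for a single day, assuming that day's samples are available as a pandas DataFrame with one column per metric. The function name and column handling are illustrative; this is a sketch of the described procedure, not the exact scripts released with the dataset.

```python
import pandas as pd
from scipy import stats

def very_strong_pairs(day_df: pd.DataFrame, threshold: float = 0.9, alpha: float = 0.01) -> set:
    """Return the metric pairs with a very strong, significant Spearman correlation for one day."""
    # Drop constant (or all-NaN) metrics: the ranking step of Spearman/Kendall is undefined for them.
    df = day_df.loc[:, day_df.nunique() > 1]
    cols = list(df.columns)
    pairs = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            rho, p = stats.spearmanr(df[a], df[b], nan_policy="omit")
            if p < alpha and abs(rho) >= threshold:   # "very strong" per Schober et al.
                pairs.add((a, b))
    return pairs

# Per-day counts (as in Figure 1), and the pairs that persist across all analyzed days:
# counts = {day: len(very_strong_pairs(df)) for day, df in daily_frames.items()}
# persistent = set.intersection(*(very_strong_pairs(df) for df in daily_frames.values()))
```

Pearson and Kendall coefficients can be computed analogously with stats.pearsonr and stats.kendalltau; intersecting the per-day sets yields the pairs that persist across all days.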
Figure 1 depicts the number of correlated metric-pairs per day. We observe that the number of pairs fluctuates significantly, with only 40 pairs present on all days. This indicates that correlations, even very strong ones, change daily; because workloads are the most variable aspect of a datacenter, we conjecture that correlations are workload-dependent. This suggests metric information should be collected across many metrics and over long periods of time. Second, combined with the observations from Section 2.2, this shows that we cannot (significantly) reduce the number of metrics, as many of them cannot be reliably derived or predicted from others.

To understand what correlations can reveal about the correlated metric-pairs, we depict in Figure 2 a correlation matrix of power usage, ambient temperature, number of processes running, amount of memory used, and UNIX load1. For all these metrics, the input dataset has valid data, so we are able to accurately compute all correlations. The correlation matrix in Figure 2 includes: (i) distribution plots on the diagonal; (ii) pair-wise scatter plots and linear regressions in its sub-diagonal elements; (iii) pair-wise Pearson, Spearman, and Kendall correlations between metrics, in the elements mirroring (ii). Based on (i), we observe that most metrics have a long tail. We also observe that the majority of values for temperature, and to some extent for memory used, are confined to a constrained area. From (ii), we observe that most combinations of metrics do not seem to have a linear relationship. Four pairs of metrics seem to have either some or a strong linear correlation. If we look at the Pearson, Spearman, and Kendall correlations corresponding to the figures in (ii), we gain additional insights, following the suggestion of Schober et al. to use these correlations as "a measure of the strength of the relationship" [48]. (All p-values are < 10^-13, so the results are meaningful.) First, load1 and power usage seem to correlate somewhat according to the Spearman correlation, which does not show in (ii); second, the reverse holds for memory usage and the number of running processes, where the relationship visible in the regression plot does not appear in any of the correlations. Even though one might assume a (seemingly) linear regression would also show in a ranking correlation such as Spearman, this does not always hold. The temperature seems to correlate somewhat with the power usage of a node. This makes sense: initially, as power usage increases, the temperature goes up due to heat dissipation. Eventually, the temperature stabilizes and even goes down as the components within the system are cooled by the cooling system. The second pair that shows some relationship is memory used and number of processes running. The scatterplot shows a linear regression curve, yet as the data is not normally distributed, the Pearson correlation is close to 0. The Spearman correlation shows a moderate relationship between these two metrics. As can be observed from the regression curve, the amount of memory used does go up with the number of processes running, yet many outliers exist. The scatter plot shows that sometimes the amount of memory used in the system reaches the maximum when the number of processes running is low. We also observe many measurements where the number of processes running is high, yet the total memory used by the node remains low, indicating that whenever many processes are running, they are lightweight in terms of memory usage. The third pair that shows a strong correlation is load1 and the number of processes running, with a Spearman coefficient of 0.82 and a Kendall coefficient of 0.63. The load1 metric roughly depicts the average number of active processes in the last minute, which naturally should correlate well with the number of processes running. As the regression curve has a slope of almost 1, bursts appear to be infrequent; because load1 is an average over the past minute and the number of processes running is instantaneous, bursts of very short-running processes may be dampened. We observe that many measurements lie just below this curve, i.e., the number of processes running is higher than load1. This indicates that some processes are not awaiting resources and are, e.g., suspended. We also observe some measurements where the load is higher than the number of processes running; this indicates a possible burst of short-running processes that causes the load1 average to spike, yet does not reflect in the current number of processes running. The other pairs of metrics do not seem to correlate, indicating that these combinations cannot be used to predict a counterpart. This highlights the complexity of these systems and the difficulty of understanding how individual parameters affect a system's behavior.
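A Figure 2-style correlation matrix can be assembled with standard plotting tools. The sketch below places distribution plots on the diagonal, scatter plots with linear regressions below it, and the three correlation coefficients in the mirrored panels above it; the column names are illustrative placeholders, not the dataset's exact metric names.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

COLS = ["power_usage", "ambient_temperature", "processes_running", "memory_used", "load1"]

def coefficient_panel(x, y, **kwargs):
    """Annotate an upper-triangle panel with the Pearson, Spearman, and Kendall coefficients."""
    pearson, _ = stats.pearsonr(x, y)
    spearman, _ = stats.spearmanr(x, y)
    kendall, _ = stats.kendalltau(x, y)
    ax = plt.gca()
    ax.annotate(f"P={pearson:.2f}\nS={spearman:.2f}\nK={kendall:.2f}",
                xy=(0.5, 0.5), xycoords="axes fraction", ha="center", va="center")
    ax.set_axis_off()

def correlation_matrix(df):
    """(i) distributions on the diagonal, (ii) scatter + linear regression below it,
    (iii) the three correlation coefficients in the mirrored panels above it."""
    grid = sns.PairGrid(df[COLS])
    grid.map_diag(sns.histplot)
    grid.map_lower(sns.regplot, scatter_kws={"s": 4})
    grid.map_upper(coefficient_panel)
    return grid
```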
We propose in this section a holistic analytical method for datacenter operations. In our method, obtaining a holistic view requires combining machine, energy, and workload data; doing so with long-term and fine-grained data enables meaningful findings. Our method is data-driven, and thus we address the input data (Section 3.1) and its cleanup (Section 3.2). The highlight of this section is the data analysis, for which we describe the main research questions and how we address them (Section 3.3). We also cover in this section the practical aspects (Section 3.4), e.g., the software and our provisions to ensure the reproducibility of this work, and the main limitations we see to our method (Section 3.5).

Although our method does not depend on specific metrics, we are mindful of the information currently available as public datasets. We take as model the public dataset with the finest temporal and spatial granularity available, a dataset open-sourced by the largest public provider of scientific infrastructure in the Netherlands. The datacenter operators have shared low-level server metrics collected every 15 seconds for a period of nearly 8 months [33, 60].

Overall: Table 1 summarizes the public dataset: up to 1.26 million samples per metric per node, and in total 66 billion individual, high- and low-level metric measurements. The low-level metrics include server-level (e.g., power consumption), hardware-sensor (e.g., fan speeds, temperature), and OS-level metrics (e.g., system load).

Workload: The datacenter acts as infrastructure for over 800 users, who have submitted over 1 million jobs in the period captured by the dataset. The majority of these jobs originate from the bioinformatics, physics, computer science, chemistry, and machine learning domains. Jobs are exclusive per user; there are no multi-user jobs or workflows at the moment. SLURM is the cluster manager used to let users queue jobs for the different types of nodes. All jobs are scheduled using FIFO per stakeholder, with fair-sharing across stakeholders. Through the use of queues, the datacenter offers both co-allocation of jobs on the same node and reservation of nodes for exclusive use. The operator uses cgroups to enforce CPU and memory limits on multi-tenant nodes.

Infrastructure: In total, the datacenter contains 341 nodes spread across 20 racks. Racks are either generic, including nodes with only CPUs, or for machine learning (ML), including both CPUs and a number of GPUs per node. Over 90% of the workload on the GPU nodes is from the ML domain, a determination based on the libraries used by each job and later checked by the datacenter administrators. Each rack includes up to 32 generic nodes or up to 7 ML nodes; the counts depend on GPU types and on power-consumption limitations imposed by the cooling system.

After inspecting the data, we inquired with the dataset creators about (in)valid and missing data, and, finally, created cleanup scripts. After careful data-cleanup, the dataset we use in this work is unprecedentedly rich, covering the operation of 15 racks containing 315 nodes, with nearly 64 billion measurements spanning over 7 months. To clean the dataset in Table 1, we focus on the following (a minimal sketch of these filters appears after the list):

Clean node- and rack-data: We include only the 315 nodes in 15 racks that are used for computation. Together, these nodes contain 5,352 CPU cores, 41.6 TB of CPU memory, 128 GPUs, and 1.8 TB of GPU memory. Most nodes (283) contain only CPUs; the others (32) also have GPUs attached.

Clean job-data: For the workload, we filter out jobs whose start time falls outside the time range covered by the dataset. Additionally, we filter out all jobs that are not related to the racks in the machine dataset. These jobs originate from nodes in the 5 racks used as gateways for the public, as debug and testing resources, and as compile farms.

Clean metric-data: When performing numerical analyses, we remove the NaN values or set them to, e.g., zero when summing. Overall, the original dataset contains over 66 billion measurements, with close to 2.6 billion NaN values (3.85%). For some metrics, the dataset contains gaps where the monitoring system was down; for some others, data collection stopped halfway into May 2020.

Clean time-series: We filter out all missing measurements (not-a-numbers, NaNs). In visual overviews, we mark missing data using special coloring.

Clean correlation-data: When computing correlations between pairs of metrics, we omit pairs where one or both metrics' measurements never change, because such data is unfit for the ranking step required to compute the Spearman and Kendall correlations.
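The following is a minimal sketch of the job- and metric-cleanup filters listed above. The column names ("start_time", "rack") and the compute_racks argument are illustrative assumptions about the schema, not the dataset's exact field names.

```python
import pandas as pd

def clean_jobs(jobs: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp,
               compute_racks: set) -> pd.DataFrame:
    """Keep only jobs that start inside the dataset window and ran on the compute racks."""
    in_window = jobs["start_time"].between(start, end)
    on_compute_rack = jobs["rack"].isin(compute_racks)
    return jobs[in_window & on_compute_rack]

def clean_metric(series: pd.Series, summing: bool = False) -> pd.Series:
    """Drop NaNs for most analyses; replace them with 0 when the series will be summed."""
    return series.fillna(0.0) if summing else series.dropna()

def usable_for_correlation(a: pd.Series, b: pd.Series) -> bool:
    """Spearman and Kendall require a ranking step, so skip pairs where either metric never changes."""
    return a.nunique() > 1 and b.nunique() > 1
```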
Our method for holistic analysis proposes diverse research questions, answered using fine-grained machine and workload data.

Machine and workload data: As the main input dataset, we use the clean dataset introduced in Section 3.2. For the COVID-19 analysis, we record that the Dutch government declared the start of the (ongoing) pandemic on Feb 27, 2020 [15]; we thus consider all data before this date to be "non-covid" data. For the workload analysis, the datacenter cannot publish the workload data due to privacy constraints (the EU's GDPR); instead, we contacted the datacenter operators and worked with them to run the analysis we need on the data.

Method FAIRness [66]: The scientific community is a powerful advocate of FAIR data. The dataset used in this work is FAIRly stewarded by Zenodo, and comes with a full specification and a data-schema that allow sharing and using the data with low effort [60].

Novelty of our method: Previous work [2, 62, 40, 43] has performed individual analyses that align and overlap with our holistic analysis. However, the kind of analysis we propose in this work is novel through its all-encompassing scope and the detail of the data: we analyze workload (e.g., job) data and fine-grained machine data, and show that this is needed to better understand job-machine interaction and to perform predictions. We present three types of research questions (RQ) addressed by our novel analysis and mark with a star (★) the RQs which are not answered in any prior work:

A. Analysis of machine operations (results in Section 4): To analyze how the datacenter machines behave over a long period of time, we use a variety of low-level metrics as input for answering the following questions:
RQ1: What is the general resource usage? We aim to understand the usage of each server: the average system load; RAM, disk I/O, and GPU usage. We further study the average power consumption, the temperature, and the fan speed.
RQ2: What is the specific memory and network usage? The answer should include common ranges and modes in the distribution of memory consumption, etc., per node-measurement, linked when possible to known workload.
RQ3: What is the power consumption, per node and per rack? What is the rack temperature? We seek the (instantaneous) power consumption, including common ranges and modes. We want to further understand how the heat dissipates and whether the cooling system is overwhelmed.
RQ4: How does the system load vary over time? We focus here on diurnal and longer-term patterns. (The current dataset does not enable seasonality analysis, but data keeps accumulating.)
RQ5: How do generic and ML nodes and racks differ? This is an orthogonal concern that applies to all other machine-related questions.
RQ6: What is the impact of the COVID-19 pandemic? In particular, how did operations respond to workload changes?
B. Analysis of datacenter workload (results in Section 6): To understand whether the workload exhibits properties similar to other traces known in the community, especially traces from scientific and Big Tech clusters, we formulate the following questions:
RQ7: What are the job characteristics? Job size in CPU-cores, job length, and the variability across these features.
RQ8: What are the job arrival patterns? This question focuses on the basic statistics and time-patterns of job submissions.
RQ9: What is the peak demand? This explains the intensity of the peak demand, and contrasts it with normal operation.
RQ10: What are the patterns of job failure? The fraction of jobs that fail to complete, and the resources they waste.
RQ11: How do long jobs behave? We consider this orthogonal concern for each of the other workload-related questions.

C. Generating insights from data (results in Section 7):
RQ12: How can we leverage fine-grained data? This focuses on using fine-grained data to perform better predictions.
RQ13: What are the implications of storing fine-grained data? This question focuses on the feasibility of storing fine-grained metric data, as well as on how scalable its analysis is.
RQ14: How do metrics correlate? This question focuses on insights into low-level metric correlations and the implications for data collection and analysis.
RQ15: What are the implications of holistic analysis for datacenter operation and design? This focuses on leveraging fine-grained data to tune and design efficient datacenters.

To enable reproducibility, we validate and open-source all the software (scripts) used in this work. All scripts are checked for correctness by at least two persons. They load raw data from the dataset, available as FAIR [66], open-access data at:

We discuss here four known limitations of our method. The most important limitation derives from its holistic nature, which is also its strength. This nature is reflected in the broad analysis of several hundred metrics, which, as we show in the next three sections, helps understand how the whole works and gives actionable insights. However, datacenters can expose thousands of signals, so even our broad selection imposes a bias. Finding a complete and general holistic method of analysis is beyond the scope of this work; it is a goal we envision for the entire community over the next decade, building on already award-winning work that focuses on selecting meaningful signals [69] and on large-scale data collection [62, 44, 40]. Furthermore, the method proposed here can be contrasted with methods from the other end of the holistic-reductionist spectrum; compared with focused work on even one of the questions we address, our method cannot produce the same depth for the same effort. Without rehashing the broad and as-yet inconclusive debate about holism vs. reductionism, we draw attention to its current stand-off: both add value, and a community that discards either risks failing to produce scientific discoveries in the long term. A second limitation derives from the statistical methods used in this work and from the libraries that compute them. We use linear regression because it is the most common form of fitting and thus likely to be understood by every member of the community. However, we envision that expert-level models could be developed, e.g., leveraging machine learning or higher-order polynomials, giving better accuracy and precision.
An example here could be to develop non-linear models where failures and even performance anomalies [25] are causally linked to signals from many metrics in the system, such as high load, extreme temperature, or unusual [19] and/or fail-slow hardware failures [24]. As discussed in Section 2, most metrics are not normally distributed, which the Pearson correlation assumes; nonetheless, the three correlation coefficients together sketch a better picture. Another limitation is the vantage point, in that we look at data from a specific datacenter. This could especially affect the workload level, where machine learning is emerging. However, more datasets as fine-grained as the one this work analyzes are currently not available publicly; we encourage datacenter operators to help! Last, the dataset we analyze is much more fine-grained than others, but there is still much room for additional data and further analysis of it. For example, datasets could further include details on (i) the operational policies, e.g., detailed scheduling queues and thresholds (e.g., in the Parallel Workloads Archive, as defined by the community since the late 1990s [9]); (ii) the trade-offs considered during the capacity planning and datacenter design phases (e.g., of capability and cost); and (iii) the energy sourcing and flows (e.g., how the datacenter operations link with the energy markets and renewable energy-generation processes).

We present in this section a comprehensive characterization of machine operations in datacenters, with the method from Section 3.3. To obtain a holistic view of the workload and how resources are being used, we plot the number of jobs arriving and various resource-related metrics in Figure 3. Each slice of a bar in the figure depicts an hour, where the color of the given slice is set to the maximum normalized value observed within that hour. For the arrival of jobs, we count how many jobs arrive per 15-second interval (aligned with the metric samples) and then normalize the data using the 99th percentile and clip the values to 1. We use the 99th percentile so that a few outliers do not skew the normalization. We then label five intensity classes (very low, low, moderate, high, and very high), spread equally over the normalized range [0, 1] (a sketch of this binning appears at the end of this setup description).

Setup: To depict how the overall datacenter is utilized, we use UNIX load1 as the system-load metric. UNIX load captures the "number of threads that are working or waiting to work" [22]. The load is an aggregate metric over time; e.g., load1 uses a one-minute rolling window. The load can exceed the number of available server cores, indicating the system is likely overloaded. We sum load1 across all nodes and divide this number by the total available cores within the cluster, clipped to 1, as the values can reach well above 1 since there can be many more threads/processes running or queueing than available cores. Further, we show the average server power usage normalized to 5,500 W, which is the maximum the cooling system can handle per rack. The server temperature is normalized to the lowest of the maximum allowed temperatures across the different CPU models, i.e., the 77°C limit of the Intel Xeon Silver 4110 processor. The server RAM usage shows the utilization of all the RAM in the datacenter. To obtain disk I/O usage, we sum the bytes read and written from both local storage and NFS mounts and divide this number by the peak bandwidth achievable by a server. The datacenter does not contain burst buffers or a distributed file system. The peak bandwidth of 1.8 GB/s, obtained from benchmarks run in the datacenter, fits high-speed NVMe setups, or RAID-0 over multiple disks or SSDs. The GPU power usage, GPU temperature, and GPU fan speed serve as proxy metrics for GPU load, for which there is no direct utilization metric. The GPU power usage is normalized to the Thermal Design Power (TDP) of each GPU according to Nvidia's official documentation. The temperature is normalized against the limits of the GTX 1080 Ti, Titan V, and RTX Titan, which all share a thermal threshold of 91°C according to Nvidia's official documentation. The GPU memory usage depicts how much of the GPU memory is being consumed across the datacenter. The memory limits for the GPU models are 11 GB (GTX 1080 Ti), 12 GB (Titan V), and 24 GB (RTX Titan).
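As a concrete illustration of this setup, the sketch below bins job arrivals into the five intensity classes and computes the clipped, cluster-wide load1 utilization; the column handling and the equal-width binning edges are illustrative assumptions.

```python
import numpy as np
import pandas as pd

INTENSITY = ["very low", "low", "moderate", "high", "very high"]

def arrival_intensity(submit_times: pd.Series) -> pd.Series:
    """Count arrivals per 15 s, normalize by the 99th percentile, clip to [0, 1],
    then bin the per-hour maximum into the five intensity classes used in Figure 3."""
    counts = pd.Series(1, index=pd.DatetimeIndex(submit_times)).resample("15s").sum()
    norm = (counts / counts.quantile(0.99)).clip(upper=1.0)
    hourly = norm.resample("1h").max()   # each hourly slice is colored by its maximum intensity
    return pd.cut(hourly, bins=np.linspace(0.0, 1.0, 6), labels=INTENSITY, include_lowest=True)

def datacenter_load(load1_per_node: pd.DataFrame, total_cores: int) -> pd.Series:
    """Sum load1 over all nodes, divide by all available cores in the cluster, clip to 1."""
    return (load1_per_node.sum(axis=1) / total_cores).clip(upper=1.0)
```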
Observations: From Figure 3, we gain several interesting insights that would not have been possible with only high-level performance metrics. We observe that the number of incoming jobs does not always overlap with any other metric (O-1). Intuitively, one would assume that the load would increase with an increased number of incoming jobs, but as can be observed and further discussed in Section 4.5, one or more nodes peak continuously to high levels; the average system load is typically moderate (18.2% of the measurements), high (44.6%), or very high (20.2%) (O-2). We also observe that the power consumption reaches high levels most of the time.

O-9: The longer the job duration, the higher the probability of high outliers for the number of transmitted packets.

We characterize RAM usage for the entire dataset; as Section 4.1 shows, when designing a datacenter, only a few nodes with a lot of RAM are required, which reduces costs significantly and is more power efficient. This is further underlined by the long tail of RAM usage, with a maximum of 2 TB (O-7), see Figure 4. In Figure 5 we plot the number of transmitted packets versus the job time. We observe that shorter jobs seldom send more packets than longer-running jobs, i.e., there are no extremely network-heavy yet short-running jobs (O-8). This could indicate that the majority of the network traffic is in the initial setup, e.g., downloading data. Furthermore, both the number of transmitted packets and the outliers generally increase over time, but only marginally. Outliers appear more likely for long-running jobs (O-9). We plot in blue the curve found by the linear regression model fit, which shows that the increase in the number of packets transmitted vs. job duration is minimal. This could be due to, e.g., MPI jobs generating TCP traffic. Further analysis that includes more sophisticated network models, e.g., traffic-congestion analysis, is outside the scope of this work, but would be possible because the dataset also includes metrics such as TCP retransmissions [60].

O-10: Generic nodes (racks) have more stable power consumption than ML nodes (racks).

Energy consumption is becoming increasingly important [16]. To better understand the power consumption within the datacenter, we observe power consumption at two different levels. First, we show the distribution of power consumption per rack, in Figure 6. We additionally group together generic nodes and ML nodes, as the latter contain accelerators (GPUs). Figure 6 shows that there is little to moderate variation in generic-node rack power consumption, with the exception of rack 23.
Furthermore, the IQR ranges of the boxplots within each violin plot show that most generic racks consume more power than ML racks. The ML racks show more variation and have higher extremes even though they contain fewer nodes (O-10), see Table 3. The fluctuations are due to the power profile of GPUs: when idle they consume as little as 1 W, yet at full load their power consumption goes as high as 416 W. As ML nodes have up to four GPUs, their power consumption can go significantly higher than that of generic nodes. The reason for rack 23 being an outlier is that it hosts only one node vs. 30-32 for the other generic-node racks. Hence, this causes a lower power-consumption profile for the rack. Next, we investigate the power consumption of individual nodes within each rack. From Figure 7 and Table 3 we observe that generic nodes feature a small range, typically between 80-260 W. One exception is rack 23, whose distribution is more than 3× higher than that of the other generic nodes. Its node is the only one with four sockets, and its CPUs have a higher TDP than most of the other nodes. The node contains 48 CPU cores, 4× more than the regular generic nodes, in line with the 3× increase in power consumption (O-11). Moreover, the node contains significantly more RAM (see also Section 4.2), which means additional power draw. Comparing the generic nodes with the ML nodes, we observe that the generic nodes' power-consumption range is constrained, which in turn limits the ranges of the racks. As the generic-node racks pack more nodes, they consume more energy, leading to the higher averages seen in the previously discussed figure. We also wondered whether the lower number of nodes per ML rack is due to power-supply or cooling-system limitations. After inquiring with the datacenter operators, we learned that the cooling system is indeed the limiting factor, handling loads of only up to 5.5 kW per rack. We observe these limits are occasionally exceeded (O-12). Datacenter designs that include accelerators or aim for upgradeability have to consider this power-limiting aspect, underlined by Nvidia's recently announced GPUs, whose power consumption increased significantly compared to older versions.

The dataset we analyze in this paper contains multiple types of temperature-related metrics: GPU temperature, as well as server ambient temperature. While the former is the chip temperature, which is highly correlated with GPU workload, the latter is the temperature inside the server enclosure, which is influenced by many other factors: CPU workload, cooler (mal)functioning, as well as warmer nearby nodes and distance from the datacenter floor. According to the datacenter operator, all nodes in this study are air-cooled. We find that nodes in ML racks tend to be correlated in terms of temperature (O-13). They are either mostly warmer, or mostly cooler. Figure 8 plots the temperature registered by servers in rack 32 for the entire period over which the dataset was collected. The graph also maintains the server ordering in the rack, with the smallest node id at the top (see vertical axis). We notice that a node's position in the rack does not influence its temperature (O-14). This finding matches the type of cooling used, i.e., air. Based on the experience of the datacenter operator, water cooling would not change the conclusion, because water cooling has superior heat dissipation. For the entire period, the lowest node temperature is around 20°C, while the highest temperature is 35°C. This range is significantly lower than the 47-54°C reported by Netti et al. [40].
The difference could be caused by a different combination of node hardware and cooling system. The figure clearly depicts that hotter periods are correlated over the entire rack. This type of behavior holds for all ML racks. If water cooling were used, these temperatures would likely remain low and thus would not correlate as observed, due to the efficiency of such systems [70]. The generic racks are much cooler: most nodes operate at 23-25°C, ≈3°C lower than most ML-rack nodes (O-15).

Figure 9 depicts the load1, load5, and load15 UNIX metrics across the entire datacenter. We notice that the average load is very stable within the datacenter (O-16). The averages range between 10.6 and 11.8 for load1. Interestingly, this does not match the arrival pattern of jobs visible in Figure 18. This might be due to the loads being regularly above 16, as depicted by the error bars. This behavior indicates that processes are getting delayed, as the most common node within the datacenter features 16 cores. Figure 10 presents the load per day of week. We observe that the load in the cluster is also stable through the weeks, which aligns with the average per hour of day. There is a minimal elevation on Fridays and a small decrease in the weekend. As with the hour-of-day pattern, the arrival of jobs does not correlate with the load. When considering the load1 of ML nodes, we notice that it is stable, yet significantly lower than the cluster average. The average load1 per hour ranges between 6.3 and 7.4, which is around 40% lower than the average load across all machines (O-17). This indicates that these machines are utilized less. In Section 6, where we characterize the workload in depth, we notice in Figure 21 that indeed fewer users submit ML jobs.

We analyze how the datacenter operations changed during the 2020 COVID-19 pandemic, with the method from Section 3.3. We analyze data per rack and per node.

O-19: During covid, average RAM usage generally decreases in generic racks, but not in ML racks.

The average RAM utilization is 1% to 5% higher during the non-covid period than during the covid period for the nine generic racks (O-19); only one rack exhibits higher RAM utilization during the covid period. This can also be observed in Figure 11, where both the IQRs and the whiskers are higher for the respective boxplots. Among the ML racks, changes in RAM usage are mixed: 3 racks exhibit a decrease, 2 racks an increase. We conjecture that the specialist use of ML racks makes behavioral change less likely; in the Netherlands, experts continued work without much disruption, remotely. Across all nodes, the RAM utilization is slightly higher in the non-covid period, see Figure 12, with the exception of the prior outlier rack r23. We attribute this phenomenon to the datacenter continuing regular service during covid, but onboarding fewer inexperienced users that could introduce variable load while learning how to use the system. Interestingly, rack 23 does show more, and more extreme, outliers in the covid period, see Figure 11. With a significant RAM utilization (i.e., 50+%), the node appears to be used more intensely during the covid period. For both periods, ML-node racks consume more power than the generic-node racks (with the exception of special rack 23). The temperature for both periods is very close to 25°C on average, which is the ideal temperature for servers [45]. The generic nodes do not exhibit temperature increases in general; only rack 23 exhibits a 2-3°C increase (O-22).
Except for rack r30, the ML-node racks have 1°C to 3°C higher temperature during the covid period, especially rack r31 (O-23). These elevations also show in the boxplots, see Figure 11. However, these increases of just a few degrees still correspond to normal operation. We conclude there are no significant temperature differences between the covid and non-covid periods.

O-24: Increased load for several generic racks during covid. O-25: ML racks unchanged; rack 30 decreased during covid.

The generic-node racks, except for rack r23, have a higher average load and significantly higher outliers during the covid period. This can be observed in Figure 11. This indicates that nodes are utilized more heavily, particularly in short, heavy bursts, as the average remains low in comparison to the peak values. The reverse holds for rack 23, which is surprising given that the previous metrics show an elevation. As the node has 48 CPU cores, as outlined in Section 4.3, it is rarely overloaded. These results suggest that the jobs submitted during the covid period generate fewer tasks to be processed in parallel by the individual cores, yet they do lead to more power consumption and RAM utilization, which in turn could cause the elevation in temperature. Among the ML-node racks, only rack 30 has a significantly different load: much lower during the covid period. Nodes in the other ML racks exhibit no significant load differences during covid (O-25).

We characterize in this section the datacenter workload, with the method from Section 3.3. For job sizes, we depict the frequency of allocated CPU-cores in Figure 14. Most jobs are small (O-26). Considering the number of requested cores (equal to the number of allocated cores in this system), Figure 14 features a peak at 16 cores. This is equal, for example, to the number of requested cores in the Google trace [2]. As the most common nodes in the system have 16 cores, we believe most users simply request one full node using SLURM; the largest queue in SLURM enables this behavior. Most submitted jobs request fewer than 100 CPU cores, with extremes using over 500 CPU cores (O-26). Few users queue large jobs because, depending on the job-placement policy, it can take a considerable amount of time before enough resources become available. We inspect the runtime of jobs within the datacenter. Figure 15 shows the CDF of job durations. Most jobs are short: 88.9% of all completed jobs have a runtime of 5 minutes or less (O-27). Short jobs also consume less, cumulatively, than long-running jobs. The cumulative runtime of short jobs is more than 177× smaller than that of jobs running for a day or longer. Interestingly, jobs lasting up to one hour take up a noticeably larger share compared to other publicly available cluster traces [2, 26, 50]. Combining the data depicted in Figures 18, 19, and 20, we observe a highly variable job-arrival process (O-28). In contrast, the number of consumed CPU-core hours varies by at most two orders of magnitude, see Figure 17. Unlike the Mustang and OpenTrinity traces, the trace we analyze does feature a clear diurnal pattern in job submissions, depicted in Figure 18. We observe an office-like daily pattern (O-29), with job submissions ramping up in the morning after 9am and lasting until office closing time. This confirms the expectations of the datacenter operational team. However, we also observe that job submissions still occur until 4am.
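A minimal sketch of how the arrival patterns behind Figures 18-20 can be extracted from the job trace is shown below; the "submit_time" column name is an assumption about the schema.

```python
import pandas as pd

def arrival_patterns(jobs: pd.DataFrame):
    """Per-hour-of-day, per-day-of-week, and per-day job submission counts."""
    ts = pd.DatetimeIndex(jobs["submit_time"])
    per_hour_of_day = pd.Series(ts.hour).value_counts().sort_index()  # office-hours ramp-up (Figure 18)
    per_day_of_week = pd.Series(ts.day_name()).value_counts()         # Sunday lowest, Friday highest (Figure 19)
    per_day = pd.Series(1, index=ts).resample("1D").sum()             # daily rate vs. the 10,000 jobs/day mark (Figure 20)
    return per_hour_of_day, per_day_of_week, per_day
```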
Job submissions per day of week also vary greatly, see Figure 19. The difference between Sunday (lowest, 5,160.2 submissions) and Friday (highest, 14,753.9) is 2.86×. Following the method of Amvrosiadis et al. [2], we classify as a high arrival rate a rate of over 10,000 submitted jobs per day. Figure 20 shows the maximum number of submitted jobs on a single day is 167,189, and the average rate is above 10,000 (O-30). We observe that significantly fewer ML jobs arrive on average, compared to generic jobs (O-31). Figure 20 depicts this phenomenon. The median number of ML-job arrivals per day is only 320, an order of magnitude lower than the median arrival rate for all jobs. We link this to the system setup, where users require additional permission to submit jobs to ML nodes.

O-32: There are periods with high, sub-second job arrivals. O-33: Low variability in the number of requested CPU-cores.

We analyze now the peak demand of the datacenter. The datacenter experiences periods with high, sub-second job-arrival rates (O-32); these appear in Figure 20 as daily peaks larger than 10^5. These translate to resource over-commitment; although the allocation of CPU cores is limited using cgroups, other resources such as network and disk I/O are not rate-limited. Following the approach of Amvrosiadis et al. [2], we compute the coefficient of variation (CoV) of CPU cores requested per user. We observe in Figure 21 that the CoV is at most 2 and decreases rapidly below 1; these low values (O-33) are similar to those observed at Google, as reported by Amvrosiadis et al.

O-36: Unsuccessful jobs consume a significant amount of resources, and at worst they do so until they time out. O-37: Among all classes of runtimes, ML jobs terminate unsuccessfully more often than other jobs.

Relatively few jobs have unsuccessful outcomes, see Table 4. As we observe, more than 91% of jobs complete successfully (O-34), which is more than the highest fraction reported by Amvrosiadis et al. [2]. In contrast, we observe that longer jobs and jobs that consume more resources tend to fail more often (O-35), see Figure 16. For the latter category, for all (ML) jobs, a high fraction of 51.2% (55.8%) of the runtime is spent on non-completed jobs. For long-running jobs, (ML) jobs that do not complete consume 13.8% (51.9%) of the runtime (O-36). Across all job durations, between 32.3% and 55.8% of the ML jobs complete unsuccessfully; this is more often than for all jobs (12.9-51.2%) (O-37). We depict the total sum of job runtimes and their fraction per job state. The behavior of longer jobs failing more often is mainly due to timeouts, as there is a 5-day limit in the datacenter, as the operators reported. The data shows clearly that larger jobs fail more often and consume more time than smaller jobs.

We have presented an in-depth analysis of several of the metrics listed in the archive we consider. We continue by presenting how these results can be leveraged by the community at large to better understand datacenter behavior, to build more efficient datacenters, and to predict the artifacts of their operations. For the principle of holistic analysis to gain traction, the community needs to find useful guidelines and applications. Toward this end, but limited in scope, this section presents several examples. Online performance prediction is a well-established instrument for datacenter operation, useful, among others, for optimization at both the system [56] and the application level [36].
In the past two decades, online performance prediction using machine learning (ML) has become common [13]. Because ML methods such as Long Short-Term Memory (LSTM) networks [4] can operate on any data, new questions arise: Is higher-frequency data more useful for online prediction than low-frequency data? How high should the frequency be? To address these questions, we focus in this section on the use of LSTM to predict online performance data for the next 20 minutes of operation. LSTM is common in practice, including for datacenter workloads [51], which allows us to focus on the new research questions. We vary the granularity of the input, from high frequency (15 seconds) to low (up to 10 minutes), and observe the quality of the prediction provided by the LSTM predictor (using a common metric, the Huber loss). Common practice in datacenters and public clouds is to sample at 5- or 10-minute intervals, or even lower frequency. We employ the LSTM model depicted in Figure 22. Using it requires two setup elements. First, to select prediction metrics and prepare data for LSTM use, we focus on data studied during the pairwise correlation analysis (in Section 2). We select node load1 and node sockstat sockets used, which we find to correlate well. We normalize the data to make it suitable as LSTM input. Second, to train the LSTM model for prediction (inference), we create four different datasets from the same data: (1) the "raw" 15-second interval data as present in the original SURFace dataset, and data resampled at (2) 1-minute, (3) 5-minute, and (4) 10-minute intervals. We train the model on each dataset, resulting in four separate predictors. We split each training dataset into 2-hour chunks; as Figure 22 depicts, for the 15-second dataset this results in 480 input tuples. (The three other datasets include 120, 24, and 12 inputs, respectively.) We configure the model to generate a prediction for a 20-minute window, with predictions 15 seconds apart (80 predictions). We use 10% of the entire data for evaluation purposes; this allows us to evaluate the generality of the model by inspecting its performance on unseen data. Table 5 compares the loss values for the four trained models, with highlighted cells indicating which model delivers the best prediction. We perform this analysis on a randomly selected set of 13 general and ML nodes. We show here results for 5 nodes; these are representative of all the results in this analysis. The LSTM model trained with the 10-minute dataset is never the best predictor, indicating that datacenter operators should use higher-frequency data for prediction. Among the 15-second and 1-/5-minute datasets, we find the former is the best performer 7 times, with 3 ties, suggesting a choice in the accuracy-performance trade-off. To conclude: higher-frequency metric data improves performance predictions when using LSTMs. We recommend datacenters collect and provide such data.
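The sketch below illustrates the granularity experiment: resample one normalized 15-second metric series to a coarser interval, cut 2-hour input windows with 20-minute targets at 15-second resolution, and train an LSTM with the Huber loss. It is not the exact model of Figure 22; the layer size, optimizer, stride, and window alignment are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

HORIZON = 80                     # 20-minute prediction window at 15-second resolution
INPUT_SPAN = pd.Timedelta("2h")  # each training example covers two hours of history

def make_dataset(raw: pd.Series, rule: str, stride: int = 8):
    """Cut (2-hour input at granularity `rule`, 20-minute target at 15 s) pairs from a
    normalized series `raw` indexed by 15-second timestamps."""
    coarse = raw.resample(rule).mean()                 # rule in {"15s", "1min", "5min", "10min"}
    steps_in = int(INPUT_SPAN / pd.Timedelta(rule))    # 480, 120, 24, or 12 input tuples
    xs, ys = [], []
    for end in range(steps_in, len(coarse), stride):
        t_end = coarse.index[end]
        future = raw.loc[t_end:].iloc[:HORIZON]        # the next 80 fine-grained samples
        if len(future) < HORIZON:
            break
        xs.append(coarse.iloc[end - steps_in:end].to_numpy())
        ys.append(future.to_numpy())
    return np.asarray(xs, dtype="float32")[..., None], np.asarray(ys, dtype="float32")

def build_model(steps_in: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(steps_in, 1)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(HORIZON),
    ])
    model.compile(optimizer="adam", loss=tf.keras.losses.Huber())  # Huber loss, as in Table 5
    return model

# One predictor per granularity; the last 10% of the windows is held out as unseen data.
# for rule in ("15s", "1min", "5min", "10min"):
#     x, y = make_dataset(load1_series, rule)
#     split = int(0.9 * len(x))
#     model = build_model(x.shape[1])
#     model.fit(x[:split], y[:split], epochs=10, batch_size=32)
#     print(rule, model.evaluate(x[split:], y[split:]))
```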
Computation and Storage Overheads. The computational load for training LSTM models on 1- or 5-minute data is significantly lower than for 15-second intervals; applied to 2-hour chunks, we find all training remains feasible at datacenter level, and inference imposes a negligible computational cost. Additionally, the amount of storage required for the fine-grained data is non-linear in the number of samples, due to compression. Intuitively, storing a 2× larger dataset would require 2× the storage. However, with modern storage formats that leverage compression and columnar layouts, this is not the case. Using snappy compression (the default for parquet), the data representing a 10-minute granularity snapshot for the two metrics used in this example requires 32.77 MB of storage. In turn, only 277.56 MB is required to store the data with a granularity of 15 seconds, so increasing the volume of uncompressed data 40× increases the actual storage by only 8.47×. Therefore, leveraging modern data-storage techniques enables storing high-frequency data with sub-linear overheads. To conclude, higher-frequency metric data incurs both computational and storage overheads, but these seem worthwhile when compared with the benefits they enable.

Metric Correlations. The analysis we depict in Figure 1 shows a novel insight: only 40 pairs of low-level metrics are correlated over a time period of 50 days. The pairs of correlated metrics differ significantly per day, leading to over 14,000 unique correlations over the entire period. Having so many pairs that correlate infrequently is strong evidence that correlations are workload-dependent; therefore, as many metrics as possible must be captured, as frequently as possible. Our guideline is to eliminate only the metrics that are correlated over long periods of time. We conclude this section with anecdotal insight from our correlation analysis. We find many metrics that, intuitively, correlate persistently: load1 with load15, netstat TCP data with netstat IP data, and swap memory with free memory. By manually inspecting the correlations that are not persistent over time, we find other, more interesting correlations that would be difficult to predict even by experts. Table 6 presents three metrics linking I/O and GPU processing, corroborating recent related work.

Table 6: Correlated metrics identified by analyzing the dataset generated by the analysis in Figure 1.
Metric 1               | Metric 2
server swap memory     | GPU temperature
network receive fifo   | GPU temperature
TCP open sockets       | GPU temperature

Our final guideline is to use fine-grained data for designing and tuning datacenters, from individual chips to full-system procurement. We support this guideline with qualitative analysis. Datacenters often acquire homogeneous batches of hardware. Often, for datacenters for scientific computing and engineering, nodes pack a large x86 CPU and large amounts of memory. Clusters equipped for HPC and machine learning often also add GPUs and high-speed interconnects. The power envelope of datacenters has constantly increased, and modern large-scale datacenters approach the limits of what our society can leverage in terms of power while being mindful of carbon emissions [31, 14]. Others have considered power savings by means of reducing cooling [17], but that is only one example of the many aspects that could be considered. In this paper, we have analyzed many metrics, all with potential impact on how datacenters could be tuned and designed. We posit that using such data to customize datacenters to their users' needs is key for efficiency. Using the resource-usage profiles uncovered in this work, one could, for example, build machine-learning clusters more efficiently by leveraging lower-power CPUs (e.g., ARM and RISC-V) next to power-hungry GPUs. In GPU-based ML workloads, power-hungry x86 CPUs are underutilized, being mostly used for data pre-processing and data movement. Moreover, as memory usage is low in our traces, for similar workloads the designer need not purchase large amounts of RAM.
For unplanned peak loads, designers could leverage software disaggregation methods [23, 59] instead of hardware acquisition. Uncovering inefficiencies in datacenters through holistic performance-analysis approaches can also lead to improved chip design. In the post-Moore era, this is an avenue beginning to be explored by large tech companies and hardware manufacturers. Google pioneered optimizing ML training with TPUs [29]. This trend continues at organizations like Amazon and Nvidia, who are building inference-tailored chips [1, 41], or even deep-learning programmable engines [68]. Only with such analysis can practitioners tackle performance, power consumption, and other important metrics at the same time. Similar trends have already started at the network level, where in-network computing is already a reality [55]. Our data already supports the case for significantly improving network performance and relieving pressure on CPUs (see Figures 3 and 5). Already, the analysis we have conducted in this work has helped the datacenter operator improve the design of their next monitoring system.

In this section, we group related works by topic and discuss our contributions with respect to each. Overall, we propose a holistic method of analysis, and use it on a dataset with unprecedented temporal and spatial granularity among public datasets.
Datacenter operations: Several articles provide a holistic view of datacenter operations, including job allocation [3], cloud services [37], the physical network [32], etc. In contrast to related work, our article provides a view of the effect of the workload on machine metrics. This complements prior work and aids in understanding the operations within modern datacenters.
Characterizations of workloads: There are various articles characterizing workloads from Google [46, 47], FinTech [2, 65], scientific computing environments [2, 65, 54], etc. Adding to this topic, we demonstrate that our workload has unique properties. Additionally, many of the jobs are from the ML domain, which, combined with the machine-metric characterization, provides interesting (and sometimes contrasting) insights.
Characterizations of machine metrics: There is also a body of related work focusing on machine metrics. Closest to this work is the work done on a subset of the dataset we characterize [60]. Other related work focuses on a few, high-level metrics [30, 50, 5]. What differs in our work are the various additional and novel angles we take, and the discussion of the implications in directions such as hardware design and societal aspects.
Metric correlations: Some related works make use of correlation coefficients. There are many applications of finding correlations, e.g., finding metric correlations that hold over longer periods of time [10], finding (virtual) machines executing the same application [8], finding (virtual) machines that correlate in resource utilization [30] to minimize contention, checking for correlations between resources requested in datacenters [50], etc. We not only focus on finding correlations that hold over a long period of time, but also demonstrate that correlations are workload-dependent. We use three correlation methods to study how metric pairs are correlated, and use significantly more metric pairs (over 14,000).

For decades we have been focusing on optimizing systems only for the metrics we measure.
For decades, we have been optimizing systems for only the metrics we measure. Thus, to conquer the ever-increasing complexity of our datacenters, we posit the need for a holistic overview of such systems. In this work, we propose a holistic method of analysis for datacenter operations. We applied the method to a public, long-term datacenter trace of unprecedented temporal and spatial granularity. Poring over billions of data points in total, with samples collected at 15-second intervals covering hundreds of operational metrics, we characterized the machines, the power consumption, and the workload. We distinguished between generic and ML-specific information, and between regular operation and operation during the 2020 COVID-19 pandemic. We made over 30 main observations, which give detailed, holistic insight into the operation of a public scientific infrastructure. Finally, we discussed the implications of our findings for online ML-based prediction, and for long-term datacenter design and tuning.

We envision our work, and similar pioneering efforts, as motivators for a community-driven approach embracing holistic analysis. Concretely, we envision our analysis will show organizations the potential of collecting, and ultimately releasing, many more fine-grained datasets. We also envision studies comparing such datasets, finding invariants and trends, and thus bolstering fundamental knowledge in the field. Last, we envision new techniques for datacenter operations, from dynamic scheduling to long-term resource procurement, all enhanced by the use of holistic data and considerations.

For future work, investigating different forecasting techniques would be interesting, to see what further leverage this fine-grained data can offer. Due to the richness of this dataset, we believe more interesting characterizations can be performed, which is another item for future work. Comparing the findings of this work, and potentially of follow-up work, with another rich dataset would show whether the findings hold across multiple systems and workloads. Such comparisons can lead to the development of new scheduling approaches or the design of new systems. Furthermore, having access to multi-year data can eliminate the effect of seasonality, as discussed in our threats to validity. Accounting for, eliminating, and comparing the effect of seasonality will further contribute to our understanding of these systems.

References

High performance machine learning inference chip
On the diversity of cluster workloads and its impact on research results
A reference architecture for datacenter scheduling: design, validation, and experiments
Predicting amazon spot prices with LSTM networks
Data centers in the cloud: A large scale performance study
Operational data analytics: Optimizing the national energy research scientific computing center cooling systems
Getting what you measure
Identifying communication patterns between virtual machines in software-defined data centers
Benchmarks and standards for the evaluation of parallel job schedulers
Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms
The tail at scale. CACM, 56
A prediction method for job runtimes on shared processors: Survey, statistical analysis and new avenues
State of the Dutch data centers
Ontwikkeling COVID-19 in grafieken (Development of COVID-19 in graphs)
Performance evaluation of a green scheduling algorithm for energy savings in cloud computing
Temperature management in data centers: why some (might) like it hot
Experience with using the parallel workloads archive
Anomaly detection in high performance computers: A vicinity perspective
Dominant resource fairness: Fair allocation of multiple resource types
Greenslot: scheduling energy consumption in green datacenters
Linux load averages: Solving the Mystery
Efficient memory disaggregation with infiniswap
Fail-slow at scale: Evidence of hardware performance faults in large production systems
Performance anomaly detection and bottleneck identification
The grid workloads archive. FGCS, 24
Ddlbench: Towards a scalable benchmarking infrastructure for distributed deep learning
Analysis of large-scale multi-tenant gpu clusters for dnn training workloads
Motivation for and evaluation of the first tensor processing unit
Correlation-aware virtual machine allocation for energy-efficient datacenters
Recalibrating global data center energy-use estimates
Fiber optic communication technologies: What's needed for datacenter network operations
Beneath the SURFace: An MRI-like View into the Life of a 21st Century Datacenter
Monitoring and control of large systems with monalisa
Ease.ml: Towards multi-tenant resource sharing for machine learning workloads
Machine learning based online performance prediction for runtime parallelization and task scheduling
Nist cloud computing reference architecture. NIST special publication
A year in the life of a parallel file system
Taming performance variability
DCDB wintermute: Enabling online and holistic operational data analytics on HPC systems
Always measure one level deeper
Job characteristics on large-scale systems: long-term analysis, quantification, and implications
What does power consumption behavior of HPC jobs reveal?: Demystifying, quantifying, and predicting power consumption characteristics
Predicting and mitigating jobs failures in big data clusters
Failure analysis and prediction for big-data systems
Correlation coefficients: appropriate use and interpretation. Anesthesia & Analgesia
Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider
Statistical characterization of business-critical workloads hosted in cloud datacenters
The performance of LSTM and bilstm in forecasting time series
Dapper, a large-scale distributed systems tracing infrastructure
Community resources for enabling research in distributed scientific workflows
Community framework for enabling scientific workflow research and development
Your programmable nic should be a programmable switch
Multivariate resource performance forecasting in the network weather service
Distributed computing in practice: the condor experience
Borg: the next generation
Towards resource disaggregation-memory scavenging for scientific workloads
Beneath the surface: An mri-like view into the life of a 21st-century datacenter
Is big data performance reproducible in modern cloud networks
GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility
Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance
Large-scale cluster management at google with borg
The workflow trace archive: Open-access data from public and private computing infrastructures
The FAIR Guiding Principles for scientific data management and stewardship
Using spearman's correlation coefficients for exploratory data analysis on big dataset
XILINX. The Xilinx® Deep Learning Processor Unit (DPU)
vperfguard: an automated model-driven framework for application performance diagnosis in consolidated cloud environments
Comparison and evaluation of air cooling and water cooling in resource consumption and economic performance. Energy
Non-intrusive performance profiling for entire software stacks based on the flow reconstruction principle