BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Shamminuj Aktar, Hamdy Abdelkhalik, Nazmul Haque Turja, Yehia Arafa, Atanu Barai, Nishant Panda, Gopinath Chennupati, Nandakishore Santhi, Stephan Eidenbenz, Abdel-Hameed Badawy
Date: 2022-02-16

Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the level of Basic Blocks (BBs), the single entry-single exit code blocks used by all compilers to break a large code into manageable pieces for analysis. Utilizing ML and BB analysis together can enable scalable hardware-software co-design beyond the current state of the art. In this work, we extrapolate the basic block execution counts of GPU applications for large input sizes from the counts for smaller input sizes of the same application. We employ two ML models, a Poisson Neural Network (PNN) and a Bayesian Regularization Backpropagation Neural Network (BR-BPNN). We train both models using the lowest input values of the application and random input values to predict basic block counts. Results show that our models accurately predict the basic block execution counts of 16 benchmark applications. The PNN and BR-BPNN models achieve an average accuracy of 93.5% and 95.6%, respectively, when extrapolating the basic block counts for large input sets after training on smaller input sets. Additionally, the models show an average accuracy of 97.7% and 98.1%, respectively, when predicting basic block counts on random instances.

Graphics Processing Units (GPUs) have rapidly advanced to become the most common high-performance accelerators for high-performance computing. Modern GPUs contain hundreds of processing units, capable of achieving up to 9.7 TFLOPS (Trillion Floating Point Operations per Second) for double-precision calculations [3]. The highly parallel execution model, high performance on floating-point arithmetic, and memory parallelism of GPUs make them particularly well suited to many scientific and engineering workloads that occupy High-Performance Computing (HPC) clusters. The latest TOP500 list shows that the new wave of supercomputers is dominated by GPU-accelerated systems [1]. This wide usage has triggered extensive research on performance modeling and simulation (ModSim) of modern GPUs. Modeling and simulation tools play a significant role in designing and evaluating new hardware features and in understanding the factors that limit application performance. Due to the complexity of modern GPUs, it is no longer intuitive to improve their hardware performance. Using ModSim tools, GPU architects gain insight into the impact of any change in hardware design on application performance, area, and power consumption. Thus, ModSim tools enable design space exploration. On the other hand, ModSim tools are also used extensively in hardware-software co-design to tune application performance. They help system designers choose the proper hardware for a specific workload. Also, application developers can tune their applications for a wide variety of GPUs without having access to the physical hardware.
Scalable ModSim tools are necessary to allow quick and accurate design space exploration. Performance analysis centered around the software concept of basic blocks (BBs), defined as single entry-single exit sections of code, has recently been proposed for CPU architectures [7, 9, 12, 33]. The number of times a basic block executes in a specific application is a good measure of that basic block's impact on the performance of the whole application: a higher count leads to a higher impact. Thus, knowing the number of times (count) a BB executes is necessary for BB-level performance analysis. By characterizing the performance signature of the basic blocks and extracting the counts and other relevant information at basic block granularity, we can predict the performance of an application at a finer granularity, more efficiently and scalably. However, extracting the BB execution counts for large input sizes of an application can be very costly and become a scalability bottleneck, since it requires executing or profiling the application. Addressing this inefficiency is necessary to achieve scalable performance prediction for ModSim tools, which would further enable hardware-software co-design. Therefore, BB count extrapolation enables scalable ModSim tools that use BB analysis in their performance prediction flow [9, 10, 12]. Furthermore, BB analysis is needed in compiler analysis passes [2], and this work could be leveraged for that purpose.

In this paper, we introduce BB-ML. BB-ML uses Deep Neural Networks and regression-based machine learning techniques to predict basic block execution counts of GPU applications. First, we build a Poisson Neural Network (PNN). PNN is a probabilistic model that treats basic block counts as a Poisson random variable and leverages Deep Neural Networks (DNNs) to learn to extrapolate basic block counts for larger input configurations (sizes) of applications. Furthermore, we employ the Bayesian Regularization Backpropagation Neural Network (BR-BPNN), which treats basic block count extrapolation as a regression task. For both models, we use labeled training data, where the input parameters of applications are used as input features for the neural networks and basic block counts are used as the output of the networks. Results show that both networks can accurately extrapolate basic block counts for large input configurations and random input values of the applications. Also, we compare the results of the PNN and BR-BPNN networks to find a generalized network that can fit applications from different domains. The results show that the average accuracy of both networks for extrapolation is similar. In addition, we observe that BR-BPNN works better for linearly separable applications, while PNN fits better for applications with non-linear behavior.

The contributions of the paper are:
• To the best of our knowledge, this is the first work to predict BB counts for GPU applications.

The rest of the paper is organized as follows: Section 2 summarizes the relevant related work and sets this work apart from it. Section 3 covers the necessary background and shows a control flow graph with basic blocks of a small sample program. Section 4 overviews the basic block trace extraction, the two neural network architectures, and their training. Section 5 goes over the experimental results of our NN models, compares the two models, and shows detailed results. Finally, Section 6 concludes the paper.
Basic block level performance prediction has attracted much research in the CPU [12] and GPU [10] domains. Tools such as llvm-mca [2] from the LLVM team or IACA [5] from Intel perform basic block level program analysis. This section surveys work centered around basic blocks: predicting their count or throughput, or leveraging basic block information to predict whole-program execution time.

In the CPU domain, Mendis et al. [19] proposed Ithemal, a tool that predicts basic block throughput. The authors used a regression-based model built on a Long Short-Term Memory (LSTM) recurrent neural network; the model takes the basic block assembly instructions as input and provides the basic block throughput as output. In the same spirit, Abel et al. [7] proposed a parametric pipeline model and basic block throughput predictor for Intel processors. Their model also addresses some limitations of the Ithemal model. Other work such as PPT-AMMP [12], proposed by Chennupati et al., demonstrated a regression-based technique to predict the BB counts of CPU applications for large input sets. They used the counts to predict the runtime of serial applications on a single-core CPU.

In the GPU domain, Betts et al. [10] proposed a model that leveraged basic block execution information, in the form of execution traces and execution times, to predict a worst-case execution time for the whole application. They used a cycle-accurate simulator to extract the basic block traces. The problem with cycle-accurate simulators is that they are very slow. For large input sizes, it could take weeks, if not months, to finish execution and extract the traces [8]. Another drawback is that trace extraction needs to be redone every time the input changes, which is a daunting task on cycle-accurate simulators. To the best of our knowledge, there exists no prior work that predicts BB execution counts for GPU architectures. Adopting this work in the future can reduce the time and space of trace extraction for performance prediction models [8, 10], thus enabling scalable predictions that train the model once per application on small input sizes and reconstruct a trace for larger input sizes.

A Basic Block (BB) represents a straight-line code sequence with no branches. In other words, a BB is a piece of code that has a single entry and a single exit point. In a control flow graph (CFG) of a program, if program execution enters a basic block, all the instructions in that basic block execute until a branch instruction is encountered. Generally, the compiler breaks the program down into multiple basic blocks during the analysis process (a small illustrative example appears below).

In the CUDA programming language, a group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A GPU program, or kernel, is a function executed concurrently as a grid of thread blocks. GPU programs exploit data, thread, and task parallelism. From the software perspective, GPUs have three execution levels, block-level, warp-level, and thread-level, to achieve such parallelism. Additional information can be found in the CUDA programming model [4].
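To make the basic block notion concrete, the following small, purely hypothetical function (written in Python for brevity, even though GPU kernels are written in CUDA) is annotated with the basic blocks a compiler would carve it into; the block boundaries and the per-block execution counts are illustrative, not taken from the paper's sample program.

```python
# Hypothetical example: a tiny SAXPY-like function and its basic blocks.
# Each commented region is a single-entry, single-exit straight-line sequence.

def saxpy(alpha, x, y):
    # BB0: entry block -- executes once per call
    n = len(x)
    i = 0
    while i < n:              # BB1: loop header -- evaluated n + 1 times
        # BB2: loop body -- executes n times; its count grows with the input size
        y[i] = alpha * x[i] + y[i]
        i += 1
    # BB3: exit block -- executes once per call
    return y
```

The count of BB2 scales directly with the input size n, which is exactly the kind of input-dependent behavior BB-ML learns to extrapolate.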
Researchers from diverse domains widely employ machine learning techniques for solving complex problems. Machine learning techniques use mathematical models and data analysis to learn information directly from data. Probabilistic models assist machine learning by examining the underlying distribution of datasets and learning the pattern that connects inputs and outputs.

Regression is a supervised machine learning technique in which the model learns from labeled samples. Regression techniques are employed to predict a real or continuous value using the knowledge learned from a labeled dataset. The most commonly used regression techniques are linear regression, logistic regression, and polynomial regression [13].

Deep Neural Networks, inspired by the information processing of the human brain, use hidden layers to store and find the relationship between input parameters and outputs. A neural network consists of an input layer, hidden layers, and an output layer, where information is passed through the layers and updated using feedback mechanisms. An activation function, a learning mechanism, and a loss function are used in the neural network to optimize the training process. An artificial neural network composed of multiple hidden layers is a Deep Neural Network (DNN). DNNs are efficient at modeling complex, non-linear, real-world problems. Some commonly used DNN frameworks are TensorFlow [6], PyTorch [24], and Keras [16].

This section goes over the various steps needed for basic block count prediction. We first explain the basic block trace extraction, then we detail the architecture of the neural network models, and finally the training of the models. BB-ML can utilize any NVIDIA GPU given its configuration parameters. We are limited to NVIDIA GPUs since we use their tools for BB count profiling. The methodology can be extended to any other GPU, given that we can collect the necessary traces. Figure 2 shows a high-level overview of BB-ML. It comprises various components that feed off each other to eventually predict the BB counts. First, the BB Trace Extraction Tool extracts basic block traces by running the application on actual GPU hardware. Then, the collected BB traces are fed to the two neural network models to obtain the BB count predictions.

The first step is extracting the basic block execution traces. To extract the basic block trace, we extended NVBit [28] to collect the trace dynamically. NVBit is a dynamic binary instrumentation framework from NVIDIA Research that can be used to implement profilers, error checkers, and static or dynamic trace collection tools. We utilized this tool to dynamically collect the basic block counts per kernel for GPU applications. We extract an application-dependent basic block trace; thus, we can use any NVIDIA GPU from any architecture. We extracted the traces of the different applications using a Tesla A100 GPU from the Ampere architecture and a Titan V GPU and a Tesla V100 GPU from the Volta architecture.

Algorithm 1: Trace Extraction Algorithm

Algorithm 1 shows a high-level overview of the logic of extracting the basic block trace using NVBit. The first step of the algorithm is to instantiate and initialize the necessary variables (the basic block ID, the kernel ID, and the arrays holding the BB count for each BB and the kernel IDs), as shown in lines 1-4. We set the array sizes to 1000 for illustration purposes, but the array size varies with the number of BBs per kernel and the number of kernels in a GPU application. The NVBit tool assigns an automatic ID to each BB and kernel, but these values are very large and not in order. For simplicity, we use the ID_BB and ID_ker variables in lines 1-2 to assign new, in-order IDs to the BBs and the kernels. In lines 5-14, we go over all the kernels and the basic blocks in the application and inject a function after each basic block to collect the IDs of the BB and the kernel. In lines 15-21, we use another function to receive the ID data dynamically and calculate the BB counts. In line 19, we store the kernel ID at the same index as the BB count because we need to associate each BB with the specific kernel it belongs to. We follow the methodology suggested in the NVBit tool to profile the applications at runtime. We wrote a script that automates the collection of the BB traces while changing the input at each invocation. We collect the BB count per BB per kernel per application, which is later fed to the ML models.
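The actual instrumentation is an NVBit tool written in C++; the following is only a simplified, language-neutral sketch in Python of the counting logic that Algorithm 1 describes. The handle values and the dictionary-based data structures are hypothetical stand-ins for NVBit's identifiers and the fixed-size arrays used in the real tool.

```python
from collections import defaultdict

# Sketch of the host-side counting logic: the injected callback reports a
# (kernel, basic-block) pair every time a basic block finishes executing, and
# this class reduces that event stream into per-kernel, per-BB execution counts.

class BBTrace:
    def __init__(self):
        self.next_kernel_id = 0        # in-order kernel IDs (analogous to ID_ker)
        self.bb_id_maps = {}           # per-kernel map: raw BB handle -> in-order BB ID
        self.kernel_ids = {}           # raw kernel handle -> compact in-order ID
        self.counts = defaultdict(int) # (kernel_id, bb_id) -> execution count

    def _kernel_id(self, kernel_handle):
        # Remap NVBit's large, unordered kernel handle to a small in-order ID.
        if kernel_handle not in self.kernel_ids:
            self.kernel_ids[kernel_handle] = self.next_kernel_id
            self.bb_id_maps[self.next_kernel_id] = {}
            self.next_kernel_id += 1
        return self.kernel_ids[kernel_handle]

    def on_bb_executed(self, kernel_handle, bb_handle):
        # Callback injected after every basic block: remap the BB handle and count it.
        kid = self._kernel_id(kernel_handle)
        bb_map = self.bb_id_maps[kid]
        if bb_handle not in bb_map:
            bb_map[bb_handle] = len(bb_map)   # assign the next in-order BB ID
        self.counts[(kid, bb_map[bb_handle])] += 1

# Usage with a made-up event stream of (kernel handle, BB handle) pairs.
trace = BBTrace()
for kernel, bb in [("kernelA", 0x7F01), ("kernelA", 0x7F02), ("kernelA", 0x7F01)]:
    trace.on_bb_executed(kernel, bb)
print(dict(trace.counts))   # {(0, 0): 2, (0, 1): 1}
```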
This section describes the neural networks used to build the prediction models. First, we build a Poisson Neural Network (PNN), a probabilistic model that treats the underlying distribution of basic block execution counts as a Poisson probability distribution. The task of predicting counts is often modeled as a regression problem in various domains [17, 22, 31]. To compare and verify our results, we trained several regression networks with standard backpropagation variants, e.g., gradient descent backpropagation [26], gradient descent with adaptive learning rate backpropagation [26], conjugate gradient backpropagation [26], and quasi-Newton backpropagation [27]. Although the trained networks showed good accuracy for some of the applications, they could not fit some of the benchmarks of Table 1. Therefore, we employ the Bayesian Regularization Backpropagation Neural Network (BR-BPNN), which uses the Bayesian regularization algorithm and minimizes a linear combination of squared errors and weights [14], to compare PNN against a competitive regression model.

A Poisson probability distribution models the number of times an event occurs within a fixed interval. It is commonly used to model a system where the target variable is a discrete count. Basic block execution counts are always positive and discrete in nature. We analyze the distribution of basic block counts and their relationship to the input parameters. Figure 3 presents a kernel density estimation (KDE) plot of the counts for one basic block of the bicg dataset from Table 1, showing the underlying probability distribution of the counts; the X-axis represents the basic block counts, the Y-axis shows the estimated probability density, and the probability of each basic block count corresponds to the area under the curve. The density plot shows that the underlying distribution of basic block counts is Poisson-like, where each count is independent. From the distribution, we find that basic block counts can be modeled as a Poisson random variable.

We propose a machine learning model that learns a mapping from the input parameters to the basic block count. Suppose x is our input parameter vector and y is the basic block count. We model y as a Poisson random variable with mean λ(x), where λ depends only on the input parameters x. Thus, the probability that y takes the count k is given by

P(y = k | x) = λ(x)^k e^(−λ(x)) / k!

The Poisson Neural Network (PNN) uses a Poisson loss function to minimize the loss and maximize the likelihood of the data. PyTorch's Poisson negative log-likelihood loss (torch.nn.PoissonNLLLoss) is used to compute the loss in the training process, where we set the log_input parameter to False [25]. The loss is a scalar value computed from the prediction (input) and the target as

loss = input − target · log(input + eps),

where eps is a small value added to avoid computing log(0). We implemented the PNN model using PyTorch. The input layer neurons are fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. The hidden layer size is ten and the output layer has a single neuron, while the input layer size depends on the number of inputs of the application. We use the tanh activation function to transform the weighted sum of the inputs into an output. The optimizer updates the weights and the learning rate during the training stage. To optimize the training process, we tune the hyperparameters (number of epochs, batch size, and learning rate). Rigorous tuning shows that the network reaches its minimal loss for specific values of the hyperparameters. Therefore, we train the models using a batch size of 10, 300 epochs, and a learning rate of 10^-4.
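A minimal PyTorch sketch of a PNN along these lines is shown below; it is not the authors' code. The layer sizes, tanh activation, PoissonNLLLoss with log_input set to False, and the hyperparameters mirror the description above, while the Adam optimizer and the Softplus output (used only to keep the predicted Poisson rate positive) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class PNN(nn.Module):
    """Sketch of a Poisson Neural Network for BB-count prediction."""
    def __init__(self, n_inputs):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 10),  # hidden layer of size 10
            nn.Tanh(),                # tanh activation
            nn.Linear(10, 1),         # single output: the predicted BB count (Poisson rate)
            nn.Softplus(),            # keep the rate positive (sketch choice, not from the paper)
        )

    def forward(self, x):
        return self.net(x)

def train_pnn(x, y, epochs=300, batch_size=10, lr=1e-4):
    """x: (N, n_inputs) tensor of application input parameters,
       y: (N, 1) tensor of basic block counts."""
    model = PNN(x.shape[1])
    loss_fn = nn.PoissonNLLLoss(log_input=False)       # loss = input - target*log(input + eps)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return model

# Toy usage: counts that grow with a single input parameter.
x = torch.arange(1.0, 101.0).unsqueeze(1)
y = 5.0 * x + 3.0
model = train_pnn(x, y)
print(model(torch.tensor([[120.0]])))  # query the model for an unseen, larger input
```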
The BR-BPNN model has been applied in several domains [20, 23, 32], as it provides two major advantages over normal backpropagation neural networks: it overcomes both overfitting and overtraining. In general, the mean square error of training a feedforward neural network can be improved by using backpropagation methods such as the Levenberg-Marquardt optimization algorithm [18]. Furthermore, the regularization technique decreases the error values in the backpropagation stage. Instead of calculating the MSE of the data only, as in Equation 3, the regularization adds a weight decay term, as in Equation 4, forming an objective function that penalizes large weights to obtain a smoother mapping. This term also prevents the network from overfitting [21]. The ratio α/β controls how smoothly the weights change: the larger the ratio, the smoother the network response. The Bayesian regularization algorithm calculates the optimal α and β values for the neural network based on Bayes' rule.

To address the problem of extrapolating to unseen high input configurations, we model the problem as a regression task. First, we started with a linear regression network, which fitted some of the benchmarks listed in Table 1 well. However, linear regression could not fit the more complex applications. This motivated the Poisson NN model. We treated the hidden layer size as a hyperparameter for the BR-BPNN. For most benchmarks, the hidden layer had only one neuron; the hidden layer size was chosen to be 10 for the gramschmidt application. The BR-BPNN network uses the MSE loss function, the most commonly used loss function for regression. The activation function we used is the tan-sigmoid function defined in Equation 5. We implemented this model in MATLAB 2021b and used 1000 epochs for each benchmark. The rest of the hyperparameters for BR-BPNN were the default parameters in MATLAB.
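For reference, the quantities referred to as Equations 3-5 can be written in their standard forms, assuming the usual Bayesian-regularization objective and MATLAB's tan-sigmoid definition; this is a notational sketch rather than the authors' exact equations.

```latex
\begin{align}
  E_D &= \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2
      && \text{(data error / MSE, Eq. 3)} \\
  F &= \beta E_D + \alpha E_W, \qquad E_W = \sum_{j} w_j^{2}
      && \text{(regularized objective with weight decay, Eq. 4)} \\
  \operatorname{tansig}(n) &= \frac{2}{1 + e^{-2n}} - 1
      && \text{(tan-sigmoid activation, Eq. 5)}
\end{align}
```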
We have three sets of training and testing data generated from the same originally collected data. We split the data into high-low, mixed high-low, and random datasets. Here, we explain how the data is split. First, for the high-low data, we assign the lower 70% of the input range of the input samples to the training set and the remaining 30% of the input range, which represents the higher-range input samples, to the test set. Some input samples combine low and high values. We first train our models without these mixed input values and calculate the accuracy. To examine the effect of mixed input values, we also train the models with a training dataset that includes the mixed range of input samples (mixed high-low data). Lastly, we randomly assign 70% of the samples to the training set and the remaining 30% to the test set to measure the overall accuracy of our model.

Prediction and extrapolation with both the PNN and BR-BPNN models generate scalar values of basic block counts. To compare the predicted and actual basic block counts, we use a normalized error metric, the Mean Squared Error (MSE), defined as

MSE = (1/N) Σ_{i=1..N} (ŷ_i − y_i)^2,

where N is the number of samples, ŷ_i is the predicted count, and y_i is the actual basic block count. We also compute the Pearson correlation coefficient [29] and the Spearman correlation coefficient [30] between the predicted and actual basic block execution counts for each benchmark. Pearson correlation measures the strength of the linear relationship between predicted and measured counts. Spearman correlation assesses whether there is a monotonic relationship between the two variables and is equal to the Pearson correlation between their rank values.
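To make the evaluation metrics concrete, the following short Python sketch (not the authors' code) computes them with NumPy and SciPy; the min-max normalization applied to the counts before computing the MSE is purely an assumption made for illustration, since the exact normalization is not spelled out here.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def normalized_mse(y_true, y_pred):
    # Assumed normalization: min-max scale the counts to [0, 1] before the MSE.
    lo, hi = y_true.min(), y_true.max()
    scale = hi - lo if hi > lo else 1.0
    t = (y_true - lo) / scale
    p = (y_pred - lo) / scale
    return np.mean((t - p) ** 2)

def evaluate(y_true, y_pred):
    return {
        "mse": normalized_mse(y_true, y_pred),
        "pearson": pearsonr(y_true, y_pred)[0],    # strength of the linear relationship
        "spearman": spearmanr(y_true, y_pred)[0],  # monotonic (rank-based) relationship
    }

# Toy usage with made-up predicted and actual basic block counts.
y_true = np.array([100.0, 400.0, 900.0, 1600.0])
y_pred = np.array([110.0, 390.0, 905.0, 1580.0])
print(evaluate(y_true, y_pred))
```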
We chose 16 widely used GPU benchmarks, listed in Table 1, to verify our prediction models. We use 12 applications from the Polybench benchmark suite [15] and four applications from the Rodinia benchmark suite [11]. We train both the PNN and BR-BPNN networks on these benchmarks with the different training and testing datasets described in Section 4.3. Then, we compare the accuracy of both models in search of a generalized network to model basic block execution counts.

First, we evaluate the performance of our trained models using the test dataset from the random split of the sample set. We calculate the MSE with respect to the ground truth for each benchmark. From Figure 4 (a), we find that both the PNN and BR-BPNN networks can predict basic block counts for unseen test inputs from the random dataset with an overall MSE of less than 0.175. For all benchmarks except gemm, gaussian, mvt, lud, lu, and correlation, the MSE of both networks is almost aligned and close to zero (MSE < 0.015). The other applications have an MSE of less than 0.080 for random test set prediction, except for the lu benchmark. On average, the PNN model shows 97.7% accuracy and the BR-BPNN model shows 98.1% accuracy for random prediction.

Figure 4 (b) shows the extrapolation of basic block counts to high input values when the models are trained using low input values as well as mixed input values. We find that the extrapolation of PNN is very close to that of BR-BPNN for almost all applications except the gemm, pathfinder, lud, and covariance benchmarks. PNN has significantly less error than BR-BPNN for the gemm, pathfinder, gesummv, mvt, lu, covariance, and doitgen benchmarks. On average, the PNN model shows 93.6% accuracy and the BR-BPNN model shows 92.8% accuracy for extrapolation learned from mixed input configurations. We find that the PNN network shows better accuracy for extrapolation than the BR-BPNN network. To further investigate this observation, we separate the low and high input values in the dataset by setting a cut-off value. We train the PNN and BR-BPNN networks with samples that have only low input values for each benchmark and evaluate both models on the extreme high values from the test dataset. Table 2 reports the Pearson and Spearman correlation coefficients for all benchmarks; the higher correlation values indicate a better linear relationship between the extrapolated and actual basic block counts for PNN.

We find significantly worse accuracy for lud, gaussian, and lu on both networks. Investigating these datasets shows that they have a significant number of zero basic block counts, and some of the counts are higher for low input instances while others are lower for high input values. On average, the PNN model shows 93.5% accuracy and the BR-BPNN model shows 95.6% accuracy for extrapolation learned only from low input configurations. A better view of the extrapolation is shown in Figure 5 for PNN. The heatmaps show the extrapolated basic block execution counts for high input values with respect to the actual basic block counts for those inputs. Each heatmap includes all basic block predictions for one benchmark, and the extrapolated counts should be close to the actual counts, so we expect all points to lie along the identity line (y = x). We find a higher density near the identity line for most benchmarks except gaussian, lu, lud, and nw. The heatmaps represent the closeness of our extrapolated counts to the actual counts.

In summary, we observe that both the PNN and BR-BPNN networks can accurately extrapolate basic block counts. Both networks have almost the same average error rate for the benchmarks used in this study. From the extrapolation on high and mixed input instances, we find that BR-BPNN works better for linearly separable problems while PNN fits better for applications with non-linear behavior. In other words, PNN is a more generalized network than the BR-BPNN network.

Figures 6 (a) and (b) show the change in test accuracy as we vary the training sample percentage for the PNN and BR-BPNN networks, respectively. This plot represents the relationship between model accuracy and training set size. For both networks, high test accuracy with a small fraction of training samples indicates that we have sufficient samples for each benchmark application. We find that test accuracy decreases with more samples for some benchmarks. Figure 6 (a) shows that the benchmarks nw, correlation, and covariance fit better with a small sample size for PNN. Although the accuracy for those applications improves slightly at first as the sample size increases, it decreases later with more samples. The benchmarks gramschmidt and lu show similar behavior for BR-BPNN in Figure 6 (b). We also observe that BR-BPNN shows steady accuracy for most applications, while PNN is more sensitive to the sample size. Thus, we conclude that test accuracy increases slightly for most benchmarks with increasing training sample size for both networks.

This work presents a basic block count prediction tool, BB-ML. BB-ML uses probabilistic and regression networks to predict the basic block execution counts of GPU applications. We build PNN, a probabilistic Poisson Deep Neural Network, to predict and extrapolate the basic block execution counts of GPU applications. We also employ BR-BPNN, a regression network, for basic block count extrapolation and compare it with PNN using benchmarks from two benchmark suites. The results suggest that both networks can extrapolate basic block counts for high input values and random input values of the applications. BR-BPNN fits linearly separable applications better, while PNN is more generalized and shows significant accuracy for both linear and non-linear applications. We will explore combining PNN and BR-BPNN into one network to capture the advantages of both. We will also show the utility of BB-ML on larger applications, such as ML applications.
Furthermore, we will integrate BB-ML into a modeling and simulation framework to harness its utility.

References:
Home | TOP500
LLVM Machine Code Analyzer
NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog
Programming Guide :: CUDA Toolkit Documentation
Intel Architecture Code Analyzer
TensorFlow: A System for Large-Scale Machine Learning
Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures
Hybrid, scalable, trace-driven performance modeling of GPGPUs
PPT-Multicore: Performance prediction of OpenMP applications using reuse profiles and analytical modeling
Estimating the WCET of GPU-Accelerated Applications Using Hybrid Analysis
Rodinia: A benchmark suite for heterogeneous computing
Machine Learning-Enabled Scalable Performance Prediction of Scientific Codes
A Bayesian regularization-backpropagation neural network model for peeling computations
Auto-tuning a high-level language targeted to GPU codes
Deep learning with Keras
COVID-19 patient count prediction using LSTM
Predictive Abilities of Bayesian Regularization and Levenberg-Marquardt Algorithms in Artificial Neural Networks: A
Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks
Biomechanical Study and Prediction of Lower Extremity Joint Movements Using Bayesian Regularization-Based Backpropagation Neural Network
Bayesian learning for neural networks
Software bug count prediction via AdaBoost.R-ET
Backpropagation neural network-based machine learning model for prediction of soil friction angle
PyTorch: An imperative style, high-performance deep learning library
Advanced supervised learning in multi-layer perceptrons - From backpropagation to adaptive learning algorithms
Quasi-Newton methods for training neural networks
NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs
Pearson correlation coefficient. Wikimedia Foundation
Spearman's Rank Correlation Coefficient
Citation count prediction: learning to estimate future citations for literature
A predictive model for solar photovoltaic power using the Levenberg-Marquardt and Bayesian regularization algorithms and real-time weather data
Predicting HPC parallel program performance based on LLVM compiler