Efficient State-space Exploration in Massively Parallel Simulation Based Inference

Sourabh Kulkarni and Csaba Andras Moritz

June 29, 2021

Abstract: Simulation-based Inference (SBI) is a widely used set of algorithms to learn the parameters of complex scientific simulation models. While primarily run on CPUs in HPC clusters, these algorithms have been shown to scale in performance when developed to be run on massively parallel architectures such as GPUs. While parallelizing existing SBI algorithms provides us with performance gains, this might not be the most efficient way to utilize the achieved parallelism. This work proposes a new algorithm that builds on an existing SBI method: Approximate Bayesian Computation with Sequential Monte Carlo (ABC-SMC). This new algorithm is designed to utilize the parallelism not only for performance gain, but also toward qualitative benefits in the learnt parameters. The key idea is to replace the notion of a single 'step-size' hyperparameter, which governs how the state space of parameters is explored during learning, with step-sizes sampled from a tuned Beta distribution. This allows the new ABC-SMC algorithm to explore the state space of the parameters being learnt more efficiently. We test the effectiveness of the proposed algorithm by learning parameters for an epidemiology model running on a Tesla T4 GPU. Compared to the parallelized state-of-the-art SBI algorithm, we get similar quality results in ∼100x fewer simulations and observe ∼80x lower run-to-run variance across 10 independent trials.

Simulation-Based Inference (SBI) shows great promise in a wide variety of scientific domains. It is used to learn the parameters of complex simulation models which attempt to capture the underlying physical processes. The learned parameters can then be used to generate predictions from the model. To give a few examples, SBI can be used to learn key parameters of elementary particle collision experiments [1], in biology to study the dynamics of protein networks [2], and in epidemiology to learn the key parameters of an epidemic such as COVID-19 in order to understand and predict its spread as well as to inform large-scale interventions [3]. Other areas of science where SBI algorithms are used are cosmology [4], [5], cognitive science [6], econometrics [7], and systems biology [2], [8], [9], [10]. SBI algorithms have been widely adopted in these disciplines because, unlike other statistical inference algorithms, SBI does not require additional criteria to be satisfied to perform parameter learning for a model; the ability to simulate the model suffices. The typical alternatives to SBI are the family of Markov Chain Monte Carlo (MCMC) algorithms [11], which require the ability to compute the 'likelihood function' of the model, and this function is intractable for many scientific models [12]. For example, in modelling an epidemic, we must consider subpopulations which are often not observable (e.g., the number of untested infections), and in doing so, the epidemiological model ends up having an intractable likelihood function. In such scenarios, SBI algorithms are the only option for parameter learning. Primarily, these SBI methods are implemented on CPUs [13].
This is not efficient, as these algorithms have a lot of potential for parallelism and hence can benefit from GPUs and other massively parallel compute architectures. At the core of all simulation-based inference algorithms is the ability to compute a large number of simulations with randomly sampled parameters. The simulations are then compared with real-world observations to gauge the quality of the parameters used in the simulation. The algorithms then iteratively converge to the best possible parameters. This massive number of simulations can, in most SBI algorithms, be performed in an embarrassingly parallel fashion per iteration [12]. Parallelized versions of existing SBI algorithms have been developed and have been shown to provide performance benefits [14], [15]. However, they are not the best way to utilize the achieved parallelism, as they were originally designed to perform best in the regime of progressing one simulation at a time. As part of this work, we perform experiments with these massively parallel versions of existing SBI algorithms to show that they produce inconsistent results across independent runs. They also tend to get stuck in local minima and hence provide sub-optimal results. With the ability to perform up to 100K simulations in parallel at a time [14], [15], we require novel ways to obtain the best possible progress from them and to avoid the issues that arise with using parallelized versions of existing algorithms.

In this work, we propose a new simulation-based inference algorithm which is designed to perform best in such a massively parallel regime. The algorithm builds on an existing SBI algorithm known as Approximate Bayesian Computation with Sequential Monte Carlo (ABC-SMC). The proposed algorithm exploits the achieved parallelism not only for performance gain, but also for a qualitative improvement in the results achieved. The key idea of this new algorithm is an efficient method of exploring the parameters' state space; this is achieved by a novel way of repurposing the notion of 'step size' in massively parallel stochastic optimization processes. The 'step size' (also known as the 'learning rate' in some literature) is an important hyperparameter in SBI and in stochastic optimization algorithms in general. At each stage of the algorithm, this hyperparameter captures the amplitude of the change introduced to the parameters being evaluated. In current methods, the step size is usually a single number, and it is tuned carefully as the algorithm progresses, usually having greater values initially and lower values as the algorithm approaches convergence. The key contribution of this work is the introduction of a new concept: the step-size distribution. When 100K simulations occur in parallel, using a single step size does not introduce enough 'diversity' in the parameters to make efficient use of all those simulations toward the convergence process. The proposed method, on the other hand, allows each of those 100K simulations to have an independent step size sampled from the step-size distribution, a tuned Beta distribution. Instead of using complex methods for tuning the single step size, the proposed method uses simple ratios that encode the progress of the algorithm to tune the PDF of the Beta step-size distribution. We call this new algorithm 'Parallel ABC-SMC with Beta-Distributed Step-Sizes', or P-ABC-SMC BDSS for short.
P-ABC-SMC BDSS is evaluated on its effectiveness in performing parameter learning for a stochastic epidemiology model used for understanding and predicting the spread of COVID-19. The algorithm is compared with the parallelized version of the current state-of-the-art ABC-SMC algorithm, in which step-size tuning is performed using MCMC; this baseline is called P-ABC-SMC MCMC for short. Experiments are performed on an Nvidia Tesla T4 GPU. We show that the new method provides better quality of learned parameters in a smaller number of simulations, with better consistency across independent trials. Across 10 independent trials, P-ABC-SMC BDSS consistently obtained better results than P-ABC-SMC MCMC in ∼100x fewer simulations, with ∼80x lower run-to-run variance. Experiments also confirm that P-ABC-SMC BDSS's performance efficiency improves with increasing degree of parallelism. While we use the epidemiology model as an example, P-ABC-SMC BDSS should be effective wherever the ABC-SMC method is used. Furthermore, this work shows promise not only for the ABC-SMC algorithm, but for all stochastic optimization algorithms. In these algorithms, there exists a method of tuning step sizes; these methods could be replaced by the notion of a tuned step-size distribution like the one presented in this work. This could open up new possibilities of developing parallelized versions of these stochastic optimization algorithms, and consequently of obtaining performance gains on contemporary parallel hardware architectures.

The rest of the paper is organized as follows: Section II provides background information on the epidemiology model used to test the new algorithm, the basics of SBI algorithms, and the current state-of-the-art ABC-SMC MCMC, upon which the proposed algorithm improves; Section III describes in detail the proposed P-ABC-SMC BDSS algorithm and its implementation; Section IV describes the experimental setup and the results obtained; Section V discusses the results and the implications of this work for the wider domain of stochastic optimization in the massively parallel regime.

In this work, we propose a new algorithm to perform SBI at a massively parallel scale. We demonstrate the effectiveness of this algorithm by comparing it to a parallelized version of the current state-of-the-art SBI. To make this comparison, both are used in an epidemiological application to understand and predict the spread of COVID-19 in a nation given its case data. This is done by using a stochastic epidemiological model, which can simulate, on a per-day basis, the case outcomes of a population given that we know the key epidemiological parameters. Hence, the task for both SBI algorithms is to find a distribution over parameters that can i) best explain the currently observed case data for a certain country, and ii) accurately predict the future case data for that country. In this section, we shall first establish the mathematical notation that will be used throughout the paper, which is typical of these statistical inference methods but differs from other ML literature. Second, we shall describe the epidemiological model in detail and discuss how SBI algorithms are used to learn parameters for the model using real-world case data. Finally, we shall discuss the current SBI algorithms: Approximate Bayesian Computation (ABC) and its more advanced version, ABC with Sequential Monte Carlo (ABC-SMC).
In statistical inference, a model is defined to be a joint probability distribution over its parameters (denoted by θ) and its observed variables (denoted by x). The model as a probability distribution is hence denoted as p(θ, x). The parameters are known to exist in a certain domain a priori, either due to expert knowledge or according to how the model is designed. The distribution to which the parameters belong before the learning process starts is known as the prior and is denoted by π(θ). The models are generative in nature and can be simulated for a given set of parameters to generate observations D_s ∼ p(x|θ). The evidence, or real-world observations, is denoted by D_o. The likelihood of observing D_o for a given set of parameters θ is called the likelihood function, L = p(D_o|θ). The distribution obtained after the parameters are learnt using the observed data D_o is called the posterior, p(θ|x = D_o).

It is important to note that these statistical 'inference' algorithms are solving a 'learning' problem, and the expression 'parameter inference' in the statistics literature corresponds to 'parameter learning' in ML literature. This should not be confused with 'inference' as used in the ML literature, which often means the process of estimating an output with a model, given the input. In this paper, the term 'parameter inference' from the statistical inference literature is referred to as 'parameter learning'.

Fig. 1: Overview of the epidemiology model flow. The population is divided into 6 sub-populations. On a daily basis, the number of people moving from one sub-population to another is simulated with a Poisson process, where the rate parameters of these processes are given by the current sub-populations and the transmission parameters (infection rate, death rate, etc.). The transition from susceptible to infected is a function of the observed sub-populations, which captures the response of a population to an increasing number of cases [15].

The epidemiology model considered in this work belongs to the class known as compartmental models. In this model class, the population is divided into several sub-populations, and the spread of infectious disease is modelled as the flow of people from one sub-population to another. The model considered in this work [3] attempts to capture the spread of COVID-19 using six sub-populations, three observed and three unobserved. We include a brief overview of the model, but refer the reader to the Supplementary Material of [3] for further detail. The transmission across these sub-populations is modeled with a Poisson process, approximated in discrete 1-day time bins using the tau-leaping method [16]. The model consists of 8 parameters, θ = (α_0, α, n, γ, β, δ, η, κ), each with a uniform prior distribution. These prior ranges were taken as-is from the original model description [3]; they signify the reasonable ranges in which the parameters of interest could lie. When we simulate the model with a sampled set of these parameters, we get the following state vector: X = [S, I, A, R, D, R_u], which consists of the sub-populations of Susceptible people, undocumented Infected, Active confirmed cases, confirmed Recoveries, confirmed fatalities (Dying), and the unconfirmed Removed, R_u. The Removed sub-population, R_u, comprises those who have recovered or died but have not been tested. This simulation is typically performed over several days or months, and the generated data can be compared with the real-world values of the observable sub-populations.
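To make the notation concrete, the following minimal Python sketch maps the generative setup above onto code. The parameter names follow the model, but the numeric prior bounds are placeholders of our own; the actual ranges are given in the Supplementary Material of [3].

```python
import numpy as np

rng = np.random.default_rng(0)

# The 8 model parameters. The uniform prior bounds below are illustrative
# placeholders only; the real ranges come from [3].
PRIOR_BOUNDS = {
    "alpha0": (0.0, 1.0),   # base infection rate
    "alpha":  (0.0, 1.0),   # coefficient of the case-dependent rate term
    "n":      (0.0, 10.0),  # exponent of that term
    "gamma":  (0.0, 1.0),   # positive test rate
    "beta":   (0.0, 1.0),   # recovery rate
    "delta":  (0.0, 1.0),   # fatality rate
    "eta":    (0.0, 1.0),   # testing-effectiveness factor
    "kappa":  (0.0, 10.0),  # initial unobserved-infected fraction of A_0
}

def sample_prior():
    """Draw one parameter vector theta* ~ pi(theta)."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in PRIOR_BOUNDS.items()}

theta_star = sample_prior()
```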
One of the key challenges in this model is that the state vector X is partially observed; i.e., only the A, R, and D values are available from observed data. This makes the likelihood function p(D_o|θ) intractable for this model, as the unobserved sub-populations S, I, and R_u have to be 'integrated out': the likelihood function is defined over all possible values of those three sub-populations, and the only way to obtain likelihood values for use in the learning process would be to integrate the likelihood function over all possible values of those three sub-populations. Instead, simulation-based inference such as ABC is used to perform inference over this model (see the following subsections).

Parameter α_0 refers to the base infection rate, while α is the coefficient of the function that captures the changes in infection rate based on the observed sub-populations (A, R, D), and n is the exponent of that function. Based on these parameters, the total infection rate g is assumed to follow a function of the observed sub-populations (Eq. 4; its exact form is given in [3]). This function can be modified to capture additional changes to the infection rate based on the A, R, D values or even using external data. The parameters γ, β, and δ are the positive test rate, recovery rate, and fatality rate, respectively. The parameter η captures the effectiveness of testing protocols, as the rate for unconfirmed infected people to recover without ever being confirmed is given by ηβ. The initial value parameter, κ, encodes the number of unobserved infected cases, as a fraction of A_0, at the start of the simulation.

The underlying COVID-19 time-series data, provided by Johns Hopkins University [17], contains daily case numbers for [A, R, D]. In its first step, the model initializes the remaining variables with R_u = 0, I_0 = κ · A_0, and S = P − (A_0 + R_0 + D_0 + I_0), with P being the total population count at the first time point. The second step is to calculate the hazard function h, which provides the average daily flows between the sub-populations:

h(S, I, A, R, D, R_u) = (gSI/P, γI, βA, δA, βηI),

with g as described in Eq. 4. The third step is to sample the transmission numbers from a Poisson distribution with the rate parameters set according to these average numbers. Instead of Poisson sampling with h as the parameter, we chose an approximation with normal distributions with mean h and standard deviation √h, and use the floor of the sampled numbers. This approximation allows us to perform highly optimized parallel simulations, as seen in the next section. The fourth step is to apply the sampled transmission amounts to obtain updated numbers for the next day (S → I, I → A, A → R, A → D, and I → R_u). The second to fourth steps are repeated for the number of days for which case data is available. Eventually, the numbers for A, R, and D can be compared to the real measurements.
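To make the per-day update concrete, here is a minimal Python sketch of steps two through four: hazard computation, the normal approximation to the Poisson draws, and the state update. The infection rate g is passed in precomputed, since the exact form of Eq. 4 is given in [3]; the clipping to non-negative counts is our own guard rather than a documented detail of the original implementation.

```python
import numpy as np

def step_one_day(state, theta, g, P, rng):
    """Advance the compartmental model by one tau-leaping day.

    state: dict with keys S, I, A, R, D, Ru; theta: dict of model
    parameters; g: current total infection rate (Eq. 4, see [3]);
    P: total population. Poisson(h) is approximated by
    floor(Normal(h, sqrt(h))), as described in the text.
    """
    S, I, A = state["S"], state["I"], state["A"]
    # Hazard function: average daily flows between sub-populations.
    h = np.array([
        g * S * I / P,                      # S -> I   new undocumented infections
        theta["gamma"] * I,                 # I -> A   confirmed positive tests
        theta["beta"] * A,                  # A -> R   confirmed recoveries
        theta["delta"] * A,                 # A -> D   confirmed deaths
        theta["eta"] * theta["beta"] * I,   # I -> Ru  unconfirmed removals
    ])
    flows = np.floor(rng.normal(h, np.sqrt(h)))
    flows = np.clip(flows, 0.0, None)       # keep the counts non-negative
    s2i, i2a, a2r, a2d, i2ru = flows
    return {
        "S": S - s2i,
        "I": I + s2i - i2a - i2ru,
        "A": A + i2a - a2r - a2d,
        "R": state["R"] + a2r,
        "D": state["D"] + a2d,
        "Ru": state["Ru"] + i2ru,
    }
```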
The Bayesian statistical inference approach to learning parameters is to obtain the posterior over the parameters θ of a model p(θ, x) given observations D_o, which is given by Bayes' rule:

p(θ|x = D_o) = p(D_o|θ) π(θ) / p(D_o).

As discussed earlier, the likelihood function p(D_o|θ) is intractable, since S_t, I_t, and R_u,t are unobserved. This precludes some approximate Bayesian inference methods such as MCMC. In the ABC approach, the model simulations are utilized to perform parameter learning. First, we sample the parameters θ from their prior, θ* ∼ π(θ). Next, we simulate a forward pass of the simulator (as described in Section 2.2, for example) to generate observations D_s ∼ p(x|θ = θ*) over the number of days we have data for. The simulated observations are then compared to the real-world evidence using a distance function dist(D_s, D_o); for this model we used the Euclidean distance [3]. Finally, the sampled parameters θ* are accepted if the distance function is less than a certain tolerance value ε, i.e., dist(D_s, D_o) ≤ ε. This is repeated until we accept the target number of posterior samples. It can be shown that as the tolerance ε approaches 0, the approximate posterior converges to the true posterior [18]. The approximation also improves with a larger number of posterior samples. Fig. 2 shows an overview of the ABC process as a flow chart. To summarize, in ABC we aim to obtain samples from an approximation to the posterior,

p(θ | dist(D_s, D_o) ≤ ε) ∝ π(θ) · p(dist(D_s, D_o) ≤ ε | θ),

where D_o is the ground-truth data, D_s is simulated data depending on θ, and π(θ) is the prior [3]. The dist function is the Euclidean distance [3].

The number of simulations that need to be performed to obtain samples below a certain tolerance value increases exponentially as the tolerance value decreases. Hence, the vanilla ABC algorithm becomes infeasible for achieving arbitrarily small tolerance values. Instead of choosing a fixed tolerance, sequential Monte Carlo can be used to transform an initial set of samples into a high-quality set by applying ABC with a decreasing sequence of tolerances. This algorithm is called ABC-SMC [3], [19] and shall be explained in the next subsection.

By combining the ABC algorithm with the Sequential Monte Carlo process, we get ABC-SMC. In essence, the ABC-SMC algorithm performs repeated applications of Bayes' rule via the ABC process to obtain posterior samples of better and better quality. ABC-SMC starts with a large tolerance value ε_0. Then, through a sequence of several stages, the algorithm obtains sequentially lower tolerance values until the target tolerance value ε_t is reached. In each of its stages i, ABC-SMC processes the posterior of the previous stage, p_{i−1}(θ|x = D_o), to form the prior of the current stage, π_i(θ). In turn, the posterior of the current stage becomes the prior of the next stage. Fig. 3 provides an overview of the ABC-SMC algorithm. As the figure shows, the algorithm builds on the primary ABC framework, and each stage of ABC-SMC resembles the ABC algorithm. The key differences between ABC and an ABC-SMC stage are: i) the preparation of the current stage's prior from the previous stage's posterior, ii) how parameters are sampled from this prior, iii) the notion of introducing perturbations to sampled parameters, and iv) how the tolerance for the next stage is computed based on the samples accepted in the current stage. We shall now go through in detail how each of these steps is performed.

The initial stage of ABC-SMC, since there is no previous stage to obtain samples from, is a simple ABC stage with a large tolerance value ε_0. When the required number of samples N is obtained, they are used to determine the tolerance for the next stage, ε_1. For all stages (including the 0th), this is done with the help of the survival ratio α: the tolerance for the next stage is set to the α-quantile of the distance metrics of the samples accepted in the current stage. Here, the survival ratio α is a hyperparameter; for the purposes of this work, we set it to α = 0.5. Hence, at each stage, the distance metric dist(D_s, D_o) at the 50th percentile (i.e., the median) of the accepted samples at stage i becomes the tolerance ε_{i+1} for the next stage. Naturally, for all values of α ∈ (0, 1), the tolerance values are non-increasing.
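A minimal sketch of one such ABC accept/reject stage and the survival-ratio tolerance schedule is shown below. The sample_params and simulate arguments are assumed callables standing in for the stage prior and the forward simulator, and the quantile form of the schedule is our reading of the median rule described above.

```python
import numpy as np

def abc_stage(sample_params, simulate, D_o, eps, n_target, rng):
    """One ABC stage: keep drawing theta until n_target samples have a
    simulated trajectory within eps of the observations."""
    accepted, distances = [], []
    while len(accepted) < n_target:
        theta = sample_params(rng)            # theta* ~ current prior
        D_s = simulate(theta, rng)            # D_s ~ p(x | theta*)
        d = np.linalg.norm(D_s - D_o)         # Euclidean dist(D_s, D_o)
        if d <= eps:
            accepted.append(theta)
            distances.append(d)
    return accepted, np.array(distances)

def next_tolerance(distances, alpha=0.5):
    """eps_{i+1} as the alpha-quantile of the accepted distances; with
    alpha = 0.5 this is the median rule used in the paper."""
    return np.quantile(distances, alpha)
```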
At the start of every stage other than the 0th, the prior of the current stage is computed using the posterior of the previous stage, p_{i−1}(θ|x = D_o), and the tolerance of the current stage, ε_i, as follows:

π_i(θ) = {θ ∈ p_{i−1}(θ|x = D_o) : D_θ ≤ ε_i},

i.e., the prior of the current stage, π_i(θ), is the set of all parameters θ in the posterior of the previous stage that have a distance metric D_θ smaller than the tolerance ε_i. For a survival ratio α = 0.5, only the top half of the samples from the previous stage's posterior are included in the current stage's prior. Next, each of the samples in the current stage's prior π_i(θ) is assigned a weight w_i(θ); a sample is given a higher weight the lower its distance metric is compared to the current tolerance ε_i. These weights are then used to perform weighted random draws θ* from π_i(θ). After drawing a sample θ* from the prior, a perturbation is introduced using a Gaussian random walk:

θ** ∼ N(θ*, s).

Here, the variance of the Gaussian used to perform a random-walk step is called the step size s. This ensures that new samples are explored in the vicinity of the samples accepted in the last stage, in the hope of finding samples with a lower distance metric. For the perturbations to be effective, the step size needs to be tuned carefully: too small, and the ABC-SMC process gets stuck in a local minimum; too large, and the process no longer obtains any good samples. After obtaining a perturbed sample θ**, the subsequent steps are similar to ABC: simulate the model, compute the distance function, and perform accept/reject. We also maintain the ratio of accepted to rejected samples θ**, which is used to tune the step size s. It is here, at the steps of tuning the step size and introducing perturbations, that the proposed algorithm differs from the current state-of-the-art. Hence, to understand the significance of the proposed algorithm, it is essential that we review in depth the step-size tuning and perturbation steps as they are performed in the current state-of-the-art algorithm.

In the state-of-the-art version of ABC-SMC [3], step-size tuning is performed using the Metropolis-Hastings (MH) algorithm [20], one of the most widely used MCMC algorithms. The purpose of using this algorithm is two-fold: i) to ensure that the perturbation takes the parameter into a better region than its original location, and ii) to tune the step size so that the steps taken during perturbation are optimal. In this algorithm, the parameter perturbation θ** is considered a proposal. For this proposal, the algorithm computes two transition probabilities: that of transitioning from θ* to θ**, and the other way around, from θ** to θ*. Using these transition probabilities, the algorithm computes a 'Metropolis acceptance ratio' A_M(θ**, θ*). The proposal θ** is only accepted if A_M(θ**, θ*) > U(0, 1), where U(0, 1) is a random draw from a uniform distribution between 0 and 1. For the θ** that are accepted for simulation, we compute another, empirical acceptance ratio A_E, by finding out how many of these perturbed parameters achieve a distance metric dist(θ**) below the tolerance ε_i. In MH, a step-size adaptation is done to keep this empirical acceptance ratio as close as possible to the target acceptance ratio A_T = 0.234: the step size is increased when A_E exceeds A_T and decreased otherwise. This adaptation is usually performed in the first 10% of the ABC-SMC process, after which the rest of the process continues with a fixed step size; freezing it avoids the step size becoming infinitesimally small, which would bring the ABC-SMC process to a complete stop. A sketch of this baseline tuning-and-perturbation step is given below.
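The following sketch illustrates the baseline mechanism. The exact MH bookkeeping and update rule are not spelled out in the text, so this is an assumption-laden reconstruction: we assume a symmetric Gaussian proposal (so the transition terms cancel and A_M reduces to a density ratio) and a Robbins-Monro-style multiplicative step-size update.

```python
import numpy as np

TARGET_ACCEPT = 0.234  # A_T from the text

def adapt_step_size(s, empirical_accept, lr=0.5):
    """Grow s when the empirical acceptance ratio A_E exceeds A_T,
    shrink it otherwise (a common adaptation rule, assumed here)."""
    return s * np.exp(lr * (empirical_accept - TARGET_ACCEPT))

def mh_perturb(theta_star, s, log_density, rng):
    """Propose theta** ~ N(theta*, s) and screen it with a Metropolis
    test. theta_star is an ndarray; log_density is an assumed callable
    returning the log of the weighted stage prior. Returns the accepted
    proposal, or None if rejected."""
    theta_dd = theta_star + rng.normal(0.0, np.sqrt(s), size=theta_star.shape)
    A_M = min(1.0, np.exp(log_density(theta_dd) - log_density(theta_star)))
    return theta_dd if rng.uniform() < A_M else None
```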
In this section, we shall first describe how we developed a massively parallel version of the current ABC-SMC MCMC algorithm for the epidemiology model. Then, we shall discuss some of the issues encountered in this approach and the sources of incompatibility between MCMC-based step-size tuning and massively parallel simulations. Finally, we shall go in depth on the proposed algorithm, which takes a novel approach to how the concept of step size can be reinvigorated for the massively parallel simulation regime.

In earlier work [14], [15], we explored the possibility of massively parallel simulations of the epidemiology model for COVID-19, achieving up to 500k parallel 50-day simulations of the model. Using those massively parallel simulations, we showed that ABC for parameter learning can be accelerated using GPUs and other emerging hardware accelerators. In this work, as a first step and to set up a baseline, we do the same for the current state-of-the-art ABC-SMC MCMC algorithm, i.e., we develop P-ABC-SMC MCMC. The parallelized simulation kernel used in the previous work's ABC process is used as-is, which demonstrates the re-usability and modularity of the approach. We keep the degree of parallelism at 100k, but the simulation is done for 120 days. The challenge in this work is to perform parallelized weighted draws of θ*, parallelized perturbations for θ**, and, most importantly, a parallelized MH algorithm for step-size tuning. The parallelized draws from the current stage's prior π_i(θ) were enabled by a tensorized categorical distribution. For parallelized perturbations through the Gaussian random walk, we employ the reparameterization trick, similar to the one used in variational auto-encoders [21], where all parameters are sampled from U(0, 1) and then transformed into their corresponding domains. For step-size tuning, the adaptation is done once per batch of 100k parallel simulations: during the simulation, we compute the empirical acceptance ratio, which is then used to tune the step size for the next batch of 100k parallel simulations. This is done for the first 10% of runs of the P-ABC-SMC process. This approach provides tremendous performance gains: the MATLAB code of the non-parallelized state-of-the-art (ABC-SMC MCMC) implemented in [13] on a Xeon CPU reportedly takes ∼2 hours, while the parallelized version we developed (P-ABC-SMC MCMC), run on the Nvidia Tesla T4, takes only ∼400 seconds, a ∼18x performance gain. This provides a strong baseline for the proposed algorithm, which performs even better, as evidenced in later sections.

As discussed earlier, the main contribution of the proposed algorithm is generalizing the concept of step size to the massively parallel simulation regime. This is done by replacing the 'regular' Gaussian random walk with its single MCMC-tuned step size (described in the previous section) with a hierarchical Gaussian random walk, in which the step size of the random-walk process is itself sampled from a tuned Beta distribution. Hence, each of the 100k simulations being run in parallel has its own unique step size, sampled from a Beta distribution. The resultant distribution of perturbations now contains a heterogeneous mix of step sizes which efficiently explores the parameter space.
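A tensorized sketch of the draw-and-perturb step for one batch is shown below; it mirrors the approach of [14], [15] rather than reproducing the authors' exact kernel. With a scalar step_size it behaves like the P-ABC-SMC MCMC baseline (one shared step size per batch), while passing a (batch, 1) array of per-simulation step sizes yields the BDSS behaviour introduced above.

```python
import numpy as np

def parallel_draw_and_perturb(prior_samples, weights, step_size, rng,
                              batch=100_000):
    """Tensorized draw-and-perturb for one batch of parallel simulations.

    prior_samples: (N, n_params) accepted thetas from the previous stage.
    weights: (N,) normalized sample weights.
    step_size: scalar (MCMC baseline) or (batch, 1) array (BDSS).
    """
    # Weighted categorical draw of `batch` parents from the stage prior.
    idx = rng.choice(len(prior_samples), size=batch, p=weights)
    parents = prior_samples[idx]                       # theta*
    # Gaussian random walk applied to the whole batch at once.
    noise = rng.normal(0.0, np.sqrt(step_size), size=parents.shape)
    return parents + noise                             # theta**
```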
In this section, we shall first go through the high-level motivation for the development of this algorithm, followed by a detailed description of the algorithm itself.

One of the key concepts in machine learning and stochastic optimization is the exploration-exploitation trade-off. The goal of an ML algorithm is to find the best set of parameters for a model given data. This involves iteratively taking steps in the parameter state space, and the size of the steps taken at any given iteration is one of the most important aspects of the algorithm. Initially, the step size is large, to quickly find areas of lower metric (or 'loss') values; this is known as the exploration phase of the algorithm. In the later stages of the algorithm, the step size is small, to avoid moving out of the current minimum and to obtain the best possible local value; this is known as exploitation. Hence, in the initial stages, the algorithm trades off exploitation for exploration, while in the later stages, it trades off exploration for exploitation.

In the massively parallel regime, this exploration-exploitation trade-off is no longer required: there is enough parallelism for both. By carefully designing how step sizes are allocated to the 100k simulations, a balance of exploration and exploitation can be achieved that gets the best of both. While some of the 100k simulations explore new space, others exploit the small gains that can be achieved with small step sizes. This balance provides unique benefits throughout all stages of the ABC-SMC algorithm. The important question, then, is how to strike this balance of allocating parallel simulations to explore or exploit (and everything in between). This is the main intuition behind this work: the balance is achieved with the help of a tuned Beta distribution.

The Beta distribution is a continuous probability distribution defined on the interval [0, 1]. The shape of the Beta distribution's Probability Density Function (PDF) is defined by two shape parameters, denoted (confusingly so) as Beta(α, β). It is through changing these shape parameters that we can modify the Beta distribution's PDF. For the purposes of this algorithm, this tunable PDF is used as an allocation device which determines step sizes for each of the 100k simulations being run in parallel. Hence, the main reason for choosing the Beta distribution instead of, for example, a truncated Gaussian, is how malleable its PDF is through its shape parameters. These shape parameters provide an efficient way to perform the step-size allocation in a dynamic fashion. As the mass of the PDF shifts closer to the near-zero region (see Fig. 4), it naturally adapts to this limit. A truncated Gaussian, on the other hand, does not provide the same control: its PDF does not change naturally in the near-zero region, but rather is cut off abruptly.

Fig. 4 shows 3 typical distributions of step sizes sampled from 3 different configurations of the Beta shape parameters. The shape parameters are chosen to be indicative of the initial (red), intermediate (yellow), and final (blue) stages of the ABC-SMC algorithm. In the initial stages, the idea is to allow for exploration, which requires that we allow for all possible step sizes, but with a slight bias toward smaller step sizes. This is captured by the Beta(1, 2) distribution.
Fig. 6: The Parallel ABC-SMC with Beta-Distributed Step-Sizes (P-ABC-SMC BDSS) algorithm. Highlighted in blue is the section changed from the original ABC-SMC toward the development of parallel ABC-SMC. Highlighted in red is the section involving the novel perturbation technique, with the step-size distribution and the sampling of Beta-distributed step sizes during the perturbation step. The process of Beta distribution tuning is also shown.

As the ABC-SMC process progresses, it becomes increasingly unlikely that a very large step size would yield better results, so the bias should move increasingly toward lower values and the spread should tighten. These changes are captured by the distributions Beta(0.5, 5) and Beta(0.1, 15). It can be clearly seen that, through the progression of the P-ABC-SMC algorithm, the MCMC-tuned step results in perturbations that are initially moderately exploratory, but quickly become biased toward pure exploitation in the intermediate and late stages. In contrast, the perturbations emerging from the BDSS approach, while becoming increasingly biased toward exploitation, still maintain an exploratory aspect thanks to the wider tails of the distributions throughout the P-ABC-SMC algorithm. As shall be seen in the next section, this difference in the distribution of the perturbations between the MCMC and BDSS approaches provides great benefits in terms of avoiding local minima and efficiency in the number of simulations required.

Although the plots in Fig. 4 show what the typical shapes of the Beta distribution should look like, we still need a way to ensure the shape changes according to the P-ABC-SMC process. To ensure the effective adaptation of the shape of the Beta's PDF, we employ tuning steps, which are discussed in detail now. In P-ABC-SMC, the progress of the ABC-SMC algorithm is captured by two metrics: the number of stages completed, and the tolerance of the current stage. In the case of P-ABC-SMC with MCMC-tuned step sizes, we did not need to take these factors into consideration; the tuning was performed using the Metropolis acceptance ratio and the empirical acceptance ratio (see the background section). In P-ABC-SMC BDSS, we do not use MCMC, and we also do not perform any intermediate accept/reject steps based on transition probabilities. Instead, we focus on the two metrics above to track the progress of the P-ABC-SMC algorithm and use them to tune the Beta distribution. The perturbation step of the proposed algorithm is as follows:

s ∼ Beta(α_i, β_i), (16)

θ** ∼ N(θ*, s), (17)

with α_i = ε_i/ε_1 and β_i = 2i. Here, we can see that the shape parameters of the Beta are tuned using the current tolerance value ε_i and the current stage number i of the P-ABC-SMC process. In the regime of operation, α ∈ [0, 1] and β ∈ [1, ∞), and they loosely correspond to the mean and precision (i.e., inverse variance) of the Beta distribution. Hence, by setting α_i = ε_i/ε_1, we move the mean of the step-size distribution closer to 0 as we keep obtaining better tolerances. At the same time, as the number of stages increases, β_i = 2i also increases, effectively decreasing the variance. The new P-ABC-SMC BDSS algorithm is summarized in Fig. 6; a code sketch of this tuning step is given below. There are two other methods of parameterizing a Beta distribution: the mean-variance and the mode-concentration parameterizations. We tried both of these, but the original (α, β) parameterization performs best.
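Here is a minimal sketch of this tuning-and-perturbation step, plugging into the tensorized draw shown earlier. Equations (16) and (17) are reconstructed from the surrounding text, so treat the exact expressions for α_i and β_i as our reading of the paper rather than verbatim definitions; the sketch assumes stage_i ≥ 1 so that α_i ∈ (0, 1] and β_i ≥ 2.

```python
import numpy as np

def bdss_step_sizes(eps_i, eps_1, stage_i, batch, rng):
    """One step size per parallel simulation, s ~ Beta(alpha_i, beta_i),
    with alpha_i = eps_i / eps_1 and beta_i = 2 * i (Eq. 16 as
    reconstructed in the text)."""
    alpha_i = eps_i / eps_1   # mean drifts toward 0 as tolerances improve
    beta_i = 2.0 * stage_i    # grows with the stage count, tightening the spread
    return rng.beta(alpha_i, beta_i, size=(batch, 1))

def bdss_perturb(parents, eps_i, eps_1, stage_i, rng):
    """Perturbation (Eq. 17): theta** ~ N(theta*, s), one independent s
    per row of the (batch, n_params) parents array."""
    s = bdss_step_sizes(eps_i, eps_1, stage_i, parents.shape[0], rng)
    return parents + rng.normal(0.0, np.sqrt(s), size=parents.shape)
```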
Likelihood-based MCMC methods, such as Gibbs sampling or Metropolis-Hastings [20], give asymptotic guarantees of converging to the true posterior of the model as the number of samples obtained approaches infinity. This guarantee is based in part on the correctness of the likelihood function. ABC-SMC, on the other hand, like all SBI methods, is likelihood-free. In these methods, the likelihood function is approximated by a distance metric, making them inherently approximate. To accept samples into the approximate posterior, these algorithms have an artificial cut-off at the tolerance value ε. Hence, the only correctness claim that can be made for SBI algorithms is that as this tolerance value approaches zero, the approximate posterior converges to the true posterior. There are no asymptotic guarantees, though: even after an infinite number of ABC-SMC steps, the tolerance is not guaranteed to approach zero. Hence, ABC-SMC algorithms are more accurate when they manage to accept approximate posterior samples at lower tolerance values. This applies to all ABC-SMC algorithms, including the two we test in this work. The MCMC-based and BDSS approaches to ABC-SMC differ only in how they propose samples to the ABC step; the actual acceptance and the determination of the tolerance for the next stage are done in the same way. Hence, we believe that the same guarantees hold for both algorithms. Given the lack of theoretical guarantees, we instead validate these algorithms on the ability of the approximate posterior samples they accept to predict unseen data. This is described in more detail in Section 4.2.

To test the baseline and proposed algorithms, we use them to learn parameters for the stochastic epidemiology model (Section 2.1). The model has a set of 8 parameters that govern how an epidemic spreads through the population of a nation. The goal of the parameter learning process is to obtain the parameters with the smallest value of the distance metric with respect to the real-world case data. For the purposes of this work, we trained on 120-day case data for Italy. The algorithms are evaluated on three metrics: best tolerance achieved, simulation efficiency, and run-to-run consistency. The best tolerance achieved demonstrates the quality of the samples obtained: the lower the tolerance, the better the parameters fit the real-world data. The simulation efficiency is measured by how many simulation runs are required before the algorithm reaches its best possible tolerance. The run-to-run variance describes how consistent the algorithm is, both in the lowest tolerance achieved and in the simulation efficiency. Finally, we also evaluate the learnt parameters on their ability to predict future cases. This is done by using the 120-day data to learn the parameters, and then using those learnt parameters to recreate the case data for that 120-day window as well as 30 additional days. These 30 days act as a test data set and allow us to measure the accuracy of the parameters.

To compare the new P-ABC-SMC BDSS with the baseline P-ABC-SMC MCMC, we perform the task of learning model parameters from the 120-day case data of Italy. We conduct 10 independent trials to gauge the run-to-run variance of both algorithms. These trials are conducted at 5 different levels of parallelism: 10, 100, 1k, 10k, and 100k simulations per run. The number of samples to be collected is set to 1000. For consistency, we establish a stopping criterion: the P-ABC-SMC inference process is stopped at the stage that reaches 1000 simulation runs. Since additional simulation runs could still be required for that stage to finish, the actual number of runs varies from 1000 to 1900. At each stage, we record the tolerance achieved by each of the algorithms. This allows us to compare the quality of the samples (tolerance achieved) against the number of simulation runs required to obtain them.
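A small sketch of the evaluation pipeline implied above: the Euclidean distance over the observable [A, R, D] series, and the 120-day fit window with a 30-day hold-out. The array layout is an assumption of ours.

```python
import numpy as np

def case_distance(sim_ard, obs_ard):
    """Euclidean distance between simulated and observed case data;
    both arrays are shaped (days, 3) for the [A, R, D] series."""
    return np.linalg.norm(sim_ard - obs_ard)

def split_train_test(cases_ard, train_days=120, test_days=30):
    """120-day window used for learning, plus 30 held-out days used as
    the predictive test set described in the text."""
    train = cases_ard[:train_days]
    test = cases_ard[train_days:train_days + test_days]
    return train, test
```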
Hence, with these two metrics, we generate 5 plots, one for each level of parallelism, in Fig. 7. Fig. 7 provides a detailed visualization of the comparison between the two approaches. When the degree of parallelism is low (see Fig. 7a), with only 10 simulations per run, the P-ABC-SMC MCMC approach is clearly superior, as the proper tuning of the individual step size yields better results than drawing 10 samples from a Beta distribution with wide tails. With an increasing degree of parallelism, though, P-ABC-SMC BDSS starts to show its benefit. At just 100 simulations per run (see Fig. 7b), we can observe that while P-ABC-SMC MCMC gains substantially, P-ABC-SMC BDSS performs even better. Half of the trials perform comparably to P-ABC-SMC MCMC; the other half comfortably outperform it, eventually approaching the best tolerance values achieved even by the later experiments with higher degrees of parallelism, though they take the entirety of their allocated simulation-run budget to do so. With the number of simulations per run at 1000 (see Fig. 7c), we observe that 3 of the 10 P-ABC-SMC MCMC trials break out of the apparent local minimum at a tolerance value of ∼10 and begin to converge toward lower values, though never converging fully. On the other hand, P-ABC-SMC BDSS manages to converge to the best achieved tolerance in 6 of the 10 runs, and does so in just 250 simulation runs instead of the 1000-run limit. The performance gap widens further at 10,000 simulations per run (see Fig. 7d), where the performance gain of P-ABC-SMC MCMC is only slight, while P-ABC-SMC BDSS manages to converge fully in 9 of the 10 trials, all of them in fewer than 150 simulation runs. Finally, at 100k simulations per run (see Fig. 7e), we see benefits for both P-ABC-SMC MCMC and BDSS, although the performance of the BDSS approach is vastly better: all 10 trials consistently reach close to the best tolerance in a span of just 10-20 simulation runs. P-ABC-SMC MCMC slowly approaches the best tolerance achieved by the proposed algorithm, but the process stops short in all trials, and one of the trials remained stuck at the local minimum.

Hence, the BDSS approach is clearly superior to the ABC-SMC MCMC approach, especially when employed in the massively parallel regime of P-ABC-SMC, for which it was developed. It can also be seen that while it performs best at a degree of parallelism of 100k, it is also beneficial at degrees of parallelism as low as 100, which is where it starts to show benefits over P-ABC-SMC MCMC. The reasons for these results and their implications are discussed further in the next section. P-ABC-SMC BDSS achieves an average tolerance of 3.44 in 10 runs, lower than the 4.01 that P-ABC-SMC MCMC achieves in ∼1300 runs. Hence, we report that the proposed algorithm utilizes ∼100x fewer simulations to achieve better tolerance values; as shown in Section 4.3, this translates directly to runtime savings. Moreover, at just 10 runs, the variance of the tolerance across the 10 independent trials is just 0.002, while the variance of P-ABC-SMC MCMC at ∼1300 runs is 0.16.¹

We use the parameters learnt from the 120-day training data to simulate the case data of those 120 days and an additional 30 days. The case data in question are the numbers of confirmed active (A), confirmed recovered (R), and confirmed deaths (D) for Italy.
If the learnt parameters are good, the simulated case data should closely resemble the real-world case data. It is important to note that the similarity of the simulated case data should be understood as 'samples from the same underlying distribution' rather than as a direct match to the real-world data. The purpose of learning the parameter distributions of the epidemiology model is to capture the underlying natural process of the epidemic, so that the simulated data from all posterior samples is the model's way of capturing all realistic trajectories the case data could take, given the parameters we learnt from the one real-world trajectory of case data we observed.

One of the key strengths of the proposed algorithm is its efficiency in the number of simulations required to obtain good solutions in the parameter space. This is exemplified in the experiments performed here. We create simulated 120-day case data for Italy using 'snapshots' of the intermediate posterior parameters when the algorithms are 10 and 1000 runs into the ABC-SMC process, respectively. The degree of parallelism is set to 100k simulations per run, and the number of samples accepted is 1000. For each algorithm, we plot the median along with the 99th-percentile range of the case-data simulations generated from those 1000 parameter samples. For reference, we also plot the real-world case data in black. Fig. 8 shows the plots generated from these experiments. As seen in Fig. 8a, the parameters obtained by the P-ABC-SMC MCMC algorithm in 10 simulation runs are not even close to convergence and do not generate simulated case data close to the real-world case data for Italy. On the other hand, already at 10 simulation runs, P-ABC-SMC BDSS generates simulated case data that fits the real-world case data very well (see Fig. 8b). Figs. 8c and 8d show the simulated case data for P-ABC-SMC MCMC and P-ABC-SMC BDSS using parameters obtained at 1000 simulation runs. Here we see that both now produce accurate simulations of case data that closely match the real-world case data, though P-ABC-SMC BDSS was already generating comparable plots at just 10 simulation runs (Fig. 8b). Hence, the proposed algorithm can be terminated much sooner to obtain results of the same (or better) quality than P-ABC-SMC MCMC.

¹ These values are computed by excluding the outlier trial with tolerance 9.7 in the case of P-ABC-SMC MCMC in Fig. 7e.

Fig. 9: Runtime comparison between P-ABC-SMC MCMC and BDSS. The plot shows the distribution of per-simulation-run runtimes of both algorithms, for a batch size of 100k, across 10 trials. The runtimes are very close to each other, with median values around ∼170 ms; the MCMC runtime has a slightly higher variance. The X's denote the mean runtimes.

The experiments were run on a Tesla T4 GPU on Google Colaboratory.² The goal here is to compare the relative runtime performance of the two algorithms, rather than to observe the absolute best performance numbers, which could be achieved by using higher-end GPUs. We observe no significant difference in performance across the two ABC-SMC implementations: in both P-ABC-SMC MCMC and BDSS, we observe an average of ∼170 ms per 100k-simulation run (see Fig. 9). Hence, with similar runtimes, the simulation-efficiency benefits described in Section 4.1 translate directly to faster runtime, as P-ABC-SMC BDSS achieves better tolerance values in 10 simulation runs than P-ABC-SMC MCMC does in ∼1300 runs.

² https://colab.research.google.com/
We also observe that in both cases, around 50% of the runtime is spent on the actual simulations (i.e., the ABC part of ABC-SMC), while the other half is spent on all the other aspects, such as resampling from the previous stage, MCMC/BDSS step tuning, perturbations, accept/reject, sample post-processing, etc. Specifically, P-ABC-SMC MCMC reported 84 ms of time for the actual simulations, while BDSS reported 84.5 ms.

In this work, we introduce a novel algorithm that is built specifically to utilize massively parallel hardware architectures such as GPUs, enabling ∼100x fewer simulations than the current state of the art while providing better quality and more consistent results across independent trials. The key contribution is the algorithmic allocation of step sizes across arbitrarily large parallel optimization processes. This algorithm provides a great avenue for utilizing widely available GPU computing resources for scientific simulation models and cutting-edge simulation-based inference techniques such as ABC-SMC.

When we consider the regime of massively parallel simulations, P-ABC-SMC with the MCMC-based approach has two main limitations. First, the notion of tuning a single step size along the process and then sampling a large number of perturbations in the parameter space with that single step size is inefficient. Second, once the step size gets tuned to lower values with MCMC, the optimization process slows down. This leads to scenarios where the P-ABC-SMC process gets stuck in a local minimum, and it also leads to inconsistent results across independent trials. On the other hand, P-ABC-SMC BDSS, through the tuned Beta distribution over step sizes, introduces a much wider variety of parameter perturbations across the parallel simulations. This causes the P-ABC-SMC algorithm to cover a wider range of possible good parameter values, which in turn leads to better results faster. We also see that, through the tuning process, the proposed algorithm produces highly consistent results across independent trials, better than those of P-ABC-SMC MCMC and in a much smaller number of simulation runs. Hence, through the ABC-SMC process, the parallel simulations in hardware (P) and the proposed Beta-distributed step-size allocation algorithm (BDSS) together form a much better approach to accelerating simulation-based inference for scientific simulation models (P-ABC-SMC BDSS). The effectiveness of this algorithm is exemplified in Fig. 8, which shows that it has to be run for only 10 runs to obtain results of similar or better quality than the 1000+ runs of P-ABC-SMC MCMC. The obtained results are also more consistent, with ∼80x lower variance across independent trials.

While we demonstrate the effectiveness of this algorithm by performing the novel ABC-SMC inference on a specific compartmental epidemiological model for COVID-19, we believe that the benefits would also translate to the use of this novel ABC-SMC with other scientific models, as the only major change required would be the development of a parallelized version of the scientific simulation model whose parameter distributions we wish to learn. This can be obtained via the 'parallelization-through-tensorization' approach detailed in [14], [15].
The key idea is to take a single execution trace of the scientific model's simulation for a specific parameter configuration, and then to add an additional dimension to (i.e., tensorize) this trace so that the same simulation trace is performed for an arbitrarily large number of parameter configurations. These scientific simulation models, parallelized in this fashion, could come from a vast range of scientific domains, from elementary particle physics [1] to cosmology [4], [22] (and everything in between [12]). Furthermore, the results obtained in this work are promising not only for the ABC-SMC algorithm, but also for stochastic optimization algorithms in general. It is theoretically possible to generalize the concept of parallel step-size allocation through a tuned step-size distribution to other stochastic optimization algorithms such as gradient descent, where the size of the step taken in the direction of the gradient can be sampled from a distribution instead of being a single value.

Acknowledgments: The authors would like to thank Facebook Research for their 'Probability and Programming' grant to support this work.

References
[1] The CMS experiment at the CERN LHC.
[2] Estimating kinetic constants in the Michaelis-Menten model from one enzymatic assay using approximate Bayesian computation.
[3] Hindsight is 2020 vision: Characterisation of the global response to the COVID-19 pandemic.
[4] Likelihood-free cosmological inference with type Ia supernovae: approximate Bayesian computation for a complete treatment of uncertainty.
[5] Synthetic Gaia surveys from the FIRE cosmological simulations of Milky Way-mass galaxies.
[6] Parameter inference for computational cognitive models with approximate Bayesian computation.
[7] Accurate methods for approximate Bayesian computation filtering.
[8] Approximate Bayesian computation in evolution and ecology.
[9] Approximate Bayesian computation with deep learning supports a third archaic introgression in Asia and Oceania.
[10] A framework for parameter estimation and model selection from experimental data in systems biology using approximate Bayesian computation.
[11] Markov chain Monte Carlo: an introduction for epidemiologists.
[12] The frontier of simulation-based inference.
[13] covid19-autoreg-model (GitHub repository).
[14] Accelerating simulation-based inference with emerging AI hardware.
[15] Hardware-accelerated simulation-based inference of stochastic epidemiology models for COVID-19.
[16] Approximate accelerated stochastic simulation of chemically reacting systems.
[17] An interactive web-based dashboard to track COVID-19 in real time.
[18] Approximate Bayesian computation.
[19] Estimation of parameters for macroparasite population evolution using approximate Bayesian computation.
[20] Monte Carlo sampling methods using Markov chains and their applications.
[21] Auto-encoding variational Bayes.
[22] High-resolution simulations of non-Gaussian cosmic microwave background maps in spherical coordinates.

Biography: Csaba Andras Moritz has consulted for several technology companies in Scandinavia and has held industrial positions ranging from CEO to CTO to founder. His company, BlueRISC Inc., develops security microprocessors, hardware-assisted security, and system-assurance solutions for anti-tamper and cyber defense. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst. His work spans new models of computing and associated nanoscale integrated circuits. His other interests are in cognitive cyber-security (bluerisc.com) and privacy (eprivo.com).