Forecasting the Olympic medal distribution during a pandemic: a socio-economic machine learning model
Christoph Schlembach, Sascha L. Schmidt, Dominik Schreyer, Linus Wunderlich
December 8, 2020

Forecasting the number of Olympic medals for each nation is highly relevant for different stakeholders: Ex ante, sports betting companies can determine the odds, while sponsors and media companies can allocate their resources to promising teams. Ex post, sports politicians and managers can benchmark the performance of their teams and evaluate the drivers of success. To significantly increase the Olympic medal forecasting accuracy, we apply machine learning, more specifically a two-staged Random Forest, and thus outperform the more traditional naïve forecast for three previous Olympics held between 2008 and 2016 for the first time. Regarding the Tokyo 2020 Games in 2021, our model suggests that the United States will lead the Olympic medal table, winning 120 medals, followed by China (87) and Great Britain (74). Intriguingly, we predict that the current COVID-19 pandemic will not significantly alter the medal count, both because all countries suffer from the pandemic to some extent (data inherent) and because of the limited number of historical data points on comparable diseases (model inherent).

Forecasting based on socio-economic indicators has a long tradition in academia, in particular in the social sciences. As Johnston (1970, p. 184) noted early on, validating the associated "social projections [empirically] may serve to generate appropriate policies or programs whereby we can avoid the pitfalls which would otherwise reduce or eliminate our freedom of action." Consequently, ever since, there have been endeavours to predict the future, for example in the fields of economics (Modis, 2013), public health (Puertas et al., 2020), civil engineering (Kankal et al., 2011), ecology (Behrang et al., 2011), and urban planning (Beigl et al., 2004). In the economic literature, in particular, accurately forecasting Olympic performances has gained considerable research interest over the last decades (cf. Leeds, 2019), primarily because such forecasts, typically medal forecasts, are necessary to provide both a government and its citizens with a benchmark against which they can evaluate the nation's Olympic success ex post. For a government, often investing heavily in athlete training programs to enhance the probability of a nation's Olympic success (cf. Humphreys et al., 2018), such an assessment is pivotal because it allows it to understand better whether the application of funds, i.e. taxpayers' money, to its National Olympic Committee (NOC) is productive. For instance, as major sporting events such as the Olympic Games are often associated with increasing both national pride among their citizens (cf. Ball, 1972; Grimes et al., 1974; Allison and Monnington, 2002; Hoffmann et al., 2002; Tcha and Pershin, 2003) and their willingness to begin engaging in sporting activities (cf. Girginov and Hills, 2013; Weed et al., 2015), thereby reducing long-term healthcare costs, a government might be willing to raise funds if its NOC meets (or exceeds) the medal forecasts.
In contrast, because Olympic success is a well-known antecedent of civic willingness to support funding a government's elite athlete training programs (Humphreys et al., 2018), falling behind the predictions might motivate a government to increase the pressure on the NOC, not least by reducing future funds. Likewise, accurately forecasting Olympic success is highly relevant for many different non-governmental stakeholders. For instance, sports betting companies rely on precise estimates to determine their odds, while both the media and Olympic sponsors must allocate their resources to promising teams and their athletes. Thus, analysing the Olympic Games empirically has become a relevant field of research, both with a focus on forecasting (e.g. De Bosscher et al., 2006) and beyond (e.g. Streicher et al., 2020). Since the first contribution by Ball (1972), the quality of such Olympic forecasts has steadily improved for two reasons. First, those authors interested in predicting a nation's Olympic success have successively begun employing new estimation techniques. Second, over time, the predictive power of models has gradually increased as authors operating in the field have explored diverse, increasingly extensive data sets. Since Ball (1972) pioneered with a correlation-based scoring model, forecasting models have continuously become more sophisticated. Initially, as we show in Table A1 in the appendix, most authors relied on ordinary least squares (OLS) regressions, as these delivered results that were easy to interpret (e.g. Baimbridge, 1998; Condon et al., 1999; Kuper and Sterken, 2001). However, a significant challenge when predicting Olympic medals is to properly reflect the large number of nations without any medal success. Because the incorporated exponential function punishes small predicted medal counts, some authors (e.g. Lui and Suen, 2008; Leeds and Leeds, 2012; Blais-Morisset et al., 2017) then moved to Poisson-based models (i.e., a Poisson or negative binomial model) to tackle this methodological problem. However, until today, because the dependent variable, typically the number of medals, is censored, most authors have employed Tobit regression to predict Olympic success (e.g. Tcha and Pershin, 2003; Forrest et al., 2015; Rewilak, 2021). Only recently has a two-step approach, which estimates the probability of winning any medal before determining the exact number of medals in case of success, become more popular. In particular, both Scelles et al. (2020) and Rewilak (2021), employing a Mundlak transformation of the Tobit model, could again increase the prediction accuracy with their respective Hurdle models. In contrast, other authors, e.g. Hoffmann et al. (2002), circumvent the underlying methodological problems by splitting their sample into nations who did and did not win any medals in the past, while few authors have employed alternative methodological approaches. 1 However, despite these methodological improvements, a naïve forecast still regularly outperforms all these previous forecasting approaches. Somewhat similarly, over the last years, authors have significantly expanded the data sets used for medal forecasting in three ways: first, by increasing the level of granularity beyond country-specifics; second, by including more years; and third, by exploring additional independent variables.
As a common way to incorporate more granular data, and thus to increase the forecast accuracy, some authors considered predicting Olympic success by focussing on different sports (e.g. Tcha and Pershin, 2003; Noland and Stahler, 2016a; Vagenas and Palaiothodorou, 2019), sometimes even exploring data on the level of the individual athlete (Condon et al., 1999; Johnson and Ali, 2004). Due to the increasing relevance of gender studies, other authors have begun differentiating their data sets by gender (Leeds and Leeds, 2012; Lowen et al., 2016; Noland and Stahler, 2016b). Notably, both approaches are certainly important for answering very specific questions. Yet, macro-level models, in contrast, have the "advantage of averaging the random component inherent in individual competition [leading to] more accurate predictions of national medal totals" (Bernard and Busse, 2004, p. 413). Thus, macro-level analysis remains a frequently used approach in Olympic medal forecasting. A different approach to potentially increase the forecast accuracy is to expand the data set's temporal dimension. As such, some authors have incorporated one hundred years of Olympic Games or more in their models (e.g. Baimbridge, 1998; Kuper and Sterken, 2001; Trivedi and Zimmer, 2014). However, more recently, most authors seem to limit the number of events under investigation for three reasons. First, specific incidents such as "large-scale boycotts as occurred at the 1980 Moscow and 1984 Los Angeles Games" (Noland and Stahler, 2016b, p. 178) and "the East German doping program [, which] was responsible for 17 percent of the medals awarded to female athletes" (Noland and Stahler, 2016b, p. 178) in 1972 skewed the medal count in the past. Second, borders shifted, particularly in the course of the two World Wars and the breakdown of the Soviet Union; only since the 1990s have nations remained relatively stable (Forrest et al., 2017). Third, the significance of variables changed over time, such that, for instance, a potential host effect might play a different role in times when international travel has become part of our daily lives (Forrest et al., 2017). Finally, another way to augment a data set and, thus, to improve the accuracy of a forecast is to incorporate additional independent variables. Early on, Ball (1972) found that "[g]ame success is related to the possession of resources, both human and economic, and the centralized forms of political decision-making and authority which maximize their allocation" (p. 198). In particular, the extraordinary Olympic performance of countries with a certain political system, at that time the Soviet Union, has been confirmed by research up to today (e.g. Scelles et al., 2020). Other authors (cf. Kuper and Sterken, 2001; Hoffmann et al., 2002; Bernard and Busse, 2004; Johnson and Ali, 2004) found that hosting the Games increases the expected number of medals, among other reasons due to an increased number of fans and reduced stress from international travel. Maennig and Wellbrock (2008), Forrest et al. (2010) and Vagenas and Vlachokyriakou (2012) extended this finding and concluded that such a host effect already starts four years before the Olympic Games and, surprisingly, lasts until the subsequent Games. On a similar note, authors found a continuous over- or underperformance of nations, such that lagged medal shares significantly improve the prediction accuracy (cf. Bernard and Busse, 2004).
2 In addition, it is important to mention that many scholars experimented with additional variables such as climate (Hoffmann et al., 2002; Johnson and Ali, 2004), public spending on recreation (Forrest et al., 2010), health expenditure, growth rate, unemployment (Vagenas and Vlachokyriakou, 2012), and income (Kuper and Sterken, 2001). In general, there are mixed findings on most of these variables, and only a few are publicly available as comprehensive data sets. In this regard, De Bosscher et al. (2006) conducted a meta-analysis of variables predicting sporting success, even beyond the Olympic Summer Games, and found that both the Gross National Product and the country's population "consistently explain over 50% of the total variance of international sporting success" (p. 188). It is, therefore, not surprising that these two variables, in particular, have been used by most authors in the past 50 years. As such, also taking into consideration the potential issue of multicollinearity when exploring too many distinct, though potentially related, socio-economic variables, forecasting models should not be augmented infinitely. Given the high policy relevance of, and academic attention to, Olympic forecasting, it is somewhat surprising that the potential of machine learning in detecting hidden patterns and, thus, improving forecasting accuracy has not yet been exhausted in this context. However, this methodology has recently gained increasing popularity in a sports context, e.g. in football (Baboota and Kaur, 2019). Particularly the Random Forest approach often delivers excellent results, for instance in forecasting football scores (Groll et al., 2019) or horseracing outcomes (Lessmann et al., 2010). As acknowledged by Makridakis et al. (2020), statistical knowledge can be applied in the world of machine learning as well. As such, in this study, we translate the proven concept of the Tobit model to machine learning by using a two-staged Random Forest model to predict Olympic performance. In that way, we identify the first model to consistently outperform a naïve forecasting model in three consecutive Summer Games (2008, 2012, and 2016) by about 3 to 4 percentage points. On a side note, we thus also improve the forecasting accuracy presented in more recent work on the potential determinants of Olympic success (Scelles et al., 2020) by roughly 20 percent. 3 The remainder of our manuscript is structured as follows: After motivating the variables used in the model and introducing the concept of a two-staged Random Forest, we evaluate the quality of the forecast, present an estimate for Tokyo 2020, and discuss the implications of COVID-19. We conclude with a summary, ex-ante and ex-post consequences of the prediction, and an outlook for further research. We forecast the number of medals in the Tokyo Olympic Games for each participating nation based on a two-staged Random Forest. It is, however, important to note that, as part of this exercise, we also quantify the impact of COVID-19 on the expected Olympic medal count based on the independent variables (i.e., features) national GDP and incidents of and deaths from lower respiratory diseases. In this section, we motivate the underlying variables, explain the concept of a two-staged Random Forest, and describe the forecasting process. 3 The significant increase in predictive accuracy is based on two effects: First, Scelles et al. (2020) build on a generalized linear model in the form of a Hurdle, respectively Tobit, model.
We, in contrast, apply a two-staged Random Forest algorithm taking into account more complex, non-linear interactions. Second, it is often argued that the time to prepare an Olympic team is four years (cf. Forrest et al., 2010; Scelles et al., 2020). This would imply that, ideally, only socio-economic data until 2016 should be used to predict the Tokyo 2020 results. However, Stekler et al. (2010) evaluate different sports forecasting methodologies and find that more recent data generate better results. Thus, we include data until 2020 in our model to overcome this issue, which is even amplified by the WHO's decision to declare COVID-19 a pandemic (Cascella et al., 2020) and the subsequent postponement of the Games to 2021 (International Olympic Committee, 2020). The number of Olympic medals represents economic and political strength and promotes national prestige (Allison and Monnington, 2002). De Bosscher et al. (2008, p. 19) acknowledge the figure as "the most self-evident and transparent measure of success in high performance sport". Like most scholars (e.g. Andreff et al., 2008; Scelles et al., 2020), we define the number of medals as the dependent variable without distinguishing between gold, silver, and bronze medals. Following Choi et al. (2019), who note that a log-transformation can reduce the skewness and, thus, improve prediction accuracy in machine learning, we take the logarithm of the number of medals, which reduces the (right-)skewness from 3.2 to 0.4 (computed only among non-zero medal counts, given the definition of the logarithm). As the independent variables do not change at the same rate as the Olympic medal totals, we cannot expect an exact match between the forecast and the actual medals at stake. Thus, we need to rescale the predictions to the number of scheduled events times three (assuming no double bronze medals). Further, rounding is necessary to obtain natural numbers. The predictive power of GDP for sporting success is widely accepted in Olympic medal forecasting (cf. Bernard and Busse, 2004) and robust across both geographies (cf. Manuel Luiz and Fadal, 2011) and sports (e.g. Klobučník et al., 2019). 4 Several reasonable explanations include that richer nations invest larger sums in sports, provide more extensive sports offerings, and cater for better overall fitness among the population (De Bosscher et al., 2008). Due to limited data availability and high allocation complexity on a more granular level, aggregate figures, such as GDP, have become a de facto standard in academia (De Bosscher et al., 2006; Manuel Luiz and Fadal, 2011). To account for the character of the Olympic Games as a competition, we normalize GDP and use a nation's share of global GDP as a feature. Besides GDP, the population of a nation is a well-established predictor of Olympic medals (Bernard and Busse, 2004; De Bosscher et al., 2008); that is, larger countries have a larger pool of potential medal winners (Bernard and Busse, 2004). As the number of world-class athletes in a country, however, is exhausted at some point and population alone does not lead to more medals anymore, 5 we take the logarithm of the population, which grows more slowly than a linear function. We also reflect the number of participating athletes in the model. Scelles et al. (2020) suggest the use of categorical variables for the number of athletes. Here, the rationale is that the final number of competitors is generally not known at the time of forecasting. Furthermore, the categories suggested by them have rarely changed in the past.
As an example, Afghanistan has always sent between zero and nine athletes since 1992. Before forming these groups, we count athletes who started in multiple disciplines multiple times, as their chances of winning a medal multiply. While existing research confirms an impact of specific socio-economic variables on the number of medals in the Olympic Games, a connection between public health crises and sporting performance was not to be presumed before the COVID-19 crisis.
5 For instance, in India, the world's second-largest country based on population, the significant growth of the population between 1992 and 2016 (CAGR: 1.58%) was hardly converted into Olympic medals; while India won zero medals in 1992, the number had not increased significantly by 2016 (two medals). As a comparison, China, a country with a similar population, won 70 medals in 2016. For an in-depth analysis of the Olympic performance of India, please refer to Krishna and Haglund (2008).
Yet, this pandemic not only led to the postponement of the Games to 2021 (International Olympic Committee, 2020) but also affected the athletes' preparation (Mohr et al., 2020; Mon-López et al., 2020; Wong et al., 2020), as well as the funds available in the sports industry (Hammerschmidt et al., 2021; Horky, 2021; Parnell et al., 2021). We reflect the impact of COVID-19 via incidents of and deaths from lower respiratory diseases, as well as GDP. We categorise the first two features into quintiles to limit the effect of potential outliers. The broad availability of data allows us to create a synthetic "no COVID-19" scenario by eliminating COVID-19 incidents and deaths, and by leveraging GDP forecasts made before the beginning of the pandemic; hence, we can quantify the impact of COVID-19 on Olympic medals. In his seminal contribution, Ball (1972, p. 191) already mentioned that "hosts [of Olympic Games] are more successful, at least in part because of their ability to enter larger than usual teams at relatively low financial expenditure". Social pressure on the decision making of participants might also explain such a "host effect" (Garicano et al., 2005; Dohmen and Sauermann, 2016; Bryson et al., 2021). We, therefore, include a categorical variable for past, current, and future host countries. Bernard and Busse (2004) detected that Soviet countries regularly outperformed their expected medal success due to the essential role of sports in the communist regimes. Starting early on and combining competitive sports and education was an essential component in their strategy (Metsä-Tokila, 2002). Reflecting such peculiarities in political systems, we use the trichotomy of capitalist market economies, (post-)communist economies, and Central Eastern European countries that joined the EU, as refined by Scelles et al. (2020). Further, geographic characteristics determine the capabilities of succeeding in a given sport because of culture, tradition, and climate (Hoffmann et al., 2002). Consequently, we use the 21 regions defined by the United Nations, Department of Economic and Social Affairs (2020) as a categorical independent variable. Finally, as recommended by Scelles et al. (2020) and Celik and Gius (2014), the number of medals at the preceding Olympics (non-logarithmic) is added to the model, as it significantly improves the predictive power.
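The feature set described above can be assembled as sketched below. This is a minimal illustration only: all column names, as well as the intermediate team-size category bounds, are hypothetical placeholders rather than the exact variables of the original data set.

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Assemble model features for one Olympic year (one row per nation).

    Hypothetical raw columns: medals, medals_prev, gdp, population, athletes,
    resp_incidents, resp_deaths, host_status, political_system, region.
    """
    out = pd.DataFrame(index=df.index)

    # Dependent variable: logarithm of the medal count (non-zero medals only).
    out["log_medals"] = np.log(df["medals"].where(df["medals"] > 0))

    # Share of global GDP instead of absolute GDP (competition character).
    out["gdp_share"] = df["gdp"] / df["gdp"].sum()

    # Logarithm of population: diminishing returns of a larger talent pool.
    out["log_population"] = np.log(df["population"])

    # Team size as a categorical variable; only the 0-9 and 150+ bounds come
    # from the text, the middle cut points are illustrative.
    out["team_size_cat"] = pd.cut(
        df["athletes"], bins=[-1, 9, 49, 149, np.inf],
        labels=["0-9", "10-49", "50-149", "150+"],
    )

    # Lower-respiratory-disease incidents and deaths in quintiles (limits outliers).
    out["resp_incidents_q"] = pd.qcut(df["resp_incidents"], 5, labels=False, duplicates="drop")
    out["resp_deaths_q"] = pd.qcut(df["resp_deaths"], 5, labels=False, duplicates="drop")

    # Categorical context variables and the lagged (non-logarithmic) medal count.
    for col in ["host_status", "political_system", "region"]:
        out[col] = df[col].astype("category")
    out["medals_prev"] = df["medals_prev"]
    return out
```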
The predictive value of the previous medal count suggests that there are some unconsidered country-specific factors, which may "include a nation's athletic tradition, the health of the populace, and geographic or weather conditions that allow for greater participation in certain athletic events" (Celik and Gius, 2014, p. 40). We display the descriptive statistics of the numerical variables used in the model in Table 1 and list the ordinal and categorical variables in Table 2. As a rule of thumb, Stekler et al. (2010) find that more recent data generate better results in sports forecasting. Thus, we leverage data from one year prior to the Olympics to make a forecast. Being able to retrieve data points for 206 countries between 1991 and 2020, we can feed our models with 1,379 country-year observations. In this case, our prediction applies to athletes from Russia regardless of the name of their team. We split the athletes (269) and medals (7) of the former Czechoslovakia into the Czech Republic (178 / 5) and Slovakia (91 / 2) based on their respective populations; this allows adequate forecasts for the two nations that emerged from Czechoslovakia. The Refugee Olympic Team (12 athletes in 2016) has not won a medal yet. Hence, we assume a constant forecast, meaning that there will be no medals in 2021 either. We obtain missing data points in a specific year by inter-/extrapolation, which is a common approach when pre-processing data (e.g. Christodoulos et al., 2010; Chen et al., 2019). While we interpolate linearly if certain years between two data points are missing, we extrapolate by assuming constant numbers based on the first (respectively last) value in the dataset; this way, we do not misinterpret local events (Scott Armstrong and Collopy, 1993). Exceptions are GDP and population, which typically exhibit consistent growth; linear extrapolation is sensible here. For nations where no more than five consecutive points are missing and more data points are available than missing, we extrapolate linearly to take the implicit trend into account, using a constrained least-squares approach: with n < 6 missing values, we use the n + 1 nearest available values to estimate the slope of the line; the intercept is given by the nearest available value. If there are no data points available for a country at all, we leverage the average of the respective region (United Nations, Department of Economic and Social Affairs, 2020) as a benchmark. The rationale here is that countries within one region also share socio-economic characteristics, such as economic strength. Yet, this approach is only necessary for some variables of some smaller countries, which are responsible for 1% of total Olympic medals. The Tobit model marked a milestone in Olympic medal forecasting by accounting for the large number of nations winning zero medals (Bernard and Busse, 2004). The concept traces back to Tobin (1958), who argues that, for censored variables, linear regression models do not deliver suitable results. To apply this statistical concept in machine learning, we develop a two-staged algorithm: As a first step, we train a binary classifier to determine whether a nation should win any medals or not. As a second step, we train a regression model to forecast the exact number of medals for countries with predicted medal success. In both steps, we employ a Random Forest algorithm (cf. Lee, 2021), an ensemble learner which has been proven advantageous in various disciplines of sports forecasting, both from a computational point of view (among others, due to its direct use for high-dimensional problems) and from a statistical one (among others, due to measures of variable importance, differential class weighting, and outlier detection).
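A minimal sketch of this two-staged approach with scikit-learn is shown below, assuming a feature matrix X and a vector y of medal counts; the tree counts (10 for the classifier, 1,000 for the regressor) follow the configuration reported in the benchmarking below, while everything else is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

class TwoStagedRandomForest:
    """Stage 1 classifies whether a nation wins any medal at all;
    stage 2 regresses the log medal count for the predicted medal winners."""

    def __init__(self, n_trees_clf=10, n_trees_reg=1000, seed=0):
        self.clf = RandomForestClassifier(n_estimators=n_trees_clf, random_state=seed)
        self.reg = RandomForestRegressor(n_estimators=n_trees_reg, random_state=seed)

    def fit(self, X, y):
        wins = y > 0
        self.clf.fit(X, wins)                   # stage 1: any medal vs. no medal
        self.reg.fit(X[wins], np.log(y[wins]))  # stage 2: log medals, winners only
        return self

    def predict(self, X):
        wins = self.clf.predict(X).astype(bool)
        medals = np.zeros(len(X))
        if wins.any():
            medals[wins] = np.exp(self.reg.predict(X[wins]))
        return medals  # raw predictions, still to be rescaled and rounded
```

The raw predictions would then be rescaled to the number of scheduled events times three and rounded, as described above.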
The main shortcoming of (individual) decision trees is that they are prone to overfitting (Kirasich et al., 2018). Even though Random Forests partly account for this issue "by using a combination or 'ensemble' of decision trees where the values in the tree are a random, independent, sample" (Kirasich et al., 2018, p. 7), a diligent setup of the forecasting process is essential. For this reason, we apply a rigorous time-consistent data separation to avoid cases of overfitting and obtain impartial and robust results (Dwork et al., 2015; Roelofs et al., 2019). Cross-validation splits the dataset into training and testing data; while the training data determines the model, the testing data ensures validity beyond a fixed data sample (Kerbaa et al., 2019; Li et al., 2020). In the case of time series, as presented in this paper, it is essential to consider the temporal evolution and dependencies in the data; therefore, Bergmeir and Benítez (2012) suggest last-block cross-validation, a special case of cross-validation using the most recent data points as testing data. More specifically, we use data collected from the years 1991 to 2004 as the training set, and data from the 2008 Olympic Games as the test set, to evaluate and compare the performance of distinct models. Only then do we evaluate the final model on the validation set, which includes data from the 2012 and 2016 Olympic Games. We illustrate the overall forecasting process in Figure 1. To demonstrate the performance of the previously introduced two-staged Random Forest, we benchmark the model against others based on the share of nations with correctly forecast medals in the 2008 Olympic Games (training of different models). For the first step of the model, as classifier, we also consider a support vector machine and Random Forests with one, a hundred, and a thousand trees. For the second step, the regression, we benchmark against a range of classical regressions, boosting methods, and neural networks: As classical regressions, we consider a linear regression, a Support Vector Machine taking into account non-linear transformations of features (Chang and Lin, 2011), and a decision tree regression (Breiman et al., 1984). Boosting methods (Bühlmann and Hothorn, 2007) sequentially fit several instances of a base learner (in this case, decision trees). Each tree compares the output variable to the forecast from the previous step and adapts the setup for the next step based on the error. We include AdaBoost (Freund and Schapire, 1997), which directly takes into account the error, and XGBoost (Chen and Guestrin, 2016), which first transforms the error, as benchmarks. Neural networks (LeCun et al., 2015) were motivated by the structure of a biological brain. They use a computational network, where each node performs a simple transformation and hands over the result to subsequent nodes. In both steps, as classifier (10 trees) and regression model (1,000 trees), the Random Forest outperforms the described algorithms. To further validate this performance, we use the two-staged Random Forest to forecast the 2012 and 2016 Olympic Games (model validation). This is particularly interesting as it allows us to compare our model to past medal forecasts presented in the academic literature.
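A minimal sketch of the time-consistent (last-block) separation and the evaluation step described above, assuming a pandas DataFrame with a year column and a list of feature columns; all names are illustrative.

```python
import pandas as pd

def last_block_split(df: pd.DataFrame, feature_cols, target_col="medals"):
    """Time-consistent separation: train on 1991-2004, test on the 2008 Games,
    validate on the 2012 and 2016 Games (no shuffling across time)."""
    def xy(part):
        return part[feature_cols].to_numpy(), part[target_col].to_numpy()
    train = xy(df[df["year"] <= 2004])
    test = xy(df[df["year"] == 2008])
    valid = xy(df[df["year"].isin([2012, 2016])])
    return train, test, valid

def share_correct(y_true, y_pred) -> float:
    """Share of nations whose medal count is forecast exactly (after rounding)."""
    return float((y_true == y_pred.round()).mean())

# Candidate classifiers and regressors would be fitted on the training block and
# compared on the 2008 test block only; the validation block (2012, 2016) is
# touched once, for the final two-staged Random Forest.
```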
When computing our estimates, we use the same data points that had already been available when the respective papers were developed. Finally, we make predictions for the 2020 Olympic Games (forecasting); this time, we use data containing the 2016 Olympics as well. While we show these results in Table 4, none of this additional information, i.e. the data on the 2016 Olympics, is used in the forecasts summarized in Table 3. We implement all models in Python 3.8.5 (Oliphant, 2007). While Scelles et al. (2020) improved the prior forecast quality, the presented models still fail to outperform a naïve forecast, i.e. assuming that each country wins exactly the same number of medals as in the previous Olympics. The approach presented in this paper is, to the best of our knowledge, the first to consistently beat the naïve forecast for the 2008, 2012, and 2016 Games (cf. Table 3). Besides the naïve forecast, we include five other models from three different papers in our benchmark. We evaluate the forecast accuracy, i.e. the share of correctly predicted medals, in total and separately for all nations winning zero, respectively more than zero, medals. Furthermore, we check whether the forecast lies in a 95% confidence interval augmented by two medals. Finally, we sum the absolute deviation of forecast and actual medals of the top-17 nations, as suggested by Scelles et al. (2020).
Figure 1. Illustration of the forecasting process.
Table 3. Forecast accuracy for the 2008, 2012, and 2016 Olympic Games (values for the two-staged Random Forest and the naïve forecast refer to 2008 / 2012 / 2016; values for the benchmark models are those reported for the Games analysed in the respective papers).
Correct forecast (total): Tobit model (Forrest et al., 2010) 47%; Tobit model (Andreff et al., 2008) 5%; Logit model (Andreff et al., 2008) 0%; Hurdle model (Scelles et al., 2020) 22%; Tobit model (Scelles et al., 2020) 43%; Tobit model (Maennig and Wellbrock, 2008) 41%; OLS (Celik and Gius, 2014) 10%.
Correct forecast (non-zero medals): Two-staged Random Forest (this paper) 14% / 11% / 17%; naïve forecast 9% / 11% / 16%; Tobit model (Forrest et al., 2010) 17%; Hurdle model (Scelles et al., 2020) 22%; Tobit model (Scelles et al., 2020) 11%; Tobit model (Maennig and Wellbrock, 2008) 11%; OLS (Celik and Gius, 2014) 10%.
Correct forecast (zero medals): Two-staged Random Forest (this paper) 98% / 93% / 97%; naïve forecast 96% / 88% / 92%; Tobit model (Forrest et al., 2010) 94%; Hurdle model (Scelles et al., 2020) 22%; Tobit model (Scelles et al., 2020) 69%; Tobit model (Maennig and Wellbrock, 2008) 83%.
Within 95% confidence intervals +/- 2 medals: Two-staged Random Forest (this paper) 92% / 96% / 93%; Tobit model (Andreff et al., 2008) 60%; Logit model (Andreff et al., 2008) 45%; Hurdle model (Scelles et al., 2020) 93%; Tobit model (Scelles et al., 2020) 91%.
Sum of absolute deviations (top-17 nations): Tobit model (Forrest et al., 2010) 92; Tobit model (Andreff et al., 2008) 135; Logit model (Andreff et al., 2008) 204; Hurdle model (Scelles et al., 2020) 139; Tobit model (Scelles et al., 2020) 138; Tobit model (Maennig and Wellbrock, 2008) 153; OLS (Celik and Gius, 2014) 104.
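A minimal sketch of the naïve benchmark and the evaluation measures summarised above; the array names are hypothetical, and selecting the top-17 nations by actual medal count is an assumption.

```python
import numpy as np

def naive_forecast(prev_medals: np.ndarray) -> np.ndarray:
    """Naïve forecast: every nation wins exactly as many medals as at the previous Games."""
    return prev_medals.copy()

def evaluation_report(actual: np.ndarray, forecast: np.ndarray) -> dict:
    """Share of exactly correct medal forecasts, overall and split by zero / non-zero
    medal nations, plus the summed absolute deviation for the top-17 nations."""
    forecast = np.rint(forecast)
    correct = forecast == actual
    zero = actual == 0
    top17 = np.argsort(-actual)[:17]  # top-17 by actual medal count (assumption)
    return {
        "correct_total": float(correct.mean()),
        "correct_zero_medals": float(correct[zero].mean()),
        "correct_non_zero_medals": float(correct[~zero].mean()),
        "abs_deviation_top17": float(np.abs(forecast - actual)[top17].sum()),
    }
```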
Applying the algorithm in the context of Tokyo 2020, we find that there will be no movement at the top of the medal count compared to the 2016 Olympics (cf. Table 4). While the United States are predicted to defend their top position, the gap to the pursuers China (+17 medals compared to 2016), Great Britain (+7), and Russia (+7) is expected to shrink. To better understand the main drivers behind the forecasts, we use the explanatory SHAP value (Lundberg and Lee, 2017). SHAP stands for "Shapley Additive Explanations" and quantifies the importance of features for a forecast. The SHAP value of one feature describes "the change in the expected model prediction when conditioning on that feature" (Lundberg and Lee, 2017, p. 5); starting from the base value, i.e. the prediction without the knowledge of any features, the combination of all SHAP values then leads to the full model forecast (Lundberg and Lee, 2017). The game-theory-based algorithm dates back to Shapley (1953) and runs in polynomial time (Lundberg et al., 2020). The most important features in our model are the number of medals won at the previous Olympic Games, the categorical variable representing the team size (more than 149 athletes), and the normalized GDP (cf. Figure 2). All of them generally have a positive impact on the number of medals won.
Figure 2. Feature importance of the two-staged Random Forest. Abbreviations and notes: Only the 20 most relevant features are depicted. One dot represents one observation in the training data, i.e. one Olympics-nation combination. Variables are ranked in descending order according to their feature importance. The horizontal location shows whether the effect of the value is associated with a higher or lower prediction. Colour shows whether that variable is high (red) or low (blue) for each observation. A high "Number of medals at previous Olympics" has a high and positive impact on the number of medals at the current Olympics: the "high" comes from the red colour, and the "positive" impact is shown on the x-axis. Similarly, "Disease deaths" is negatively correlated with the dependent variable.
Three of the features are directly impacted by COVID-19: GDP and the incidents of and deaths from lower respiratory diseases. This allows us to create a theoretical scenario without the presence of the pandemic, such that we can clearly quantify its impact (cf. Figure 3).
Figure 3. Individual feature importance for China in the current scenario (top) and in the no-COVID-19 scenario (bottom). Abbreviations and notes: The value for log(Nr. Medals) describes the respective forecast. The base value is the value that would be predicted without any knowledge of the features for the current output. Features that push the prediction higher, i.e. to the right, are shown in red, while those pushing the prediction lower are illustrated in blue.
In contrast, Spain experiences a loss of two medals. Both a relatively smaller GDP and an increased number of incidents of lower respiratory diseases are responsible for this development (cf. Figure 4).
Figure 4. Individual feature importance for Spain in the current scenario (top) and in the no-COVID-19 scenario (bottom). Abbreviations and notes: The value for log(Nr. Medals) describes the respective forecast. The base value is the value that would be predicted without any knowledge of the features for the current output. Features that push the prediction higher, i.e. to the right, are shown in red, while those pushing the prediction lower are illustrated in blue.
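SHAP values of the kind visualised in Figures 2 to 4 can be computed for a fitted tree ensemble with the shap library; a minimal sketch, assuming a fitted second-stage Random Forest regressor reg and a training feature DataFrame X_train.

```python
import shap

# TreeExplainer implements the polynomial-time SHAP algorithm for tree
# ensembles referenced above (Lundberg et al., 2020).
explainer = shap.TreeExplainer(reg)
shap_values = explainer.shap_values(X_train)

# Global view: feature importance across all observations (cf. Figure 2).
shap.summary_plot(shap_values, X_train)

# Local view: contribution of each feature to one nation's forecast,
# starting from the base value (cf. Figures 3 and 4).
shap.force_plot(explainer.expected_value, shap_values[0], X_train.iloc[0], matplotlib=True)
```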
Applying a two-staged Random Forest, we significantly improved the forecasting accuracy for Olympic medals, outperforming a naïve model and the currently existing statistical approaches applied in the academic literature. A forecast of the Tokyo 2020 Olympic Games hosted in 2021 suggests that the United States will lead the medal count, followed by China and Great Britain. China in particular, which has invested heavily in sports development, is likely to exhibit a rise in medals. Our findings are highly relevant for several stakeholders, not only ex ante, i.e. before the Olympics, but also ex post. Ex ante, media companies and sports sponsors could allocate their resources to promising nations that are likely to increase their performance compared to the previous Olympics (e.g. China). As spectators demand stories about Olympic heroes rather than group-stage knock-outs, the right focus when planning documentaries or interviews is essential for the media to reach a high audience share. The same concept holds for sponsors, who profit from signing athletes who are at the centre of high media attention. Sports betting companies offer bets on the Olympic medal count. While they generally apply different datasets than the one used in this paper to determine the odds, the strong performance of the two-staged Random Forest suggests that a detailed comparison of both models and a potential re-calibration might be beneficial. Ex post, sports politicians and managers face the challenge of judging the performance of their teams. Our forecast allows them to detect over- or underperformance against what was to be expected ex ante. Such an evaluation helps to assess the impact of specific investments or training concepts. Subsequently, funds for preparing the team for the next Olympics can be allocated. The forecast shows that COVID-19 hardly impacts the number of medals among the top-20 nations. This is mainly due to the fact that decision trees generally (and hence the two-staged Random Forest) exhibit weaknesses regarding extrapolation, which in this case would be required by the surge in incidents of and deaths from COVID-19 (e.g. Zhao et al., 2020). Training our model with data from the Tokyo 2020 Olympic Games will allow us to quantify the impact of a pandemic like COVID-19 even better. Only then will policy leaders get a reliable picture of the connection between the management of a pandemic and national sporting success. Two ways to further improve the performance of the model are the inclusion of additional features and a novel approach for missing data points: First, socio-economic features (e.g. investments in sports infrastructure), athlete-specific features (e.g. age or disciplines of athletes), and COVID-specific features (e.g. the number of cancelled national sports events) could deliver additional insights and, thus, might improve the forecasting accuracy. Brown et al. (2018) use social media data to forecast football matches, an approach that could be applied to the Olympic Games as well. However, as machine learning methodologies are prone to overfitting, adding new features is only possible to some extent. Second, while we use inter- and extrapolation to handle missing data points, Hassan et al. (2009) generate missing values using their probability distribution function. This approach outperforms the conventional mean-substitution approach; however, its superiority to inter- and extrapolation, as applied in this paper, still needs to be proven. Besides working on model-specific adjustments, scholars can build upon our research within the scope of new applications in sports forecasting.
As the Olympic Games are not the only important global sports event, both the comprehensive data set and the concept of the two-staged Random Forest presented in this paper can be leveraged in the context of other competitions, e.g. the Football World Cup, as well.
Complete forecast medal count of the Olympic Games Tokyo 2020, including 95% confidence intervals (scenarios with and without COVID-19).
References
Sport, Prestige and International Relations
Les déterminants économiques de la performance olympiques: prévision des médailles qui seront gagnées aux Jeux de Pékin
Predictive analysis and modelling football results using machine learning approach for English Premier League
Outcome uncertainty in sporting competition: the Olympic Games 1896-1996
Olympic Games Competition: Structural Correlates of National Success
Using bees algorithm and artificial neural network to forecast world carbon dioxide emission. Energy Sources, Part A: Recovery, Utilization
Forecasting municipal solid waste generation in major European cities
On the use of cross-validation for time series predictor evaluation
Who wins the Olympic Games: Economic resources and medal totals
The Impact of Public Investment in Sports on the Olympic Medals
Random forests
Classification and regression trees
Forecasting With Social Media: Evidence From Tweets on Soccer Matches
Causal effects of an absent crowd on performances and refereeing decisions during Covid-19
Boosting algorithms: Regularization, prediction and model fitting
Features, evaluation and treatment coronavirus (COVID-19). Statpearls, 2020
Estimating the determinants of summer Olympic game performance
LIBSVM: a library for support vector machines
A hybrid PSO-SVM model based on clustering algorithm for short-term atmospheric pollutant concentration forecasting
XGBoost: A Scalable Tree Boosting System
Modelling Chlorophyll-a Concentration using Deep Neural Networks considering Extreme Data Imbalance and Skewness
Forecasting with limited data: Combining ARIMA and diffusion models
Predicting the success of nations at the Summer Olympics using neural networks
Random forests
The paradox of measuring success of nations in elite sport
A Conceptual Framework for Analysing Sports Policy Factors Leading to International Sporting Success
Referee Bias
Preserving Statistical Validity in Adaptive Data Analysis, STOC '15: Proceedings of the forty-seventh annual ACM symposium on Theory of Computing
On the determinants of sporting success - A note on the Olympic Games
Determinants of national medals totals at the summer Olympic Games: an analysis disaggregated by sport
An analysis of country medal shares in individual sports at the Olympics
Forecasting national team medal totals at the Summer Olympic Games
A decision-theoretic generalization of on-line learning and an application to boosting
Favoritism Under Social Pressure
A sustainable sports legacy: Creating a link between the London Olympics and sports participation
Global Burden of Disease Study
120 years of Olympic history: athletes and results
A socioeconomic model of national Olympic performance
A hybrid random forest to predict soccer matches in international tournaments
Professional football clubs and empirical evidence from the COVID-19 crisis: Time for sport entrepreneurship?
Novel ensemble techniques for regression with missing data
The tip of the iceberg: The Russian doping scandal reveals a widespread doping problem
Public policy and olympic success
No sports, no spectators - no media, no money? The importance of spectators and broadcasting for professional sports during COVID-19
Estimating the value of medal success in the Olympic Games
COVID-19 Mortality, Infection, Testing, Hospital Resource Use, and Social Distancing Projections
World Economic Outlook Database
International Olympic Committee, 2020. Press statement on March 30th
A tale of two seasons: participation and medal counts at the Summer and Winter Olympic Games
Forecasting methods in the social sciences
Modeling and forecasting of Turkey's energy consumption using socio-economic and demographic variables
Model Selection of Sea Clutter Using Cross Validation Method
Random Forest vs logistic regression: binary classification for heterogeneous datasets
Football clubs' sports performance in the context of their market value and GDP in the European Union regions
Why do some countries win more Olympic medals? Lessons for social mobility and poverty reduction
Olympic participation and performance since 1896. Available at SSRN 274295
Deep learning
A review of data analytics in technological forecasting
Gold, silver, and bronze: Determining national success in men's and women's Summer Olympic events
Alternative methods of predicting competitive events: An application in horserace betting markets
Network cross-validation by edge sampling
Role of media coverage in mitigating COVID-19 transmission: Evidence from China
Guys and gals going for gold: The role of women's empowerment in Olympic success
Men, money, and medals: An econometric analysis of the Olympic Games
From Local Explanations to Global Understanding with Explainable AI for Trees
A Unified Approach to Interpreting Model Predictions
Forecasting in social settings: The state of the art
An economic analysis of sports performance in Africa
Data Structures for Statistical Computing in Python
Combining Competitive Sports and Education: How Top-Level Sport Became Part of the School System in the Soviet Union, Sweden and Finland
Long-term GDP forecasts and the prospects for growth
Return to elite football after the COVID-19 lockdown
How has COVID-19 modified training and mood in professional and non-professional football players?
Asian Participation and Performance at the Olympic Games
What goes into a medal: Women's inclusion and success at the Olympic Games
An old boys club no more: pluralism in participation and performance at the Olympic Games
Python for Scientific Computing
How Many Trees in a Random Forest?
Football Worlds: business and networks during COVID-19
Scikit-learn: Machine Learning in Python
Innovation, lifestyle, policy and socioeconomic factors: An analysis of European quality of life
The (non) determinants of Olympic success
A Meta-Analysis of Overfitting in Machine Learning
Forecasting National Medal Totals at the Summer Olympic Games Reconsidered
Causal forces: Structuring knowledge for time-series extrapolation
A value for n-person games
Issues in sports forecasting
Anticipated feelings and support for public mega projects: Hosting the Olympic Games
Reconsidering performance at the Summer Olympics and revealed comparative advantage
Economic Policy & Debt: National accounts: US$ at current prices: Aggregate indicators
Estimation of Relationships for Limited Dependent Variables
Success at the summer Olympics: How much do economic factors explain?
United Nations, Department of Economic and Social Affairs
United Nations, Department of Economic and Social Affairs, 2020. Standard country or area codes for statistical use (M49)
Climatic origin is unrelated to national Olympic success and specialization: an analysis of six successive games (1996-2016) using 12 dissimilar sports categories
Olympic medals and demo-economic factors: Novel predictors, the ex-host effect, the exact role of team size, and the "population-GDP" model revisited
The NumPy Array: A Structure for Efficient Numerical Computation
Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations
The Olympic Games and raising sport participation: a systematic review of evidence and an interrogation of policy for a demonstration effect
List of Olympic Games host cities
Impact of the COVID-19 pandemic on sports and exercise
World Health Organization, 2020. WHO Coronavirus Disease (COVID-19) Dashboard
Prediction and behavioral analysis of travel mode choice: A comparison of machine learning and logit models
Table A1. Previous Olympic medal forecasting studies and the Games covered (S = Summer, W = Winter): Ball (1972): 1964 (S); Grimes et al. (1974): 1936, 1972 (S); Baimbridge (1998): 1896-1996 (S); Condon et al. (1999): 1996 (S); Kuper and Sterken (2001): 1896-2000 (S); Hoffmann et al. (2002): 2000 (S); Tcha and Pershin (2003): 1988-1996 (S); Johnson and Ali (2004): 1952-2000 (S, W); Bernard and Busse (2004): 1960-1996 (S); Lui and Suen (2008): 1952