key: cord-0637108-fv531bj0 authors: Price, Katherine; Cai, Hengrui; Shen, Weining; Hu, Guanyu title: How Much Does Home Field Advantage Matter in Soccer Games? A Causal Inference Approach for English Premier League Analysis date: 2022-05-15 journal: nan DOI: nan sha: 645535f60b423218c652f49abbd48999dfbcf364 doc_id: 637108 cord_uid: fv531bj0

* pricekm@mail.missouri.edu † hengruicai@gmail.com ‡ weinings@uci.edu § guanyu.hu@missouri.edu

In many sports, it is commonly believed that the home team has an advantage over the visiting team, known as the home field advantage. Yet its causal effect on team performance is largely unknown. In this paper, we propose a novel causal inference approach to study the causal effect of home field advantage in the English Premier League. We develop a hierarchical causal model and show that both league level and team level causal effects are identifiable and can be conveniently estimated. We further develop an inference procedure for the proposed estimators and demonstrate its excellent numerical performance via simulation studies. We implement our method on the 2020-21 English Premier League data and assess the causal effect of home advantage on eleven summary statistics that measure the offensive and defensive performance and referee bias. We find that the home field advantage resides more heavily in offensive statistics than it does in defensive or referee statistics. We also find evidence that teams that had lower rankings retain a higher home field advantage.

Quantitative analysis in sports analytics has received a great deal of attention in recent years. Traditional statistical research for sports analytics mainly focused on game result prediction, such as predicting the number of goals scored in soccer matches (Dixon and Coles, 1997; Karlis and Ntzoufras, 2003; Baio and Blangiardo, 2010), and the basketball game outcomes (Carlin, 1996; Caudill, 2003; Cattelan et al., 2013).
More recently, fast development in game tracking technologies has greatly improved the quality and variety of collected data sources (Albert et al., 2017), and in turn substantially expanded the role of statistics in sports analytics, including performance evaluation of players and teams (Cervone et al., 2014; Franks et al., 2015; Wu and Bornn, 2018; Hu et al., 2021, 2022), commentators' in-game analysis, and coaches' decision making (Fernandez and Bornn, 2018; Sandholtz et al., 2020).

The home field advantage is a key concept that has been studied across different sports for over a century. The first public attempt to statistically quantify the home field advantage was about 40 years ago (Inan, 2018), when Schwartz and Barsky (1977) studied the existence of home advantage in several sports and found that its effect was most pronounced in indoor sports such as ice hockey and basketball. Since then, statisticians and psychologists have worked to map out this phenomenon, and there is some consensus on it. The percentages reported across studies generally fall around 55-60% in favor of the home team (where 50% would mean there is no advantage either way). Courneya and Carron (1992) found that different sports in general held different advantage percentages: 57.3% for American football versus 69% for soccer, for example. In a study of basketball home court advantage, Calleja-González et al. (2018) found that the visiting team scored 2.8 fewer points, attempted 2 fewer free throws, and made 1 fewer free throw than their season averages. Season length is also shown to have a negative correlation with the strength of home field advantage: sports with 100 games or more per season were shown to have significantly less advantage than sports with fewer than 50 games per season (Jamieson, 2010).
This finding can be explained by the fact that each game becomes less important on average as the number of games increases. Some studies have shown that the home field advantage becomes less significant in the modern era. For example, Jamieson (2010) found that the home field advantage was the strongest in the pre-1950s, as opposed to any other twenty-year block leading up to 2007, the year he conducted the study. In another work, Sánchez et al. (2009) showed that between 2003 and 2013, the home field advantage for the UEFA Champions League decreased by 1.8%, supporting the claim that as time goes on, the home field advantage decreases to a certain level and then stabilizes. In general, the home field advantage tends to deteriorate as the athletes play longer on an away field.

In principle, there are four major factors associated with the home field advantage: crowd involvement, travel fatigue, familiarity of facilities, and referee bias that benefits the home teams. Several studies (Schwartz and Barsky, 1977; Greer, 1983) have shown a positive correlation between the size of the crowd and the effect of the home field advantage, an advantage that can get as high as 12% over a team's opponents. Travel fatigue is another important factor for home field advantage (Jamieson, 2010; Calleja-González et al., 2018; Jehue, 1993), since the athletes have an overall reduction in mean wellness, including a reduction of sleep, self-reported feelings of jet lag, and energy reduction; the importance of sleep and proper recovery cannot be overstated, and it can be difficult to achieve either in unfamiliar settings (e.g., a hotel room or a bus). Furthermore, familiarity of facilities gives the home team a strong advantage over a visiting team (Pollard, 2002). A home team not only avoids traveling to an unfamiliar facility, being away from home, and spending hours in transit; they also get to use their own facility, locker room, and field.
The value of knowing the nooks and crannies of a facility well should not be underestimated. Finally, giving rule advantages to the home team is very common in certain sports (Courneya and Carron, 1992). For example, in baseball the home team gets the advantage of batting last, giving them the final opportunity to score a run in the game. In hockey, the last line change goes to the home team.

Soccer has proven to have one of the highest home field advantages among major sports in a majority of current research. Pollard (2002) states that the home field advantage in soccer is equivalent to 0.6 goals per game and that the visual cues that come from being intimately familiar with a facility can be especially helpful in fast paced sports such as soccer, which could perhaps explain why the home field advantage is less pronounced in slower, stop-start sports such as baseball. Sánchez et al. (2009) observe that the better a soccer team is, the more often the home field advantage appears, after studying home field advantage across groups of variably ranked soccer teams.

Despite the vast amount of existing work on home field advantage quantification in soccer and other sports (Leitner and Richlan, 2020; Benz and Lopez, 2021; Fischer and Haucap, 2021), very little is known about its causal effect on team performance, and it is our goal in this paper to fill this gap. In particular, we study soccer games by analyzing a data set collected from the English Premier League (EPL) 2020-2021 season, where 380 games were played, that is, two games between each pair of 20 teams. We choose eleven team-level summary statistics as the main outcomes that represent a team's performance on the defensive and offensive sides, and the referee bias. We then develop a new causal inference approach for assessing the causal effect of home field advantage on these outcomes. More details about our data application are provided in Section 2.
In the causal inference literature (for observational studies), causal effect estimation of a binary treatment is a classic topic that has been intensively studied (Ding and Li, 2018; Imbens, 2004). Over the past few decades, a number of methods in both statistics and econometrics have been proposed to identify and estimate the average treatment effects of a particular event/treatment (see a recent overview in Athey et al., 2017) under the assumption of ignorability (also known as unconfounded treatment assignment) (Rosenbaum and Rubin, 1983), including regression imputation, (augmented) inverse probability weighting (see e.g., Horvitz and Thompson, 1952; Rosenbaum and Rubin, 1983; Robins et al., 1994; Bang and Robins, 2005; Cao et al., 2009), and matching (see e.g., Rubin, 1973; Rosenbaum, 1989; Heckman et al., 1997; Hansen, 2004; Rubin, 2006; Abadie and Imbens, 2006, 2016).

However, the data structure in soccer games is distinguished from those in the existing literature, leading to two unique challenges in identifying the causal effects of home field advantage. First, the EPL data set is obtained neither from a randomized trial nor from an observational study, as is standard in the causal inference literature, but instead is based on a collection of pair-wise matches between every two teams. For one season, each team will have one home game and one away game with each of the other 19 opponents. All matches are pre-scheduled. Thus, propensity-score-based methods can hardly gain efficiency with such a design. Secondly, for each match, there is one team with home field advantage and the other without, i.e., there are no matches on a neutral field. In fact, this is the common practice for other major professional sports leagues such as the NBA, NFL, and MLB. In other words, there is technically no control group where both teams have no home field advantage. Hence, we cannot rely on matching-based methods to estimate the causal effects.
To overcome these difficulties, in this paper, we establish a hierarchical causal model to characterize the underlying true causal effects at the league and team levels, and propose a novel causal estimation approach for home field advantages with inference procedures. Our proposed method is unique in the following aspects. First, the idea of pairing home and away games for solving causal inference problems is novel. In fact, this idea and our proposed approach are widely applicable to general sports applications such as football, baseball, and basketball studies, and provide a valuable alternative to the existing literature that mainly relies on propensity scores. Secondly, under the proposed hierarchical model framework, both league level and team level causal effects are identifiable and can be conveniently estimated. Moreover, our inference procedure is developed based on linear model theory that is accessible to a wide audience including first-year graduate students in statistics. Thirdly, our real data analysis results reveal several interesting findings from the English Premier League, which may provide new insights to practitioners in the sports industry. It is our hope that the data application presented in this paper as well as the developed statistical methodology can reach a wide audience, be useful for educational purposes in statistics and data science classes, and stimulate new ideas for more causal analysis in sports analytics.

The rest of the paper is organized as follows. In Section 2, we give an overview of our motivating data application and introduce several representative summary outcomes. In Section 3, we introduce the causal inference framework and propose a hierarchical causal model for both team-level and league-level home field advantage effect estimation. Extensive simulation studies are presented in Section 4 to investigate the empirical performance of our approach.
We apply our method to analyze the 2020-2021 English Premier League in Section 5 and conclude with a discussion of future directions in Section 6.

We are interested in studying the causal effect of home field advantage for in-game performance during the English Premier League 2020-2021 season. Obvious as it may sound, the home field advantage is in fact not evident at first glance through the usual descriptive statistics such as the game outcomes and number of scored goals. For example, among the 380 games in that season, the number of home wins is 144 (37.89%), which is even less than the number of away wins, 153 (40.26%). Among the total of 1024 goals scored in that season, 514 (50.20%) were scored by the home team, and 510 (49.80%) by the away team. To better understand this phenomenon, we study a data set provided by Hudl & Wyscout, a company that excels at soccer game scouting and match analysis. The data is collected from 380 games played by 20 teams in the EPL, and includes a number of statistics collected from each game that range across the defensive and offensive capabilities of the home and away teams and players.
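The descriptive percentages quoted above can be re-derived from the raw counts with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable and function names are illustrative choices and not part of the original analysis.

```python
# Counts quoted in the text for the 2020-21 EPL season.
games = 380
home_wins, away_wins = 144, 153
home_goals, away_goals = 514, 510

draws = games - home_wins - away_wins
total_goals = home_goals + away_goals


def pct(part, whole):
    """Share of `whole` expressed as a percentage, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


home_win_pct = pct(home_wins, games)
away_win_pct = pct(away_wins, games)
home_goal_pct = pct(home_goals, total_goals)
away_goal_pct = pct(away_goals, total_goals)

# Average net goals (home minus away) per game.
net_goals_per_game = (home_goals - away_goals) / games

print(home_win_pct, away_win_pct, draws, home_goal_pct, away_goal_pct)
print(round(net_goals_per_game, 4))
```

The near-zero net goal difference per game illustrates why raw results alone do not reveal a home field advantage in this season.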
We choose to focus on eleven in-game statistics as follows:

• Attacks w/ Shot - The number of times that an offensive team makes a forward move towards the opponent's goal (a dribble, a pass, etc.) followed by a shot;
• Defence Interceptions - The number of times that a defending team intercepts a pass;
• Reaching Opponent Box - The number of times that a team moves the ball into their opponent's goal box;
• Reaching Opponent Half - The number of times that a team moves the ball into their opponent's half;
• Shots Blocked - The number of times that a defending team deflects a shot on goal to prevent scoring;
• Shots from Box - The number of shots taken from the goal box;
• Shots from Danger Zone - The number of shots taken from the "Danger Zone", which is a relative area in the center of the field approximately 18 yards or less from the goal;
• Successful Key Passes - The number of passes that would have resulted in assists if the resulting shot had been made;
• Touches in Box - The number of passes or touches that occur within the penalty area;
• Expected Goals (XG) - The average likelihood a goal will be scored given the position of the player over the course of a game;
• Yellow Cards - The number of yellow cards given to a team in a game.

Table 1 contains the eleven summary statistics from the raw data, where each row shows the statistic, its primary role in the game (defense, offense, or referee), the means for the home team and for the away team, respectively, and the overall standard deviation. These in-game statistics are chosen because they are most relevant to studying the home field advantage and also sufficient to cover different aspects of the soccer games (e.g., team offensive and defensive performance).

To better understand the role of these selected statistics, we choose three of them (representing offense, defense, and referee), calculate the difference in these statistics between home and away games for each team, and present their distributions in Figure 1 to Figure 3. The offensive statistic is Reaching Opponent Half, and its distribution for different teams is illustrated in Figure 1. The difference between home and away teams, in theory, should be positive overall if the home field advantage is present, negative if there is actually an advantage towards the away team, and 0 if there is no advantage either way. From the figure, we can see a clear trend for home advantage, e.g., 17 of the 20 teams have a positive mean, which implies that they reach their opponents' half more often on average when they play at home as opposed to away. This finding is in fact fairly consistent across all offensive variables. The three teams that do not show evidence of home field advantage for Reaching Opponent Half are Aston Villa, Chelsea, and Southampton. The top three teams from that season (Manchester City, Manchester United, and Liverpool) exhibit similar patterns in the figure.

Defence Interceptions is chosen to represent the defensive statistics, and we present the distribution of its difference between home and away teams in Figure 2. Since defence interceptions negatively impact a team, the home field advantage will be seen here if, in contrast to the offensive statistic, the distribution of the difference is skewed negatively. Five teams do not show evidence of home field advantage: Chelsea, Crystal Palace, Everton, Fulham, and Southampton. Interestingly, the top three teams from that season (Manchester City, Manchester United, and Liverpool) all have negative means but show a different level of variability in this statistic.
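As an illustration of the home-minus-away differences described above, the following sketch computes, for each team, the statistic recorded at home against an opponent minus the statistic recorded away at that same opponent. This is one plausible reading of how the per-team distributions could be formed; the record layout, team labels, and numbers are hypothetical and are not taken from the Wyscout data.

```python
from statistics import mean

# Hypothetical records: (home_team, away_team, home_stat, away_stat).
matches = [
    ("A", "B", 60, 48), ("B", "A", 55, 57),
    ("A", "C", 62, 50), ("C", "A", 49, 58),
    ("B", "C", 51, 47), ("C", "B", 53, 50),
]


def paired_differences(records):
    """Per team: value when hosting an opponent minus value when visiting it."""
    # Value of the home team h in the match it hosts against a.
    at_home = {(h, a): hs for h, a, hs, _ in records}
    # Value of the away team a in the match it plays at h.
    on_road = {(a, h): aws for h, a, _, aws in records}
    teams = sorted({h for h, _, _, _ in records})
    return {
        t: [at_home[(t, o)] - on_road[(t, o)] for o in teams if o != t]
        for t in teams
    }


diffs = paired_differences(matches)
mean_gap = {team: mean(values) for team, values in diffs.items()}
print(diffs)
print(mean_gap)
```

A positive mean for an offensive statistic would point towards a home field advantage for that team, matching the interpretation given for Figure 1.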
Finally, the referee bias is represented by the team-level number of yellow cards in each

3 Method

Suppose there are $n$ different teams in the league, denoted by $\{T_1, \cdots, T_n\}$, with a total number of $N \equiv n(n-1)$ matches between any pair of two teams, $T_i$ and $T_j$, where $i \neq j$ and $i, j \in \{1, \cdots, n\}$. Without loss of generality, we assume $n \geq 3$. Therefore, each team $T_i$ has $(n-1)$ matches with home field advantage and another $(n-1)$ without. Define a treatment indicator $\delta_i = 1$ if team $T_i$ has the home field advantage in a match against team $T_j$ for $j \neq i$, and $\delta_i = 0$ otherwise. By definition, we have $\delta_i + \delta_j = 1$ for any match between $T_i$ and $T_j$. The main outcome is defined as the net difference in the outcome of interest, i.e., $Y_{i,j} = (c_i - c_j) \in \mathbb{R}$, where $c_i$ and $c_j$ are one of the eleven statistics that we described in Section 2 for team $T_i$ (with home field advantage) and team $T_j$ (without home field advantage), respectively.

Following the potential outcome framework (see e.g., Rubin, 1974), we define the potential outcome $Y^*(\delta_i = a, \delta_j = 1 - a)$ as the outcome of interest that would be observed after the match between team $T_i$ and team $T_j$, where $a = 1$ or $a = 0$ corresponds to team $T_i$ or team $T_j$ having the home field advantage, respectively. As standard in the causal inference literature (see e.g., Rosenbaum and Rubin, 1983), we make the following assumptions for any pair $i \neq j$.

(A1). Stable Unit Treatment Value Assumption:

(A2). Ignorability:

et al., 2017), to ensure that the causal effects are estimable from observed data. By game design, each team plays every other team exactly once with and once without home field advantage, hence the ignorability assumption holds automatically in our study. Assumption (A3) is to rule out the correlation between different matches that involve the same team. In reality, there are style rivalries between teams in soccer games. Hence the transitive relation does not always hold.
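The match design described in this section can be made concrete with a small sketch. It enumerates all ordered (home, away) pairs for $n$ teams and checks the counts stated above; the function name and the integer team labels are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations


def double_round_robin(n):
    """All ordered (home, away) pairs of distinct teams labelled 0..n-1."""
    return list(permutations(range(n), 2))


n = 20  # number of teams in the EPL
schedule = double_round_robin(n)

# delta is 1 for the home team and 0 for the away team, so delta_i + delta_j = 1
# holds in every match by construction.
home_games = Counter(home for home, _ in schedule)
away_games = Counter(away for _, away in schedule)

print(len(schedule))                 # N = n * (n - 1)
print(home_games[0], away_games[0])  # (n - 1) home and (n - 1) away matches
```

Each unordered pair of teams appears exactly twice, once with each team at home, which is the pairing that the estimation procedure relies on.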
In this section, we detail the proposed hierarchical causal model and its estimation and inference procedures. Specifically, we are interested in team level and league level estimators of causal effects. To this end, we define the causal effect of home field advantage associated with team $T_i$ as

$$\beta_i = E\{Y^*(\delta_i = 1, \delta_j = 0)\} - E\{Y^*(\delta_i = 0, \delta_j = 0)\}, \tag{1}$$

and the causal effect of home field advantage for the entire league as

$$\Delta = E(\beta_i). \tag{2}$$

Given a finite number of teams in the league, we are interested in two estimands, the average home field advantage of the league $\Delta \equiv \sum_{i=1}^{n} \beta_i / n$ and the team-specific home field advantage $\beta_i$. Yet, since there is no match on a neutral field in major professional sports, we always have $\delta_i + \delta_j = 1$, i.e., $Y^*(\delta_i = 0, \delta_j = 0)$ can never be observed. To address this difficulty and estimate $\beta_i$ from the observed data, we propose to decompose the outcome function into two parts, one corresponding to the home field advantage and the other representing the potential outcome in a hypothetical neutral field, via a mixed two-way ANOVA design. To be specific, the outcome of a match between team $T_i$ (with home field advantage) and team $T_j$ (without home field advantage) is modeled by

$$Y_{i,j} = \beta_i + \alpha_{i,j} + \epsilon_{i,j}, \tag{3}$$

where $\alpha_{i,j} = E\{Y^*(\delta_i = 0, \delta_j = 0)\}$ is the expected net outcome between team $T_i$ and $T_j$ in a hypothetical neutral field (i.e., if no team has the home field advantage) based on Assumption (A3), and $\epsilon_{i,j}$ is random noise following $N(0, \sigma_0^2)$. The factorization in (3) enables us to unravel the home field advantage at the team level by utilizing the pair-wise match design in soccer games, as we will discuss shortly. Assumption (A3) is required for (3) so that we can allow an independent baseline effect $\alpha_{i,j}$ for any pair of two teams in a hypothetical neutral field.
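In addition, we adopt a hierarchical model (Berry et al., 2013; Chu and Yuan, 2018; Geng and Hu, 2020) to characterize the relationship between the home field advantage of individual teams and that of the whole league as

$$\beta_i \sim N(\Delta, \sigma^2), \tag{4}$$

where $\sigma^2$ describes the variation of home field advantages across different teams. Based on (3), conversely, we have the model for the match between team $T_j$ (with home field advantage) and team $T_i$ (without home field advantage) as

$$Y_{j,i} = \beta_j + \alpha_{j,i} + \epsilon_{j,i}, \tag{5}$$

where the net score without home field advantage satisfies $\alpha_{i,j} = -\alpha_{j,i}$. Thus, combining (3) and (5), we have

$$Y_{i,j} + Y_{j,i} = \beta_i + \beta_j + \epsilon_{i,j} + \epsilon_{j,i}. \tag{6}$$

In other words, the $\alpha_{i,j}$'s are treated as nuisance parameters and do not need to be estimated in our model. By repeating (6) over all paired matches between team $T_i$ and team $T_j$, for $i \neq j$ and $i, j \in \{1, \cdots, n\}$, we have

$$
\begin{pmatrix} Y_{1,2} + Y_{2,1} \\ Y_{1,3} + Y_{3,1} \\ \vdots \\ Y_{(n-2),n} + Y_{n,(n-2)} \\ Y_{(n-1),n} + Y_{n,(n-1)} \end{pmatrix}
=
\underbrace{\begin{pmatrix}
1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
1 & 0 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \cdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 & 1 \\
0 & 0 & 0 & \cdots & 0 & 1 & 1
\end{pmatrix}}_{H}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{pmatrix}
+
\begin{pmatrix} \epsilon_{1,2} + \epsilon_{2,1} \\ \epsilon_{1,3} + \epsilon_{3,1} \\ \vdots \\ \epsilon_{(n-2),n} + \epsilon_{n,(n-2)} \\ \epsilon_{(n-1),n} + \epsilon_{n,(n-1)} \end{pmatrix}.
$$

This motivates us to estimate the team-specific home advantage effects through the above linear equation system. Specifically, since $H$ is a full rank matrix under $n \geq 3$, we use the following estimate of $\beta = [\beta_1, \cdots, \beta_n]^\top$,

$$\hat{\beta} = (H^\top H)^{-1} H^\top \mathbf{Y}, \tag{7}$$

where $\mathbf{Y}$ denotes the left-hand-side vector of paired sums $Y_{i,j} + Y_{j,i}$ in the system above. Based on the definition that $\Delta = \sum_i \beta_i / n$, we have an estimator of $\Delta$ as

$$\hat{\Delta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_i, \tag{8}$$

where $\hat{\beta}_i$ is the $i$-th element of $\hat{\beta}$. Following the standard theory for linear regression, we establish the normality of $\hat{\beta}$ in the following proposition.

Proposition 3.1 Assume the noise terms $\epsilon_{i,j}$ are independent and identically distributed (i.i.d.) Gaussian random variables, i.e., $\epsilon_{i,j} \sim N(0, \sigma_0^2)$. Under (A1)-(A3) with $n \geq 3$, we have

$$\hat{\beta} \sim N(\beta, \Sigma_\beta),$$

where $\beta = [\beta_1, \cdots, \beta_n]^\top$ denotes the true causal effect for the $n$ teams. The covariance matrix is $\Sigma_\beta = \sigma_\beta^2 (H^\top H)^{-1}$, where $\sigma_\beta^2$ denotes the variance of the summed noise $\epsilon_{i,j} + \epsilon_{j,i}$ (equal to $2\sigma_0^2$ under the i.i.d. assumption).

Proposition 3.1 holds for every $n \geq 3$ since the normality of $\hat{\beta}$ is exact. A two-sided $(1 - \alpha)$ marginal confidence band of $\beta$ can be obtained as

$$\hat{\beta}_i \pm z_{\alpha/2} \sqrt{\mathrm{diag}(\hat{\Sigma}_\beta)_i}, \quad i = 1, \cdots, n,$$

where $z_{\alpha/2}$ denotes the upper $\alpha/2$-th quantile of a standard normal distribution, using the estimate $\hat{\Sigma}_\beta$ of $\Sigma_\beta$ from Proposition 3.1.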
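To make the estimation step concrete, here is a minimal sketch of the least-squares estimate built from paired sums of home and away outcomes. It is not the authors' code: the helper names, the small Gauss-Jordan solver, and the noise-free toy data are all assumptions chosen so that the example runs with the standard library alone.

```python
import random
from itertools import combinations


def design_matrix(n):
    """One row per unordered pair (i, j), i < j, with ones in columns i and j."""
    pairs = list(combinations(range(n), 2))
    H = [[1.0 if k in pair else 0.0 for k in range(n)] for pair in pairs]
    return pairs, H


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(m):
        pivot = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(m):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[r][m] / M[r][r] for r in range(m)]


def estimate(Y, n):
    """Y[(i, j)] is the net outcome of the match that team i hosts against j."""
    pairs, H = design_matrix(n)
    z = [Y[(i, j)] + Y[(j, i)] for i, j in pairs]  # paired sums
    HtH = [[sum(row[a] * row[b] for row in H) for b in range(n)] for a in range(n)]
    Htz = [sum(row[a] * zi for row, zi in zip(H, z)) for a in range(n)]
    beta_hat = solve(HtH, Htz)        # solves the normal equations
    delta_hat = sum(beta_hat) / n     # league-level average
    return beta_hat, delta_hat


# Toy data (hypothetical): antisymmetric baselines and no noise, so the
# team-level effects are recovered up to floating-point error.
random.seed(0)
n = 5
beta_true = [0.5, 1.0, 1.5, 2.0, 2.5]
Y = {}
for i, j in combinations(range(n), 2):
    alpha = random.gauss(0.0, 1.0)    # neutral-field baseline for (i, j)
    Y[(i, j)] = beta_true[i] + alpha
    Y[(j, i)] = beta_true[j] - alpha  # baseline flips sign for the reverse fixture

beta_hat, delta_hat = estimate(Y, n)
print([round(b, 6) for b in beta_hat], round(delta_hat, 6))
```

Because the baselines cancel in each paired sum, they never need to be estimated. In this balanced design the league-level estimate also coincides with the plain average of the net outcome over all matches, which gives a quick sanity check on any implementation.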
Since $\hat{\Delta}$ is the sample mean of $\hat{\beta}_i$ as indicated in (8), the normality of the estimated average treatment effect can be obtained immediately from Proposition 3.1.

Proposition 3.2 Suppose that the same set of assumptions in Proposition 3.1 hold. Let where an unbiased estimator for the variance $\sigma^2$ is and $\mathrm{diag}(W)_i$ is the $i$-th diagonal element of matrix $W$.

The first part of Proposition 3.2 is obvious since $\hat{\Delta}$ is a linear combination of $\hat{\beta}$, which follows a multivariate normal distribution. However, the estimation of $\sigma^2$ is non-trivial because the off-diagonal entries in the covariance estimator for $\Sigma_\beta$ do not perform well due to the high dimensionality, i.e., the number of free parameters in $\Sigma_\beta$ is $O(n^2)$ and our data have a sample size of the same order. Therefore, the usual quadratic form estimator for $\sigma^2$, $n^{-1} \hat{\sigma}_\beta^2 \mathbf{1}^\top (H^\top H)^{-1} \mathbf{1}$, is biased, where $\mathbf{1}$ is the vector of $n$ ones. To solve this problem, the key observation is to use the law of total variance as suggested in the hierarchical modeling literature (Berry et al., 2013; Chu and Yuan, 2018; Geng and Hu, 2020) by considering where $O$ is the observed data. The first term $\mathrm{Var}_i(\hat{\beta}_i)$ can be estimated by the sample variance $\sum_i (\hat{\beta}_i - \hat{\Delta})^2 / (n-1)$, and the second term can be estimated by $\sum_i \mathrm{diag}\{\hat{\sigma}_\beta^2 (H^\top H)^{-1}\}_i / n$. Both estimators are unbiased. Hence (10) holds. The two-sided $(1 - \alpha)$ confidence interval of $\Delta$ thus can be constructed as based on Proposition 3.2.

Remark 3.1 The normality of $\beta_i$ in (4) can be relaxed to other distributions with mean $\Delta$ and variance $\sigma^2$, which leads to an asymptotic normality of $\hat{\Delta}$ as when $n \to \infty$, by the central limit theorem. Yet, in reality, we have a finite and usually fairly small number of teams in one league, such as $n = 20$ for the EPL. Therefore we choose to keep the normality assumption.

In this section, simulation studies are conducted to evaluate the empirical performance of the estimators for both the $\beta$'s and $\Delta$. We consider two data generation scenarios for $\alpha_{ij}$ in our simulation.
In the first scenario, we generate ability difference between team $T_i$ and team $T_j$

In total, we generate 1,000 independent replicates for each scenario. The performance of the estimates is evaluated by the bias, the coverage probability (CP), the sample variance of the 1,000 estimates (SV), and the mean of the 1,000 variance estimates (MV). Taking $\Delta$ as an example, $\hat{\Delta}^{(i)}$, $(\hat{\Delta}^{(i)}_{\alpha/2}, \hat{\Delta}^{(i)}_{1-\alpha/2})$, and $\widehat{\mathrm{Var}}(\hat{\Delta}^{(i)})$ are the point estimate, $(1 - \alpha)$-level confidence bands, and variance estimate from the $i$th replicated simulation data, respectively. We set $\alpha = 0.05$ throughout the simulation.

We first examine the simulation results for $\hat{\Delta}$, as well as the variance estimator of $\hat{\Delta}$. Table 2 summarizes the performance of the $\Delta$ estimator in different data generation scenarios, and reports the bias, coverage probability, sample variance of the 1,000 estimates, and mean of the 1,000 variance estimates. From this table, it is clear that our proposed estimator for $\Delta$ has a very small bias that decreases quickly as $n$ increases. The coverage probabilities are also very close to the desired 95% nominal coverage level, especially when $n \geq 20$. Note that the standard error of the estimated coverage probabilities is $\sqrt{0.05 \times 0.95 / 1{,}000} = 0.0069$. The last two columns also confirm the accurate variance estimation for $\hat{\Delta}$ under all the simulation scenarios, and the estimation accuracy becomes better as $n$ increases.

Next we examine the simulation results for the $\beta$'s, as well as the coverage probability of the $\beta$'s. Figures 4 to 7 present boxplots of the estimates of all $\beta$'s under the different data generation processes to show estimation performance. It is clear from these figures that the estimator is essentially unbiased for all scenarios considered in the simulation. The coverage probability in general becomes more accurate as the sample size $n$ increases.
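The four summary measures named above (bias, CP, SV, MV) and the Monte Carlo standard error of a coverage estimate can be computed as in the following sketch. The formulas are the standard ones implied by the verbal definitions; the function names are hypothetical, the interval is assumed to be the symmetric normal-theory interval, and the replicates are simulated stand-ins rather than output of the paper's data generation scenarios.

```python
import math
import random
from statistics import mean, variance


def summarize(estimates, var_estimates, truth, z=1.959964):
    """Bias, coverage probability, sample variance and mean variance estimate."""
    covered = sum(
        abs(est - truth) <= z * math.sqrt(var)
        for est, var in zip(estimates, var_estimates)
    )
    return {
        "bias": mean(estimates) - truth,
        "CP": covered / len(estimates),
        "SV": variance(estimates),   # sample variance of the point estimates
        "MV": mean(var_estimates),   # mean of the variance estimates
    }


def coverage_se(nominal, replicates):
    """Monte Carlo standard error of an estimated coverage probability."""
    return math.sqrt(nominal * (1.0 - nominal) / replicates)


# Stand-in replicates: normal estimates with a known variance of 0.04.
random.seed(1)
truth = 0.5
estimates = [random.gauss(truth, 0.2) for _ in range(1000)]
var_estimates = [0.04] * 1000

result = summarize(estimates, var_estimates, truth)
se = coverage_se(0.95, 1000)
print(result, round(se, 4))
```

When the variance estimator is accurate, SV and MV should be close to each other and CP should fall within a few Monte Carlo standard errors of the nominal level.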
In summary, the simulation results confirm the excellent performance (e.g., bias, variance estimate, and coverage probabilities) of our method in terms of estimating both $\Delta$ and the $\beta$'s. Note that our simulation scenarios also include the situation when the $\alpha_{ij}$ are not i.i.d.

We focus on the eleven selected in-game statistics collected from 20 teams in the 2020-2021 English Premier League as described in Section 2. For each statistic, we treat it as the main outcome and apply the proposed causal inference method to calculate the team-specific home field advantage $\hat{\beta}_i$ and the league-wide home field advantage $\hat{\Delta}$ to further analyze the implications of the home field advantage on a team-by-team basis and a statistic-specific basis. Figure 9 shows the 95% confidence intervals for the $\beta_i$. A value of $\beta = 0$ would indicate no home field advantage. Brighton is clearly a standout team here, with a wide confidence interval across all statistics. In Figure 9, between 3 and 6 of the 20 teams were significant for each of the eleven statistics. While this is clearly not a majority, it does give us some information on the type of statistic that retains the home field advantage, which we will discuss more in the next section.

Conversely, we can look at the significance of the $\beta_i$ by team, which adds to the story. Figure 10 shows the p-values of the $\beta_i$'s for each team (x-axis and color) and for each in-game statistic (denoted by shape). While a majority of the p-values are above the standard 0.05 threshold for significance, there are some teams for which a majority of the in-game statistics are significant. We now look at the makeup of those particular teams and begin to draw inferences about what makes those specific teams susceptible to the home field advantage. The teams that had the highest number of significant statistics were Fulham, Brighton, Newcastle United and Wolverhampton Wanderers.
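The confidence intervals and p-values discussed above can be illustrated with a normal-theory calculation for a single estimate. This is a sketch under the assumption that significance is judged from a two-sided test of $\beta_i = 0$ based on the normal distribution of the estimator; the function name and the input numbers are hypothetical and are not results from the paper.

```python
import math


def wald_summary(estimate, std_err, z_crit=1.959964):
    """Two-sided normal-theory p-value and 95% interval for one estimate."""
    z = estimate / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    interval = (estimate - z_crit * std_err, estimate + z_crit * std_err)
    return p_value, interval


# Hypothetical estimate and standard error, for illustration only.
p_value, (lower, upper) = wald_summary(estimate=2.0, std_err=1.0)
significant = p_value < 0.05
print(round(p_value, 4), round(lower, 3), round(upper, 3), significant)
```

An interval that excludes zero and a p-value below 0.05 are two views of the same decision, which is why the interval plot and the p-value plot tell a consistent story.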
This prompted us to ask what these teams had in common that the other teams did not that caused them to reap the benefits of the home field advantage. The most obvious and telling common factor amongst the teams was their records-they all were ranked in the bottom 50% at the end of the season, with Fulham being one of the three relegation teams in the English Premier League that year. This could imply that less performant teams reap higher benefits of home-field advantage than teams that excel. Teams that perform in the top rankings of the English Premier League can dominate their opponents regardless of their match location. Perhaps their talent and skill prevails such that the difference in their statistics when they are home versus when they are away is nearly indiscernible. In contrast, teams that are not as talented or skilled retain every advantage from being at home -from increased confidence due to crowd involvement, to familiarity, and to a lack of travel fatigue. The ∆ values tell us the estimated home field advantage that a specific statistic awards across the league. Significant ∆ values give us insight to the statistics that have the highest impact on the home field advantage and how much they contribute. Table 3 shows the estimated ∆ values for each of the in-game statistics, their estimated standard deviations, and their p-values for significance. Seven of the eleven statistics we chose to analyze proved to significantly have an effect on the net increase (or decrease) of a statistic in favor of the home team. We notice that offense based statistics, such as attacks with shot and reaching opponent box, are significant at α = 0.05; while defense based statistics, such as shots blocked and defence interceptions, are not significant, as well as the referee based statistic, yellow cards. This could be an indication that teams retain an advantage offensively when they compete at their home field. 
Going back to the causes of home-field advantage and examining the statistics, it appears that familiarity of the field would have an impact on the outcomes of the offensive statistics. For example, attacks with shot, reaching the opponent's half or box, shots from box or danger zone, and successful key passes are statistics that are based on the players' ability to get into scoring position, which would increase their likelihood of scoring a goal. Goals themselves did not show a significant home field advantage, which allows us to infer that perhaps the number of goals scored by the home team may not be more than that of the away team, but the opportunities that are presented due to the causal factors of the home field advantage phenomenon are significantly greater. That is, the quality of play is better for the home team than the away team.

Our research found significant evidence of the home field advantage in various soccer statistics. The home field advantage resides more heavily in offensive-based statistics than it does in defensive or referee based statistics. This does not speak to the relative importance of defense and offense, but rather indicates that the home field advantage phenomenon is not as prominent in the defensive side of soccer. We found no home field advantage exhibited by the officials of the game, a positive indication of unbiasedness in the sport. We elected to deeply analyze the eleven statistics that we chose based on their $\Delta$ values and $\beta$ values. The statistics that showed significance had less to do with the stand-out statistics of soccer such as goals, free kicks, and fouls. These statistics were more based in quality of play. For example, successful key passes and attacks with shots are statistics based on the quality of the action. What we can derive from this is that the home team is not necessarily benefiting in terms of the obvious, but in the details.
The home team takes advantage of being familiar enough with their field such that they get more shots off of attacks and passes, and reach the opponent's box more times. While this may not show up in the box score, it nevertheless can give the home team an edge. Further, we discovered that teams that performed poorly in the season and had lower rankings in the English Premier League retained higher home field advantage. Less successful teams would likely be more confident and comfortable playing in their home environment, while the highly successful teams would be confident playing anywhere.

There are several possible directions for further investigations. First, considering the specific factors such as crowd involvement, familiarity of facilities, and travel fatigue that make up the home field advantage is an interesting future direction. Furthermore, building multiple comparison procedures and multivariate response models would merit future research from both methodological and applied perspectives. In addition, our current model depends on the normality assumption for responses. Proposing distribution-free estimators as well as inference procedures will broaden the applications in different areas. From an application point of view, studying the in-game statistics over the Covid-19 season in comparison to previous years would help to identify the strength of crowd involvement in the home field advantage.