key: cord-0143038-r5azbrpd
authors: Bond, A. J.; Beggs, C. B.
title: Bisecting for selecting: using a Laplacian eigenmaps clustering approach to create the new European football Super League
date: 2021-04-20
journal: nan
DOI: nan
sha: 798ff4dcfe26d78cbd1cb49274612a76a13e834a
doc_id: 143038
cord_uid: r5azbrpd

We use European football performance data to select teams to form the proposed European football Super League, using only unsupervised techniques. We first used random forest regression to select important variables predicting goal difference, which we used to calculate the Euclidian distances between teams. Creating a Laplacian eigenmap, we bisected the Fielder vector to identify the five major European football leagues' natural clusters. Our results showed how an unsupervised approach could successfully identify four clusters based on five basic performance metrics: shots, shots on target, shots conceded, possession, and pass success. The top two clusters identify those teams who dominate their respective leagues and are the best candidates to create the most competitive elite super league.

Operational research (OR) has a long history of using sport to explore operational insights and methodologies (see Wright, 2009 for a review). Recently association football (herein football, also known as soccer), has become a popular data source to understand a range of operational issues like scheduling (Durán et al., 2017; Yi et al., 2020) ; rating and valuing (Baker & McHale, 2018; Kharrat et al., 2020; Müller et al., 2017) ; managerial succession (Beggs & Bond, 2020; Flores et al., 2012) ; events planning (Ghoniem et al., 2017) ; performance prediction (Beggs et al., 2019; Butler et al., 2021) , to name a few. While data availability makes football attractive to operational researchers, football needs operational research to drive decision-making. For example, currently, there are substantial plans to disrupt European football operations to introduce a new elite European Super League similar to the EuroLeague established in basketball (West, 2018) . Indeed, this is not a new proposal, it has rumbled under the surface since the nineties, but with a reported $6 billion funding package from investment firm JP Morgan Chase, it looks more likely (Conn, 2021; McInnes, 2020) .

With annual revenues of approximately $28 billion per annum (Deloitte, 2020), European football (or soccer) presents a world of high finance. In addition to high revenue-generating capacity, it is considered 'recession proof' (Dimitropoulos et al., 2016) . While the entertainment industry deals with significant losses throughout the COVID-19 pandemic, European football performed reasonably well with an estimated $26billion revenue, losing roughly twelve per cent (Deloitte, 2021) . European football's sustained revenue-generating capacity predominantly stems from the global demand for the five major leagues; English Premier League, Spanish La Therefore, selecting the right clubs to form a new league is imperative to a successful league and ensure a return on any investments. While there are numerous ways to measure competitiveness, these are often based on win percentages or points allocations (see Lee et al., 2019a Lee et al., , 2019b Owen & King, 2015) . More importantly, they mainly provide an overall measure to make comparisons over time or compare other sports leagues and provide little information about the 'who' is dominant in the market. Therefore, we propose a different, unsupervised data-driven approach to identify 'who' dominates the European leagues. Indeed, there are several attempts to classify or rank European football teams, such as; the Euro Club Index (EuroClubIndex.com, 2021), the ClubElo index (clubelo.com, 2021) , and the UEFA club coefficient rankings (UEFA, 2021) . However, these tend to rely on either established rating systems such as the Elo system (Elo, 1978) , or the awarding of arbitrary points for progression in cup competitions. They reveal nothing about the natural groupings (clusters) that might exist in European football. Our approach is to blend machine learning and graph theory approaches to identify natural clusters of teams in Europe's five major leagues, using simple performance data. Not only does this provide an innovative approach to sport operations, allowing a more objective approach to selecting teams into the new league, but it also offers a methodology to allow operational researchers to identify natural clusters of firms within markets. 

The study aimed to develop a methodology for identifying natural groupings between teams in the various European soccer leagues, using season match data alone (excluding goals scored or conceded). We performed an exploratory analysis using basic univariate analysis on the variables used in this study before conducting a random forest regression analysis to identify the measured variables that best predicted the goal difference for the respective soccer teams. Goal difference was used because it is a better measure of team performance and less susceptible to bias than points total, which is influenced by the number of teams in the respective leagues (Heuer & Rubner, 2009 ).

Having identified the variables that best predicted end-of-season goal difference, we then computed the Euclidean distances between the respective teams in the vector space and used them to produce Laplacian eigenmaps of the data (Belkin & Niyogi, 2003) . Laplacian eigenmaps are constructed from the eigenvectors of a graph Laplacian matrix. They are essentially an embedding algorithm that seeks to project pair-wise proximity information onto a low dimensional space to preserve local structures in the data. Unlike linear

An exploratory random forest regression was performed to assess the observed variables' relative importance as predictors of goal difference. Random forest analysis is an ensemble classification technique popular in machine learning that generalises classification trees (Boinee et al., 2008; Breiman, 2001) . It is a robust technique resistant to over-fitting and does not require strict distributional assumptions (Breiman, 2001; Izenman, 2013) . Crucially, it has the great advantage that it can assess variable importance, thus enabling the user to discard redundant variables that do not assist in the prediction process.

Random forest models produce many regression trees that use recursive partitioning of data to group observations into predefined classes through binary splitting the predictor variables (Hansen et al., 2015) . Bias and over-fitting are minimised by employing a combination of bootstrap bagging and utilising a random subset of predictor variables (generally the square root of the total number of predictors in the model) at each split.

Each regression tree in the random forest is built using a bootstrapping algorithm, which randomly 'bags' a sample from approximately two-thirds of the data for training purposes, leaving the remaining one-third of the cases or out-of-bag (OOB) cases to assess the performance of the regression tree (Boinee et al., 2008; Cutler et al., 2007) . For each tree, the prediction error -mean squared error (MSE) in the case of a regression treeis computed. These are then pooled to give an overall measure of classification accuracy, thus ensuring that the assessment is unbiased (Pecl et al., 2011) .

We used the 'randomForest' package (Liaw & Wiener, 2002) in R (R Core Team, 2020) to perform a random forest analysis involving the creation of 500 random trees. Initial analysis was undertaken using all thirteen predictor variables to identify those variables that significantly influenced the outcome variable,

Goal_Difference. The 13 predictor variables used to predict goal difference were; shots on target; possession;

shots; shots conceded; pass_success; dribbles; aerials won; offsides; tackles; yellow cards; red cards; fouls; fouled; interceptions, described in Table 1 . The relative importance of the variables was assessed using the Gini variable importance measure (VIM), which we corrected for bias using the heuristic strategy proposed by Zuccolotto (2008, 2010) and implemented by Carpita et al. (2014) . For every node split in a tree, the Gini impurity criterion (which assesses the data's heterogeneity) for the two descendent nodes is less than that of the parent node (Friedman, 2001) . Therefore, adding up the Gini decreases for each variable over all trees in the forest, it is possible to achieve a measure of variable importance. In our analysis, variables that exceeded the inflexion point's value on the Gini VIM curve were deemed to be influential and thus retained when the random forest model was refined. Having identified the key variables that best predicted goal difference, we then repeated the random forest analysis using the refined model to understand the prediction accuracy that could be achieved. Prediction of the respective teams' goal differences was then performed using the refined model and an ensemble prediction algorithm that aggregated 500 predictions. Because random forests use a self-validating MSE rate, there is strictly no need for cross-validation or a separate validation test to obtain an unbiased estimate of model error (Pecl et al., 2011) . However, to demonstrate the refined random forest model's validity, we performed k-fold cross-validation using ten randomly sampled 'folds' of approximately equal size.

Spectral cluster analysis was performed using a Laplacian eigenmaps method to visualise relationships between the respective teams and identify natural sub-groups within the data, as described by Belkin & Niyogi (2003) . This approach involves computing the pair-wise Euclidean distances between the respective teams using the key variables identified by the random forest analysis. These were transformed into a [150 x 150] similarity matrix, Q, using a Gaussian radial basis function (rbf) kernel (Schölkopf et al., 2004) , with =1, as follows:

where; E is the matrix of pair-wise Euclidean distances. The non-linear Gaussian function filtered the Euclidean distance matrix so that edges between close neighbours were given more weight compared with those between teams more distantly separated. From this, the modified similarity matrix, W, was constructed by subtracting the [150 x 150] identify matrix, I, from the similarity matrix, Q:

This was then be used to construct the degree matrix, D, as follows:

= .

( 3) where, n is a [150 x 1] vector of ones and D is:

Having computed the degree matrix, D, the Laplacian, L, and normalised Laplacian, Lnorm, matrices (both symmetric, positive semi-definite matrices) were then constructed (Chung & Graham, 1997; Qiu & Hancock, 2004; von Luxburg, 2007) , as follows:

After this, eigendecomposition of the normalised Laplacian matrix, Lnorm, was performed in order to compute the diagonal matrix of eigenvalues, L, and the matrix of eigenvectors, V, as follows:

However, unlike PCA, where the eigenvectors corresponding to the largest eigenvalues are used to construct the principal components, Laplacian eigenmaps construct a configuration from the eigenvectors corresponding to the two or three smallest positive eigenvalues. Because the smallest eigenvalue equals zero, the eigenvector corresponding to this eigenvalue is often ignored, and instead, the eigenvectors associated with the successive two or three smallest positive eigenvalues are used to construct the map (Qiu & Hancock, 2004) . In our case, we used the last three positive eigenvectors, fourth, third and second (Fielder) smallest eigenvectors, to construct 3D Laplacian eigenmaps of the European football teams. We used third and Fielder vectors to construct 2D Laplacian eigenmaps.

Laplacian eigenmaps are considered to be a spectral clustering technique. As such, it exhibits a critical property first discovered by Fiedler (1975) , namely that the eigenvector associated with the second smallest eigenvalue To identify how many bisections were appropriate to establish the natural clusters in the data, we ran a cluster validation using the 'clValid' package in R (Brock et al., 2008) . To do so, we used the self-organising maps algorithm (Kohonen, 1991 (Kohonen, , 2012 since it is an unsupervised learning technique partitioning data using artificial neural networks. To determine the suitability of 2 -6 partitions of the fielder vector, internal consistency was measured by the Dunn Index (Dunn, 1974) and Silhouette Width (Rousseeuw, 1987) , both of which should be maximised (see Handl et al., 2005 , for a review). The Silhouette Widths were also used to inspect final cluster classifications, following the Fielder vector's bisection. To visualise natural clustering, we created an undirected graph network using the inverse of the Euclidean distances between the respective teams.

The descriptive analysis results using the aggregated data split by benchmark percentiles (top, middle, bottom) and the one-way ANOVA are presented in Table 1 . Unsurprisingly, the top teams had significantly greater possession and pass success; conceded fewer shots; made more dribbles and shots than weaker teams (all p<0.001); had greater possession and pass success (both p<0.001); and made more dribbles and shots (both p<0.001), compared with the weaker teams. In addition, they made significantly fewer fouls (p = 0.037) but did not significantly receive less yellow (p = 0.214) and red (p = 0.406) cards. 

The exploratory random forest analysis incorporating all the predictor variables produced a regression model with an MSE of 115.62 and an r 2 value of 0.7875 (or 78.75% explained variance), which was used to assess variable importance (see Figure 1 ). From Figure 1 , it can be seen that the Gini VIM values for the five variables:

Shots_OT (on target); Possession; Shots_conceded; Shots; and Pass_Success, were far more than the values for the other variables, which were subsequently discarded from the refined random forest regression model.

As such, this indicates that these five variables were the best predictors of end-of-season goal difference. (14) Fouls

The refined random forest analysis utilising only these important variables produced a regression model with an MSE of 113.84 and an r 2 value of 0.7908 (79.08% variance explained). This was confirmed by the 10-fold cross-validation process, which found the cross-validation MSE to be 113.44, with a cross-validation r 2 value of 0.7908 (79.08% variance explained. The relationship between predicted and actual goal difference for the respective clubs is shown in Figure 2 . From this, it can be seen that the refined random forest model predicted the end-of-season goal difference with a high degree of accuracy. Figure 2 . Scatter plot of predicted goal difference versus actual goal difference for the refined random forest regression model

The 3D Laplacian eigenmaps of the teams are presented in Figure 3 , which shows a scatter plot of the three smallest positive eigenvectors. The 3D plots demonstrate a spiral-like curve between the three dimensions, demonstrating a hierarchal structure. Figure 4 shows the 2D Laplacian eigenmap with the Fielder vector plotted against the third smallest eigenvector. Here it shows a characteristic U-shaped curve, with the teams distributed along its length. 

The Dunn Index and Silouhette Width results for the self-organising maps cluster validation are presented in The three bisections of the Fielder vector are presented in Figure 7 , creating 4 clusters SC1-SC4. Here the clusters demonstrate a group of four very strong dominating teams (SC1), fifteen strong teams (SC2), thirty-seven medium-strength teams (SC3) and ninety-four weaker teams (SC4). Overall, the natural clusters identified by the Fielder vector algorithm are well defined, with an average Sillouhette Width = 0.61, and no cluster below 0.50 ( Figure 8 and Table 2) , and a Dunn Index = 0.0098. The lowest internally valid cluster is SC2 with a Silhouette Index = 0.50, suggesting this group is more heterogeneous than homogeneous. The natural clustering network graph is visualised in Figure 9 , which shows how cohesive the clusters are based on the inverse Euclidian distance. 2, 55, 50, 15, 54, 42, 86, 47, 21, 4, 73, 59, 69, 43 0.50 SC3 37 133, 147, 135, 120, 150, 105, 67, 148, 68, 49, 84, 76, 20, 75, 81, 78, 40, 98, 14, 72, 85, 70, 61, 28, 16, 46, 74, 11, 89, 6, 94, 144, 79, 44, 56, 124, 8 0.65

Using the Fielder vector allows natural groupings of teams (or firms) to be created, and the results show that using the Fielder vector algorithm is relatively effective in finding natural clusters within European football teams. By using unsupervised machine learning and clustering methods, we can objectively identify the dominant teams across Europe. Therefore, clusters 1 and 2 demonstrate the best teams to compete in an elite European Super League -should it be created. Indeed, this methodology can be applied to multiple applications within operational research generally. Within a similar context to this paper, the approach can be applied to understanding which players naturally cluster together based on performance metrics to aid decision-making regarding player acquisitions and development.

Likewise, it could support merger and acquisition decisions by identifying creditable firms or understanding the impact of strategic choices when creating competitive advantage.

Concerning the specific question of 'who' are the top teams in European soccer, the Laplacian eigenmap methodology classified 15 out of the 16 'breakaway' teams as candidates for the elite league (Der Spiegel, 2018). However, our approach did not select Atlético Madrid and instead selected Napoli, Tottenham Hotspur, Lyon and Fiorentina to the elite European Super League. However, as with any research, there are limitations;

firstly, we only consider simple performance metrics, and with the advances in data quality, there are opportunities for more advanced metrics to identify similarities and differences from a performance perspective. Secondly, we do not assess whether the teams selected by bisecting the Fielder vector would make for a competitive league. Thus, further OR research should look to take advantage of more granular and advanced performance data to classify sports teams and forecast what this new elite league will look like with the teams selected here.

Declarations of interest: none. 

Football: Documents Show Secret Plans for Elite League of Top Clubs -DER SPIEGEL

Managing the European football industry: UEFA's regulatory intervention and the impact on accounting quality

Well-Separated Clusters and Optimal Fuzzy Partitions

Scheduling the South American Qualifiers to the 2018 FIFA World Cup by integer programming

The rating of chessplayers, past and present

Latest Ranking -Euro Club Index

A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory

Decision taking under pressure: Evidence on football

Spectral simplification of graphs. European Conference on Computer Vision

R: A language and environment for statistical computing. R Foundation for Statistical Computing

The Financial Impact of (Foreign) Private Investors on Team Investments and Profits in Professional Football: Empirical Evidence from the Premier League

Silhouettes: A graphical aid to the interpretation and validation of cluster analysis

A bias correction algorithm for the gini variable importance measure in classification trees

Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms

Kernel Methods in Computational Biology

On the Fiedler vectors of graphs that arise from trees by Schur complementation of the Laplacian

European football leagues to oppose breakaway Super League, saying domestic game is at the heart of game

Club coefficients . UEFA

A tutorial on spectral clustering

European Super League: Will football follow basketball's lead? -BBC Sport

Football Statistics

The relationship between ownership structure and club performance in the English Premier League. Sport, Business and Management

50 years of OR in sport

Proactive and reactive strategies for football league timetabling