authors: Hamilton, Ian; Firth, David
title: Retrodictive Modelling of Modern Rugby Union: Extension of Bradley-Terry to Multiple Outcomes
date: 2021-12-21

Frequently in sporting competitions it is desirable to compare teams based on records of varying schedule strength. Methods have been developed for sports where the result outcomes are win, draw, or loss. In this paper those ideas are extended to account for any finite multiple outcome result set. A principle-based motivation is supplied and an implementation presented for modern rugby union, where bonus points are awarded for losing within a certain score margin and for scoring a certain number of tries. A number of variants are discussed including the constraining assumptions that are implied by each. The model is applied to assess the current rules of the Daily Mail Trophy, a national schools tournament in England and Wales.

There is a deep literature on ranking based on pairwise binary comparisons. Prominent amongst proposed methods is the Bradley-Terry model, which represents the probability that team i beats team j as $\pi_i / (\pi_i + \pi_j)$, where $\pi_i$ may be thought of as representing the positive-valued strength of team i. The model was originally proposed by Zermelo (1929) before being rediscovered by Bradley and Terry (1952). It was further developed by Davidson (1970) to allow for ties (draws); by Davidson and Beaver (1977) to allow for order effects, or, in this context, home advantage; and by Firth (https://alt-3.uk/) to allow for standard association football scoring rules (three points for a win, one for a draw). Bühlmann and Huber (1963) showed that the Bradley-Terry model is the unique model that comes from taking the number of wins as a sufficient statistic.
Later, Joe (1988) showed that it is both the maximum entropy and maximum likelihood model under the retrodictive criterion that the expected number of wins is equal to the actual number of wins, and derived maximum entropy models for home advantage and matches with draws. These characterisations of the Bradley-Terry model may be seen as natural expressions of a wider truth about exponential families, that if one starts with a sufficient statistic then the corresponding affine submodel, if it exists, will be uniquely determined and it will be the maximum entropy and maximum likelihood model subject to the 'observed equals expected' constraint (Geyer and Thompson, 1992). In this paper, the maximum entropy framing is used as it helps to clarify the nature of the assumptions being made in the specification of the model.

Situations of varying schedule strength occur frequently in rugby union. They are apparent in at least five particular scenarios. First, in two of the top club leagues in the world, Pro14 and Super Rugby, the league stage of the tournament is not a round robin, but a conference system operated with an over-representation of matches against teams from the same conference and country. Second, in professional rugby such situations occur at intermediate points of the season, whether the tournament is of a round robin nature or not. Third, in Europe, a significant proportion of teams in the top domestic leagues (Pro14, English Premiership, Top14) also compete in one of the two major European rugby tournaments, namely the European Rugby Champions Cup and the European Rugby Challenge Cup. The preliminary stage of both these tournaments is also a league-based format. If the results from the European tournaments are taken along with the results of the domestic tournaments then a pan-European system of varying schedule strengths may be considered.
Fourth, fixture schedules may be disrupted by unforeseen circumstances causing the cancellation of some matches in a round robin tournament. This has been experienced recently due to COVID-19. Fifth, schools rugby fixtures often exist based on factors such as geographical location and historical links and so do not fit a round robin format. The Daily Mail Trophy is an annual schools tournament of some of the top teams in England and one team from Wales that ranks schools based on such fixtures.

In modern rugby union the most prevalent points system is as follows:

4 points for a win
2 points for a draw
0 points for a loss
1 bonus point for losing by a match score margin of seven or fewer
1 bonus point for scoring four or more tries

This is the league points system used in the English Premiership, Pro14, European Rugby Champions Cup, European Rugby Challenge Cup, the Six Nations (see footnote 1), and the pool stages of the most recent Rugby World Cup, which was held in Japan in 2019. In the southern hemisphere, the two largest tournaments, Super Rugby and the Rugby Championship, follow the same points system except that a try bonus point is awarded when a team has scored three more tries than the opposition, so at most one team will earn a try bonus. In the French Top14 league the try bonus point is also awarded on a three-try difference but with the additional stipulation that it may only be awarded to a winning team. The losing bonus point in the Top14 is also different in awarding the point at a losing margin of five or fewer instead of seven or fewer. Together these represent the largest club and international tournaments in the sport. For the rest of this paper the most prevalent system, the one set out above, will be used. The others share the same result outcomes formulation and hence a substantial element of the model.
When appropriate, methodological variations will be mentioned that might better model the alternative try-bonus method, where the bonus is based on the difference in the number of tries and may only be awarded to one team. For the avoidance of confusion, for the remainder of this paper, the points awarded due to the outcomes of matches and used to determine a league ranking will be referred to as 'points' and will be distinguished from the in-game accumulations on which match outcomes are based, which will henceforth be referred to as 'scores'. Likewise, 'ranking' will refer to the attribution of values to teams that signify their ordinal position, while 'rating' will refer to the underlying measure on which a ranking is based.

It is not the intention here to make any assessment of the relative merits of the different points systems, rather to take the points system as a given and to construct a coherent retrodictive model consistent with that, whilst accounting for differences in schedule strength. In doing so, a model where points earnt represent a sufficient statistic for team strength is sought. It is important to understand therefore that this represents a 'retrodictive' rather than a predictive model. This is a concept familiar in North America where the KRACH ("Ken's Rating for American College Hockey") model, devised by Ken Butler, is commonly used to rank collegiate and school teams in ice hockey and other sports (Wobus, 2007).

Footnote 1: In the Six Nations there are an additional three bonus points for any team that beats all other teams, in order to ensure that a team with a 100% winning record cannot lose the tournament because of bonus points.

The paper proceeds in Section 2 with derivation of a family of models based on maximum entropy, a discussion of alternatives, and the choice of a preferred model for further analysis. Section 3 proposes estimation of the model through a loglinear representation.
In addition, the implementation of a more intuitive measure of team strength, the use of a prior, and an appropriate identifiability constraint, are also discussed. In Section 4 the model is used to analyse the current Daily Mail Trophy ranking method, and in Section 5 some concluding remarks are made.

The work of Jaynes (1957) as well as the Bradley-Terry derivations of Joe (1988) and Henery (1986) suggest that a model may be determined by seeking to maximise the entropy under the retrodictive criterion that the points earnt in the matches played are equal to the expected points earnt given the same fixtures under the model. Taking the general case, suppose there is a tournament where rather than a binary win/loss there are multiple possible match outcomes. Let $p^{ij}_{a,b}$ denote the probability of a match between i and j resulting in i being awarded a points and j being awarded b, with $m_{ij}$ the number of matches between i and j. Then we may define the entropy as

$$S(p) = -\sum_{i<j} m_{ij} \sum_{a,b} p^{ij}_{a,b} \log p^{ij}_{a,b}. \quad (1)$$

This may be maximised subject to the conditions that for each pair of teams the sum of the probabilities of all possible outcomes is 1,

$$\sum_{a,b} p^{ij}_{a,b} = 1 \quad \text{for all } i, j, \quad (2)$$

and the retrodictive criterion that for each team i, given the matches played, the expected number of points earnt is equal to the actual number of points earnt,

$$\sum_{j} m_{ij} \sum_{a,b} a\, p^{ij}_{a,b} = \sum_{j} \sum_{a,b} a\, m^{ij}_{a,b} \quad \text{for all } i, \quad (3)$$

where $m^{ij}_{a,b}$ represents the number of matches which result in i being awarded a points and j being awarded b. The entropy, S(p), is strictly concave and so the Lagrangian has a unique maximum. With $\lambda_{ij}$ being the Lagrange multiplier associated with teams i, j in condition (2), and $\lambda_i$ those for the retrodictive criterion applied to team i from condition (3), then the solution satisfies

$$\log p^{ij}_{a,b} = -\lambda_{ij} - 1 - a\lambda_i - b\lambda_j,$$

which gives us that

$$p^{ij}_{a,b} \propto \pi_i^{a} \pi_j^{b},$$

where the $\pi_i = \exp(-\lambda_i)$ may be used to rank the teams, and $\exp(-\lambda_{ij} - 1)$ is the constant of proportionality. This result holds for i, j such that $m_{ij} > 0$ and a reasonable modelling assumption is that it may then be applied to all pairs (i, j).
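As a minimal sketch of the general form just derived, the following Python function normalises weights proportional to the product of each team's strength raised to the points it is awarded, over a finite outcome set. The function name, the outcome encoding and the example strengths are illustrative assumptions rather than anything specified in the paper.

```python
from typing import Dict, Iterable, Tuple

Outcome = Tuple[int, int]  # (points awarded to team i, points awarded to team j)


def outcome_probabilities(pi_i: float, pi_j: float,
                          outcomes: Iterable[Outcome]) -> Dict[Outcome, float]:
    """Normalise weights pi_i**a * pi_j**b over a finite set of outcomes (a, b)."""
    weights = {(a, b): pi_i ** a * pi_j ** b for a, b in outcomes}
    total = sum(weights.values())
    return {ab: w / total for ab, w in weights.items()}


# Illustrative check: with win/loss outcomes worth one point for a win,
# the general form reduces to the Bradley-Terry probability pi_i / (pi_i + pi_j).
probs = outcome_probabilities(2.0, 1.0, [(1, 0), (0, 1)])
```

With only the two outcomes (1, 0) and (0, 1), the probability of the first outcome is the familiar Bradley-Terry win probability, which gives a simple sanity check on the general form.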
This derivation presents the most general form of the maximum entropy model, but various specific models may be motivated in this way by imposing a variety of independence assumptions or additional conditions. Some of the main variants are considered next.

Conceptually one might consider that points awarded for result outcomes and the try bonus are for different and separable elements of performance within the predominant points system that is being considered here. If that were not the case then a stipulation similar to that imposed in Top14, which explicitly connects the try bonus and the result outcome, could be used. While the result outcome in rugby union is commonly presented as a standard win, draw, loss plus a losing bonus point, it may be thought of equivalently as five possible result outcomes: wide win, narrow win, draw, narrow loss, wide loss. This leads to a representation of the five non-normalised result probabilities as

P(team i beats team j by wide margin) $\propto \pi_i^4$
P(team i beats team j by narrow margin) $\propto \rho_n \pi_i^4 \pi_j$
P(team i draws with team j) $\propto \rho_d \pi_i^2 \pi_j^2$
P(team j beats team i by narrow margin) $\propto \rho_n \pi_i \pi_j^4$
P(team j beats team i by wide margin) $\propto \pi_j^4$,

where $\rho_n$ and $\rho_d$ are structural parameters related to the propensity for narrow or drawn result outcomes respectively. Taking the conventional standardisation of the abilities that the mean team strength is 1, as in Ford Jr (1957), then the probability of a narrow result outcome (win or loss) in a match between two teams of mean strength is $2\rho_n / (2 + 2\rho_n + \rho_d)$, and that for a draw outcome is $\rho_d / (2 + 2\rho_n + \rho_d)$.

A nice feature for this particular setting is that the try bonus point provides information on the relative strength of the teams, so that the network is more likely to be connected. It may thus supply differentiating information on team strength even where more than one team has a 100% winning record.
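The five result outcome probabilities above can be sketched in a few lines of Python. This is only an illustration: the function name, the outcome labels and the values used for the structural parameters are assumptions, not estimates from the paper.

```python
def result_probabilities(pi_i, pi_j, rho_n, rho_d):
    """Normalised probabilities of the five result outcomes for team i versus team j."""
    weights = {
        "wide_win_i": pi_i ** 4,
        "narrow_win_i": rho_n * pi_i ** 4 * pi_j,
        "draw": rho_d * pi_i ** 2 * pi_j ** 2,
        "narrow_win_j": rho_n * pi_i * pi_j ** 4,
        "wide_win_j": pi_j ** 4,
    }
    total = sum(weights.values())
    return {outcome: w / total for outcome, w in weights.items()}


# Two teams of mean strength (pi = 1) with illustrative structural parameters.
rho_n, rho_d = 0.5, 0.1
mean_match = result_probabilities(1.0, 1.0, rho_n, rho_d)
narrow = mean_match["narrow_win_i"] + mean_match["narrow_win_j"]
```

For two teams of mean strength the narrow-result probability computed this way agrees with the closed form 2ρ_n / (2 + 2ρ_n + ρ_d) quoted in the text, and the draw probability with ρ_d / (2 + 2ρ_n + ρ_d).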
There are four potential try bonus outcomes that may be modelled by the probabilities:

P(team i and team j both awarded try bonus point) $\propto \tau_b \pi_i \pi_j$
P(only team i awarded try bonus point) $\propto \pi_i$
P(only team j awarded try bonus point) $\propto \pi_j$
P(neither team awarded try bonus point) $\propto \tau_z$,

so that in a match between two teams of mean strength the probability of both being awarded a try bonus is $\tau_b / (2 + \tau_b + \tau_z)$ and that for neither team gaining a try bonus is $\tau_z / (2 + \tau_b + \tau_z)$. This model would be derived through a consideration of entropy maximisation by taking the result outcome and try bonus outcome as separable maximisations, but then enforcing that the $\pi_i$ are consistent. Each of the structural parameters may be derived by an appropriate additional condition. For example in the case of $\rho_d$, the relevant condition would be that, given the matches played, the expected number of draws is equal to the actual number of draws.

One might choose to make an even stronger independence assumption, that the probability of gaining a try bonus is solely dependent on a team's own strength and independent of that of the opposition. This has the advantage of greater parsimony. It may be expressed as

P(team i gains try bonus point) $\propto \tau \pi_i$
P(team i does not gain try bonus point) $\propto 1$,

where $\tau / (1 + \tau)$ is the probability that a team of mean strength gains a try bonus. This model would clearly not be appropriate to the southern hemisphere system where the try bonus is awarded for scoring three more tries than the opposition.

Alternatively, the try bonus could be conditioned on the result outcome. This of course would be necessary if modelling the points system employed in the Top14 for example, where only the winner is eligible for a try bonus. The conditioning could be done in a number of ways.
One could consider the five result outcomes noted already, or consider a simplifying aggregation, either into wide win, close result (an aggregation of narrow win, draw, and narrow loss), or wide loss; or win (an aggregation of narrow win and wide win), draw, or loss (narrow loss and wide loss).

It seems not unreasonable to consider that a team's ability to earn a try bonus point might be modelled as being dependent on its own attacking strength and the opposition's defensive strength and independent of its own defensive strength and the opposition's attacking strength. This may be captured by considering team strength to be the product of its offensive and defensive strength, $\pi_i = \omega_i \delta_i$, where we consider the probability of a team i scoring a try bonus in a match as proportional to their offensive strength parameter $\omega_i$, and the probability of them not conceding a try bonus as proportional to their defensive strength parameter $\delta_i$. Given $\pi_i$, only one further parameter per team need be defined and so the non-normalised try bonus outcome probabilities may be expressed as

P(team i and team j both awarded try bonus point) $\propto \omega_i \omega_j$
P(only team i awarded try bonus point) $\propto \omega_i \delta_i = \pi_i$
P(only team j awarded try bonus point) $\propto \omega_j \delta_j = \pi_j$
P(neither team awarded try bonus point) $\propto \delta_i \delta_j = \pi_i \pi_j / (\omega_i \omega_j)$.

Thus the model replaces the symmetric try bonus parameters $\tau_b$ and $\tau_z$ with team-dependent parameters. This may be derived from an entropy maximisation by considering the try bonus outcome independently from the result outcome. The familiar retrodictive criterion that for each team, the expected number of try bonus points scored is equal to the actual number of try bonus points scored, is then supplemented by a second criterion that for each team, the expected number of matches where no try bonus is conceded is equal to the actual number of matches where no try bonus is conceded. See Appendix for details.

Home advantage could be parametrised in several ways. Following the example of Davidson and Beaver (1977) and others, one possibility is to use a single parameter, for example by applying a scaling parameter to the home team and its reciprocal to the away team.
This may be derived via entropy maximisation with the inclusion of a condition that, given the matches played, the difference between the expected points awarded to home teams and away teams is equal to the actual difference. An alternative explored by Joe (1988) is to consider the home advantage of each team individually so that the rating for each team could be viewed as an aggregation of their separate home team and away team ratings. See Appendix for details. In Section 2.2, four different possible result and try-bonus models were presented, each representing different assumptions around the independence of these points as they related to team strengths. Additionally two possible home advantage models were discussed. The assumption of the independence of the result outcomes and try bonus of Section 2.2.1 was proposed based on the conceptual separation of result outcome and try bonus inherent in the most prevalent points system. Clearly, for a scenario such as that faced in the Top14 where try bonus is explicitly dependent on result outcome, a version of the dependent models presented in Section 2.2.3 would be required. However, for modelling based on the most prevalent points system, the introduction of so many additional structural parameters seems unwarranted compared to the greater interpretability of the independent model of Section 2.2.1. On the other hand, the more parsimonious model from taking the try bonus of the two teams as independent events requires only one less structural parameter. In work available in the Appendix, it was found to substantially and consistently have weaker predictive ability compared to the opposition-dependent try bonus model of Section 2.2.1 when tested against eight seasons of English Premiership rugby results. While predictive ability is not the primary requirement of the model, it was in this case considered a suitable arbiter, and so the opposition-dependent try bonus model of Section 2.2.1 is preferred. 
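To make the comparison between the two try bonus formulations concrete, the sketch below computes the four try bonus outcome probabilities under the opposition-dependent model and under the more parsimonious model in which each team's try bonus is an independent event. Function names and parameter values are illustrative assumptions only.

```python
def try_bonus_opposition_dependent(pi_i, pi_j, tau_b, tau_z):
    """Four try bonus outcomes with weights tau_b*pi_i*pi_j, pi_i, pi_j, tau_z."""
    weights = {
        "both": tau_b * pi_i * pi_j,
        "only_i": pi_i,
        "only_j": pi_j,
        "neither": tau_z,
    }
    total = sum(weights.values())
    return {outcome: w / total for outcome, w in weights.items()}


def try_bonus_independent(pi_i, pi_j, tau):
    """Each team gains a try bonus independently, with odds tau * pi."""
    p_i = tau * pi_i / (1.0 + tau * pi_i)
    p_j = tau * pi_j / (1.0 + tau * pi_j)
    return {
        "both": p_i * p_j,
        "only_i": p_i * (1.0 - p_j),
        "only_j": (1.0 - p_i) * p_j,
        "neither": (1.0 - p_i) * (1.0 - p_j),
    }


# Illustrative values for a match between two teams of mean strength.
dependent = try_bonus_opposition_dependent(1.0, 1.0, tau_b=0.2, tau_z=1.0)
independent = try_bonus_independent(1.0, 1.0, tau=0.5)
```

In the independent variant the probability that both teams gain a try bonus is forced to be the product of the two marginal probabilities, whereas the opposition-dependent variant lets the structural parameters move the 'both' and 'neither' outcomes separately.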
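Both the offensive-defensive model and the team specific home advantage model require an additional parameter for every team. There may be scenarios where this is desirable but given the sparse nature of fixtures in the Daily Mail Trophy they do not seem to be justified here, and so the combination of a single strength parameter for each team and a single home advantage parameter is chosen for the model in this case. For a match where i is the home team, and j the away team, the model for the result outcome may be expressed as

P(team i beats team j by wide margin) $\propto \kappa^4 \pi_i^4$
P(team i beats team j by narrow margin) $\propto \rho_n \kappa^3 \pi_i^4 \pi_j$
P(team i draws with team j) $\propto \rho_d \pi_i^2 \pi_j^2$
P(team j beats team i by narrow margin) $\propto \rho_n \pi_i \pi_j^4 / \kappa^3$
P(team j beats team i by wide margin) $\propto \pi_j^4 / \kappa^4$,

and for the try bonus point as

P(team i and team j both gain try bonus point) $\propto \tau_b \pi_i \pi_j$
P(only team i gains try bonus point) $\propto \kappa \pi_i$
P(only team j gains try bonus point) $\propto \pi_j / \kappa$
P(neither team gains try bonus point) $\propto \tau_z$,

where $\kappa$ is the home advantage parameter, and $\rho_n$ and $\rho_d$ are structural parameters related to the propensity for narrow or drawn result outcomes respectively as before.

In order to express the likelihood, additional notation is required. From now on, the paired ij notation will indicate the ordered pair where i is the home team and j the away team, unless explicitly stated otherwise. Let the frequency of each result outcome be represented as follows: $r^{ij}_{a,b}$ is the number of matches in which home team i gains a points and away team j gains b points from the result outcome, for $(a,b) \in \{(4,0), (4,1), (2,2), (1,4), (0,4)\}$, and $t^{ij}_{a,b}$ is the number of matches in which home team i gains a try bonus points and away team j gains b try bonus points, for $(a,b) \in \{(1,1), (1,0), (0,1), (0,0)\}$. Then define the number of points gained by team i,

$$p_i = \sum_j \left[ 4\left(r^{ij}_{4,0} + r^{ij}_{4,1} + r^{ji}_{0,4} + r^{ji}_{1,4}\right) + 2\left(r^{ij}_{2,2} + r^{ji}_{2,2}\right) + \left(r^{ji}_{4,1} + r^{ij}_{1,4}\right) + \left(t^{ij}_{1,1} + t^{ij}_{1,0} + t^{ji}_{1,1} + t^{ji}_{0,1}\right) \right],$$

and let $n = \sum_i \sum_j (r^{ij}_{4,1} + r^{ij}_{1,4})$ be the total number of narrow wins, $d = \sum_i \sum_j r^{ij}_{2,2}$ the total number of draws, $b = \sum_i \sum_j t^{ij}_{1,1}$ the total number of matches where both teams gain a try bonus, $z = \sum_i \sum_j t^{ij}_{0,0}$ the total number of matches where neither team gains a try bonus, and $h = \sum_i \sum_j \left[ 4(r^{ij}_{4,0} - r^{ij}_{0,4}) + 3(r^{ij}_{4,1} - r^{ij}_{1,4}) + (t^{ij}_{1,0} - t^{ij}_{0,1}) \right]$ the difference between the total points scored by home teams and away teams. Then the likelihood can be expressed as

$$L(\pi, \rho_n, \rho_d, \tau_b, \tau_z, \kappa \mid R, T) = \rho_n^{n}\, \rho_d^{d}\, \tau_b^{b}\, \tau_z^{z}\, \kappa^{h} \prod_i \pi_i^{p_i} \prod_i \prod_j \left( Z^{R}_{ij} Z^{T}_{ij} \right)^{-m_{ij}},$$

where R, T are the information from the result and try outcomes respectively, $m_{ij}$ is the number of matches with i at home to j, and $Z^{R}_{ij}$ and $Z^{T}_{ij}$ denote the sums of the five non-normalised result probabilities and the four non-normalised try bonus probabilities above. It is therefore the case that the statistic (p, n, d, b, z, h) is a sufficient statistic for $(\pi, \rho_n, \rho_d, \tau_b, \tau_z, \kappa)$. This gives a log-likelihood, up to a constant term, of

$$\ell = \sum_i p_i \log \pi_i + n \log \rho_n + d \log \rho_d + b \log \tau_b + z \log \tau_z + h \log \kappa - \sum_i \sum_j m_{ij} \left( \log Z^{R}_{ij} + \log Z^{T}_{ij} \right).$$

As the form of the log-likelihood suggests, and following Fienberg (1979), the estimation of the parameters may be simplified by using a log-linear model.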
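A minimal Python sketch of the log-likelihood for the chosen model is given below. The try bonus weights are those stated in the text; the result outcome weights with home advantage are an inferred form in which the exponent of κ equals the home team's points minus the away team's points, which matches the try bonus weights and the log-linear term 4α_i + α_j + β_n + 3η for a narrow home win. All function names, the match encoding and the parameter values are hypothetical.

```python
import math


def result_probs(pi_h, pi_a, rho_n, rho_d, kappa):
    """Result outcomes keyed by (home points, away points); inferred kappa exponents."""
    weights = {
        (4, 0): kappa ** 4 * pi_h ** 4,
        (4, 1): rho_n * kappa ** 3 * pi_h ** 4 * pi_a,
        (2, 2): rho_d * pi_h ** 2 * pi_a ** 2,
        (1, 4): rho_n * kappa ** -3 * pi_h * pi_a ** 4,
        (0, 4): kappa ** -4 * pi_a ** 4,
    }
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def try_probs(pi_h, pi_a, tau_b, tau_z, kappa):
    """Try bonus outcomes keyed by (home bonus, away bonus)."""
    weights = {
        (1, 1): tau_b * pi_h * pi_a,
        (1, 0): kappa * pi_h,
        (0, 1): pi_a / kappa,
        (0, 0): tau_z,
    }
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def log_likelihood(matches, pi, rho_n, rho_d, tau_b, tau_z, kappa):
    """Sum of log-probabilities over matches (home, away, result, try_bonus)."""
    total = 0.0
    for home, away, result, tries in matches:
        total += math.log(result_probs(pi[home], pi[away], rho_n, rho_d, kappa)[result])
        total += math.log(try_probs(pi[home], pi[away], tau_b, tau_z, kappa)[tries])
    return total


# One illustrative match: A beats B at home by a wide margin and alone gains a try bonus.
strengths = {"A": 1.0, "B": 1.0}
example = [("A", "B", (4, 0), (1, 0))]
ll = log_likelihood(example, strengths, rho_n=0.5, rho_d=0.1, tau_b=0.2, tau_z=1.0, kappa=1.0)
```

In practice such a function would be passed to a numerical optimiser, or replaced by the log-linear fit described in the text; the sketch is intended only to show how the outcome weights combine.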
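Let $y_{ijkl}$ denote the observed count for the number of matches with home team i, away team j, result outcome k, and try bonus outcome l. Furthermore let $\mu_{ijkl}$ be the expected value corresponding to $y_{ijkl}$. The log-linear version of the model can then be written as

$$\log \mu_{ijkl} = \theta_{ij} + \theta_{ijk\cdot} + \theta_{ij\cdot l},$$

where $\theta_{ij}$ is a normalisation parameter, and $\theta_{ijk\cdot}$ and $\theta_{ij\cdot l}$ represent those parts due to the result outcome and try outcome respectively. That is

$\theta_{ijk\cdot} = 4\alpha_i + 4\eta$ if home win by wide margin
$\theta_{ijk\cdot} = 4\alpha_i + \alpha_j + \beta_n + 3\eta$ if home win by narrow margin
$\theta_{ijk\cdot} = 2\alpha_i + 2\alpha_j + \beta_d$ if draw
$\theta_{ijk\cdot} = \alpha_i + 4\alpha_j + \beta_n - 3\eta$ if away win by narrow margin
$\theta_{ijk\cdot} = 4\alpha_j - 4\eta$ if away win by wide margin

and

$\theta_{ij\cdot l} = \alpha_i + \alpha_j + \gamma_b$ if both teams gain try bonus
$\theta_{ij\cdot l} = \alpha_i + \eta$ if only home team gains try bonus
$\theta_{ij\cdot l} = \alpha_j - \eta$ if only away team gains try bonus
$\theta_{ij\cdot l} = \gamma_z$ if neither team gains try bonus,

with $\alpha_i = \log \pi_i$, $\beta_n = \log \rho_n$, $\beta_d = \log \rho_d$, $\gamma_b = \log \tau_b$, $\gamma_z = \log \tau_z$ and $\eta = \log \kappa$.

The gnm package in R (Turner and Firth, 2020) is used to give maximum likelihood estimates for $(\alpha, \beta_n, \beta_d, \gamma_b, \gamma_z, \eta)$ and thus for our required parameter set $(\pi, \rho_n, \rho_d, \tau_b, \tau_z, \kappa)$. An advantage of gnm for this purpose is that it facilitates efficient elimination of the 'nuisance' parameters $\theta_{ij}$ that are present in this log-linear representation.

If modelling the try bonus dependent on the result outcome then $\theta_{ijkl}$ would not be separated into the independent parts $\theta_{ijk\cdot}$ and $\theta_{ij\cdot l}$, and $\theta_{ijkl}$ would need to be specified for each result-try outcome combination. This would be the case for example in modelling the Top14 tournament. There would be some simplification in that case however, given that, conditional on the result outcome there are only two try bonus outcomes, namely, winning team gains try bonus, and winning team does not gain try bonus.

Once the parameters have been estimated, they can be used to compute the outcome probabilities. This allows for a calculation of the projected points per match for team i, $\mathrm{PPPM}_i$, by averaging the expected points per match were team i to play each of the other teams in the tournament twice, once at home and once away:

$$\mathrm{PPPM}_i = \frac{1}{2(N-1)} \sum_{j \neq i} \sum_{a,b} \left( a\, p^{ij}_{a,b} + b\, p^{ji}_{a,b} \right),$$

where N is the number of teams in the tournament and $p^{ij}_{a,b}$ now denotes specifically the probability that i as the home team gains a points and j as the away team gains b points.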
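The projected points per match calculation can be sketched as follows. Expected league points are the sum of expected result points and expected try bonus points, averaged over a notional home and away match against every other team. The result outcome weights with home advantage are an inferred form (exponent of κ equal to home points minus away points), and all names and parameter values are illustrative assumptions.

```python
def expected_points(pi_h, pi_a, rho_n, rho_d, tau_b, tau_z, kappa):
    """Return (home, away) expected league points for a single match."""
    result = {
        (4, 0): kappa ** 4 * pi_h ** 4,
        (4, 1): rho_n * kappa ** 3 * pi_h ** 4 * pi_a,
        (2, 2): rho_d * pi_h ** 2 * pi_a ** 2,
        (1, 4): rho_n * kappa ** -3 * pi_h * pi_a ** 4,
        (0, 4): kappa ** -4 * pi_a ** 4,
    }
    tries = {
        (1, 1): tau_b * pi_h * pi_a,
        (1, 0): kappa * pi_h,
        (0, 1): pi_a / kappa,
        (0, 0): tau_z,
    }
    home = away = 0.0
    for weights in (result, tries):
        total = sum(weights.values())
        for (a, b), w in weights.items():
            home += a * w / total
            away += b * w / total
    return home, away


def pppm(team, pi, **params):
    """Average expected points for `team` over home and away matches against all others."""
    others = [t for t in pi if t != team]
    points = 0.0
    for other in others:
        points += expected_points(pi[team], pi[other], **params)[0]  # team at home
        points += expected_points(pi[other], pi[team], **params)[1]  # team away
    return points / (2 * len(others))


# Illustrative strengths and structural parameters (not estimates from the paper).
params = dict(rho_n=0.5, rho_d=0.1, tau_b=0.2, tau_z=1.0, kappa=1.0)
strengths = {"A": 2.0, "B": 1.0, "C": 0.5}
ratings = {team: pppm(team, strengths, **params) for team in strengths}
```

With these illustrative inputs the ordering of the projected points per match follows the ordering of the strengths, in line with the monotonicity property noted in the text.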
It may readily be shown that the derivative of PPPM i with respect to π i is strictly positive so a team ranked higher based on strength π i will also be ranked higher based on projected points per match PPPM i and vice versa. Thus PPPM i may be used as an alternative, more intuitive, rating. One potential criticism of the model proposed so far is that it gives no additional credit to a team that has achieved their results against a large number of opponents as compared to a team that has played only a small number. This is an intuitive idea in line with those discussed by Efron and Morris (1977) in the context of shrinkage with respect to strength evaluation in sport. An obvious way to address such a concern in the context of the model considered in this paper is to apply a prior distribution to the team strength parameters. According to Schlobotnik (2018) , this is an idea considered by Butler in the development of the KRACH model. In some scenarios, one might consider applying asymmetric priors based on, for example, previous seasons' results. This may be appropriate if one were seeking to use the model to predict outcomes, for example. Even then, given the large variation in team strength that can exist from one season to the next in, for example, a schools environment, where there is enforced turnover of players, then the use of a strong asymmetric prior may not be advisable. In the context of computing official rankings, it would seem more reasonable as a matter of fairness to instead apply a symmetric prior so that rating is based solely on the current season's results. This may be achieved through the consideration of a dummy 'team 0', against which each team plays two notional matches with binary outcome. From one match they 'win' and gain a point and from the other they 'lose' and gain nothing. Recalling that p i represents the total points gained by team i, this adds the same value to each team's points. 
The influence of this may then be controlled by weighting this prior. As the prior weight increases, the proportion of p i due to the prior increases. Including a prior has two main advantages in this setting. One is that it ensures that the set of teams is connected so that a ranking may be produced after even a small number of matches. The second is that it provides one method of ensuring that there is a finite mean for the team strength parameters, which in turn enables the reinterpretation of the structural parameters as the more intuitive probabilities that were originally introduced in Section 2.2. In the three scenarios of varying schedule strength highlighted in Section 1 that related to professional club teams, the ranking is unlikely to be sensitive to the choice of prior weight, since at any given point in the season teams are likely to have played a similar number of matches or to have played sufficiently many such that the prior will not be a large factor in discriminating between teams. Indeed when estimating rankings mid-season for a round robin tournament there may be a preference not to include a prior so that the estimation of PPPM i is in line with the actual end-of-season PPPM i , without any adjustments being required. In the context of schools ranking, and the Daily Mail Trophy in particular, this is not the case, with teams playing between five and thirteen matches as part of the tournament in any given season. One could consider selecting the weight of the prior based on how accurately early-season PPPM i using different prior weights predicts end-of-season PPPM i . However there are some practical challenges to this that are discussed further in Section 4.3. Perhaps more fundamentally however the determination of the weight of the prior to be used may be argued to not be a statistical one but rather one of fairness. 
Its effect is to favour either teams with limited but proportionately better records or teams with longer but proportionately worse records. For example, should a 5-0 record (five wins and no losses) be preferred to a 9-1 record against equivalent opposition, or a 6-1 record preferred to a 10-2 record? This is a matter for tournament stakeholders and will be discussed further in the context of the Daily Mail Trophy in Section 4.

Choosing to constrain the team strength parameters by ascribing a mean strength of one is desirable as it allows for an intuitive meaning to be asserted from the structural parameters in the model. This could be done in a number of ways, two of which are discussed here. One way would be to fit the model with no constraint and afterwards apply a scaling factor to achieve an arithmetic mean of 1. That is, let $\mu$ be the arithmetic mean of the abilities $\pi_i$ derived from the model,

$$\mu = \frac{1}{N} \sum_{i=1}^{N} \pi_i,$$

where N is the number of teams. Then by setting $\pi'_i = \pi_i / \mu$ a mean team strength of 1 for the $\pi'_i$ is ensured.

Alternatively we might motivate an alternative mean by considering the strength of the prior. Consider the projected points per match for a dummy 'team 0' that achieves one 'win' and one 'loss' against each other team in the tournament, as described in Section 3.3. If zero points are awarded for a 'loss', and, without loss of generality, one point is awarded for a 'win', and there is assumed to be no home advantage and no bonuses, then

$$\mathrm{PPPM}_0 = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi_0}{\pi_0 + \pi_i}.$$

The strength of team 0, $\pi_0$, may be selected to take any value, since it is not a real participant in the tournament and so it may be set arbitrarily to $\pi_0 = 1$. Intuitively since it has an equal winning and losing record against every team one might expect it to be the mean team and therefore have a strength of one. More formally we are setting

$$\frac{1}{2} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{1 + \pi_i},$$

and so rearranging gives

$$1 = \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{1 + \pi_i} \right)^{-1} - 1,$$

and by defining a generalised mean as the function on the right hand side of this equation the required mean of one for the team strength parameters is returned.
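A short sketch of this generalised mean follows. It assumes the form implied by setting the dummy team's projected points per match to one half with a dummy strength of 1 under a Bradley-Terry win probability, namely the reciprocal of the average of 1 / (1 + π_i), minus one. The function name and example strengths are illustrative.

```python
import math


def generalised_mean(strengths):
    """Reciprocal of the mean of 1 / (1 + pi_i), minus one.

    An infinite strength contributes zero to the sum, so the result stays
    finite provided at least one strength is finite.
    """
    inverse_terms = [1.0 / (1.0 + s) for s in strengths]
    return len(inverse_terms) / sum(inverse_terms) - 1.0


# Illustrative strengths: one team with a perfect record has infinite strength.
finite_case = generalised_mean([1.0, 1.0, 1.0])
with_perfect_record = generalised_mean([math.inf, 1.0, 1.0])
```

The second example shows why this mean is convenient when a team has achieved full points: the arithmetic mean of the same strengths would be infinite, whereas the generalised mean remains a finite number.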
While the prior has been used here to give this generalised mean an intuitive interpretation, it may be applied even without choosing to use a prior. As such it could be particularly beneficial in the context of a tournament such as the Daily Mail Trophy, because it is quite possible that a team will have achieved full points and so the estimated team strength parameter π i may be infinite. If this were the case then it would not be possible to achieve a mean of 1 using, for example, an arithmetic mean. This in turn would mean that some of the structural parameters would be undefined also and so one could no longer make the intuitive interpretations around propensity for draws or narrow results based on those structural parameters. The generalised mean defined here, on the other hand, is always finite and is therefore used in the analysis below. The data for the Daily Mail Trophy have been kindly supplied by www.schoolsrugby.co.uk, the organisation that administers the competition. The match results are entered by the schools themselves. The score is entered, and this is used to suggest a number of tries for each team which can then be amended. These inputs are not subject to any formal verification. This might suggest that data quality, especially as it relates to number of tries, may not be reliable. However the league tables are looked at keenly by players, coaches and parents, and corrections made where errors are found, and so data quality, especially at the top end of the table, is thought to be good. This analysis uses results from the three seasons 2015/16 to 2017/18. Over this period there were 24 examples of inconsistencies or incompleteness found in the results that required assumptions to be made. All assumptions were checked with SOCS. Full details of these are given in the Appendix. The results are summarised in Figure 1 . In order to provide a comparison, they are plotted above those for the English Premiership for the same season. 
In comparison to the English Premiership result outcomes, there is a reduced home advantage and a reduced prevalence of narrow results, though the overall pattern of a higher proportion of wide than narrow results, and a low prevalence of draws is maintained. With respect to the try bonus outcomes, the notable difference is the higher prevalence of both teams gaining a try bonus in the Premiership. In the context of this model, calibration consists of two parts: a determination of the value of the structural parameters (the model parameters not related to a particular team); and a determination of the weight of the prior. One approach to the structural parameters would be to allow them to be determined each season. However, it would seem clear that, in regard to the structural parameters, data from proximate seasons is relevant to an assessment of their value in the current season. For example, one would not really expect the probability of a draw between two equally matched teams to change appreciably from season to season and so data on that should be aggregated across seasons in order to produce a more reliable estimate for the parameters. As can be seen from Table 1 the range for each was not large. It was also found that varying the parameters used within that range did not materially impact ratings under the model. The structural parameters are therefore fixed at the mean of the three seasons' estimated values. An intuitive way to interpret these is by calculating, based on these parameters, the probability of specific outcomes for a match between two teams of mean strength. For example it can then be determined that under the model, in such a match, the probability of a wide result is 65%, of both teams gaining a try bonus only 1%, and perhaps most notably that the home team is 2.2 times as likely to win as the away team. 
As previously mentioned there is limited scope with the Daily Mail Trophy data to compare the prior weights based on their predictive capabilities, since it is not a round robin format. One could look at an earlier state in the tournament and compare to a later state where more information has become available, but such an approach is limited by the number of matches that teams play (many play only five in total), by only having three seasons' worth of data on which to base it, and by the fact that even in the later more informed state the estimation of the team strength will be defined by the same model. Therefore no analysis of this kind is performed. As discussed in Section 3.3 the main aim of the use of a non-negligible prior is to reasonably account for differences in the number of matches that teams have played.

Looking at Figure 2 and comparing to the information in Table 2 it can be seen that, as the prior weight is increased, teams who have played fewer matches generally move lower, most notably Kingswood, and those who have played more move higher, most notably Sedbergh. This is not uniformly true with, for example, St Peter's moving higher despite having played relatively few matches and having a lower league points per match than either Kingswood or Northampton, who they overtake when prior weight is set to 8. Of course while the general pattern is clear and expected, the question of interest is what absolute size for the prior should be chosen. It seems reasonable to state that a team with a 100% winning record from four matches should not generally be ranked higher than a team with a 100% winning record from eight matches, assuming their schedule strength is not notably different. It certainly seems undesirable that all of the six other teams with 100% winning records below Kingswood should be ranked lower than them, which would imply a prior weight of at least one and more likely 4 or higher. Results for the other two seasons are included in the Appendix.
Considerations and comparisons in line with those above were made across the three seasons. A reasonable case could be made for prior weights between 2 and 8, and ultimately it is a decision that should be made by the stakeholders of the tournament with regard to their view on the relative merit of a shorter, more perfect record as compared to a longer but less perfect record. For the purposes of further analysis here a prior weight of 4 was chosen. The model may then be used to assess the current ranking method used in the Daily Mail Trophy. As can be seen in Figure 3, there is at least broad agreement between the two measures. However, this is not a particularly helpful way to look at the quality of the Daily Mail Trophy method, as this agreement can be ascribed largely to the base scoring rule of league points per match (LPPM), which both methods essentially have in common. What is of more interest is the effectiveness of the adjustment made for schedule strength. This is shown in Figure 4. Here clear differences can be seen and there is a low correlation between the measures. Not surprisingly, some of the teams who perform well in the Daily Mail Trophy rankings seem to be those that are benefiting most from these differences, with Wellington College in particular, winner of the Daily Mail Trophy in two of the three seasons, being a serial outlier in this regard. While this is concerning in its own right, the requirements on the measure are related almost solely to the ranking that it produces, rather than the rating. Figure 5 looks at that. Here considerable differences are seen between the rankings produced by the two different methods. In order to focus more clearly on this aspect, the difference in ranking under the two methods is plotted against the Daily Mail Trophy rank in Figure 6.
considered then a disproportionately positive (and negative) impact from the Daily Mail Trophy method is seen.
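The kind of rank comparison behind Figures 5 and 6 can be sketched in a few lines. The example below is illustrative only: the team labels and scores are invented, the function names are hypothetical, ties are broken alphabetically, and the sign convention (tournament rank minus model rank) is an assumption rather than the one used in the figures.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank teams from 1 (best) by descending score; ties broken by team name."""
    ordered = sorted(scores, key=lambda team: (-scores[team], team))
    return {team: position for position, team in enumerate(ordered, start=1)}


def rank_differences(tournament: Dict[str, float],
                     model: Dict[str, float]) -> Dict[str, int]:
    """Tournament rank minus model rank for each team.

    A negative value means the tournament method places the team higher
    (closer to rank 1) than the model does.
    """
    tournament_rank, model_rank = ranks(tournament), ranks(model)
    return {team: tournament_rank[team] - model_rank[team] for team in tournament_rank}


# Invented scores: tournament Merit Points and model projected points per match.
merit_points = {"A": 6.1, "B": 5.8, "C": 5.2, "D": 4.9}
projected_ppm = {"A": 3.9, "B": 4.3, "C": 4.4, "D": 3.1}

diffs = rank_differences(merit_points, projected_ppm)
mean_abs_diff = sum(abs(d) for d in diffs.values()) / len(diffs)
```

Plotting `diffs` against the tournament rank gives a small-scale analogue of the comparison described above, and `mean_abs_diff` summarises the disagreement across all teams.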
What is perhaps more notable is the size of some of these rank differences, up to 28 places in a tournament of approximately one hundred teams. Looked at across the population, the mean absolute difference in rank is approximately eight places. Looking at the typical difference in points between two teams eight places apart, this is worth approximately 0.4 points per match. It seems reasonable therefore to say that over the general population of teams there is scope for improvement in the Daily Mail Trophy method in its approach to adjusting for schedule strength. Given the nature of a tournament where there is a winner but no relegation, there is a natural focus on the top end of the ranking. Comparisons of the rankings for the top ten teams under the
In particular, in 2015/16 Wellington College, who were the winners of the tournament, are ranked seventh under the model and were a full 0.68 projected points per match behind the leader.
In its most general application the model presented here allows for a ready extension of the well-known Bradley-Terry model to a system of pairwise comparisons where each comparison may result in any finite number of scored outcomes. For example, the model could be adapted to a situation where judges are asked to assign pairwise preferences on the seven-category symmetric scale made up of 'strongly prefer', 'prefer', 'mildly prefer', 'neutral', etc., if one were prepared to assign score values to each. The maximum entropy derivation provides a principled basis for a family of models. The application of entropy maximisation to motivate these models also helps to clarify the various assumptions and considerations that are essential to each. In the more particular implementation for rugby union, the family of models provided a method for assessing teams in situations where schedule strengths vary, in a way that is consistent with the points norm of the sport.
Within that family, different models may be suitable depending on the try bonus stipulations of the tournament, the density of matches, and the similarity of the number of fixtures played across teams. In the investigation of the Daily Mail Trophy the model studied here proved to be a useful tool in highlighting concerns about the ranking method that is currently used. It may be tempting to advocate its use directly as a superior method for evaluating performance in that tournament. However, a key element that it lacks for a wider audience is transparency, for example as represented in the ability of stakeholders to calculate their rating, to understand the impact of winning or losing in a particular match, and to evaluate what rating differences between themselves and similarly ranked teams mean in terms of how rankings would change given particular results. The strength of the model in accounting for all results in the rating of each team is, in this sense, also a weakness for wider application. But even if transparency of method is seen as a dominating requirement, the model may still be useful as a means by which alternative, more transparent methods can be assessed.
6 Appendix
6.1 Maximum entropy derivations
The offensive-defensive strength model assumes independence of the result and try outcomes. The maximum entropy derivation is thus related solely to the try outcome and is linked to the result outcome by the assumption that for each team the overall strength parameter is equal to the product of the offensive and defensive parameters. The same notation may be used as in the general derivation, though in this case $a, b \in \{0, 1\}$. Entropy is defined as before as and we have the familiar condition that for each pair of teams the sum of the probabilities of all possible outcomes is 1, Then, for all $i, j$ such that $m_{ij} > 0$, the solution satisfies which gives us that where $\omega_i = \exp(-\lambda_i)$, $\delta_i = \exp(-\lambda_i)$, and we take $\pi_i = \omega_i \delta_i$.
In order to identify the home team, let the ordered pair $ij$ now denote $i$ as the home team and $j$ as the away team. Then under this amended notation, define entropy as before and we have the familiar condition that for each pair of teams the sum of the probabilities of all possible outcomes is 1, The retrodictive criterion is now altered to reflect the new notation, But now we also have a condition that says that the expected difference between the number of home points and the number of away points is equal to the actual difference, Then, for all $i, j$ such that $m_{ij} > 0$, the solution satisfies which gives us that where $\kappa = \exp(-\lambda_0)$, $\pi_i = \exp(-\lambda_i)$, and the constant of proportionality is $\exp(-\lambda_{ij} - 1)$.
Using the same notation, define entropy in the now familiar way and we have the familiar condition that for each pair of teams the sum of the probabilities of all possible outcomes is 1, The retrodictive criteria are now split into home and away parts, so that we have that, for all teams, the expected number of home points gained is equal to the actual number of home points gained, and that, for all teams, the expected number of away points gained is equal to the actual number of away points gained, Then, for all $i, j$ such that $m_{ij} > 0$, the solution satisfies where ${}^{H}\lambda_i$ and ${}^{A}\lambda_j$ are the Lagrangian multipliers relating to the home and away criteria respectively. This gives where the strength parameters ${}^{H}\pi_i = \exp(-{}^{H}\lambda_i)$ and ${}^{A}\pi_j = \exp(-{}^{A}\lambda_j)$ denote the home and away strengths of $i$ and $j$ respectively.
While there were no means to validate the data independently, there were 24 occasions of identifiable self-inconsistencies or incompleteness in the data across the three seasons of interest, 15 of which impacted the result or try outcomes for at least one of the teams involved. The treatment of all of these is described below. They were checked for reasonableness with SOCS, the administrator for the tournament.
1. Where the score could not have produced the try outcome. Since a try is worth five points in rugby union, the score of any team may not be less than five times their number of tries. If swapping the number of tries recorded for home and away teams produced consistency then this was done. If this did not resolve the issue then the number of tries was adjusted down to the maximum number of tries possible given the score.
3. Where matches were entered as a win for one side but score and tries were both given as 0-0. On speaking to SOCS, their speculation was that these may have related to matches where there had been some sort of 'gentleman's agreement', e.g. the teams had agreed to deselect certain players (in particular those with representative honours), and the recording of the match was a means of recognising that a fixture had taken place, but not giving it full status. In our analysis, the winning team is awarded four points for a win, the losing team one for a narrow loss, and no try bonus is awarded to either side. There were two such results in 2017/18, and two in 2016/17.
4. Where the try count was blank for one of the two teams, the number of tries was taken to be the maximum number of tries possible given the score. There was one case of this in 2016/17.
Where the result outcome (Won, Draw, Loss) did not agree with the score but did agree with the try outcome, and became consistent if the score were reversed, the score was reversed. There was one case of this in 2017/18. This did not impact the analysis.
Currently the ranking is based on Merit Points, which are defined as the average number of League
seasons Kingswood School and Stockport Grammar School respectively therefore did not appear in the final Daily Mail Trophy league table. This rule could continue to be used to deal with cases of teams playing low numbers of matches rather than relying on the prior to do the job entirely.
On the other hand one can credibly argue that a robust ranking model should be able to deal with all result outcomes without an arbitrary inclusion cut off. It is also reasonable to assert that there is still useful information from these teams for the calibration of the model, whether
It is not possible to say that either of these alternative rankings is definitively right in any of these three cases. In all these cases the projected points per match of the two teams remain very similar, and both alternatives would pass the sensible criterion that a ranking method should be such that all other relative rankings should not be perceivable as unreasonable by a large proportion of the tournament stakeholders.
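The try-count rules in the data-cleaning list earlier in this Appendix (items 1 and 4) are simple enough to state as code. The following is a minimal sketch of those rules as described in the text, not the authors' own implementation: the function names are hypothetical, a blank try count is represented by `None`, and blank counts are filled before the consistency check is applied.

```python
from typing import Optional, Tuple

TRY_POINTS = 5  # a try is worth five points in rugby union


def max_tries(score: int) -> int:
    """Largest number of tries that a given score could contain."""
    return score // TRY_POINTS


def is_consistent(score: int, tries: int) -> bool:
    """A score may not be less than five times the number of tries."""
    return tries <= max_tries(score)


def reconcile_tries(home_score: int, away_score: int,
                    home_tries: Optional[int],
                    away_tries: Optional[int]) -> Tuple[int, int]:
    """Return (home_tries, away_tries) after applying the cleaning rules."""
    # Blank try count: take the maximum number of tries possible given the score.
    if home_tries is None:
        home_tries = max_tries(home_score)
    if away_tries is None:
        away_tries = max_tries(away_score)
    # Already consistent: nothing to change.
    if is_consistent(home_score, home_tries) and is_consistent(away_score, away_tries):
        return home_tries, away_tries
    # If swapping the recorded home and away counts gives consistency, swap them.
    if is_consistent(home_score, away_tries) and is_consistent(away_score, home_tries):
        return away_tries, home_tries
    # Otherwise adjust each count down to the maximum possible given the score.
    return (min(home_tries, max_tries(home_score)),
            min(away_tries, max_tries(away_score)))
```

For example, a recorded 10-25 score with tries entered as 4 and 1 becomes consistent once the counts are swapped, whereas a 7-12 score with three tries recorded for each side has to be capped at one and two tries respectively.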