**Executive Summary**

- The principal advantage of a statistical approach to player ratings is to ensure that information on performance is used in a consistent way.
- However, there are numerous difficulties in using statistical techniques such as regression analysis to estimate the weightings needed to combine performance metrics into a single player rating.
- Research in decision science shows that there is little or no gain from using sophisticated statistical techniques to estimate weightings; equal weights work just as well in most cases.
- I recommend a simple approach to player ratings in which performance metrics are standardised using Z-scores and then added together (or subtracted in the case of negative contributions) to yield a player rating that can then be rescaled for presentational purposes.

The basic analytical problem in contributions-based player ratings, particularly in the invasion-territorial team sports, is how to reduce a multivariate set of performance metrics to a single composite index. A purely statistical approach combines the performance metrics using weightings derived from a team-level win-contributions model of the relationship between the performance metrics and match outcomes, with these weightings usually estimated by regression analysis. But, as I have discussed in previous posts, so many estimation problems arise with win-contributions models that I seriously question whether a purely statistical approach to player ratings is viable. Those who have tried to produce player ratings based on win-contributions models in the invasion-territorial team sports have usually ended up adopting a “mixed-methods” approach in which expert judgment plays a significant role in determining how the performance metrics are combined. The resulting player ratings may be more credible but can lack transparency, and so have little practical value for decision makers.

Decision science can provide some useful insights to help resolve these problems. In particular, there is a large body of research on the relative merits of expert judgment and statistical analysis as the basis for decisions in complex (i.e. multivariate) contexts. The research goes back at least to Paul Meehl’s book, *Clinical versus Statistical Prediction*, published in 1954. Meehl subsequently described it as “my disturbing little book”; in it he reviewed 20 studies in a wide range of areas, not just clinical settings, and found that statistical analysis provided predictions at least as good as expert judgment in every case, and more accurate predictions in most. More than 30 years later Dawes reviewed the research instigated by Meehl’s findings and concluded that “the finding that linear combination is superior to global judgment is strong; it has been replicated in diverse contexts, and no exception has been discovered”. More recently, the Nobel laureate Daniel Kahneman, in his best-selling book, *Thinking, Fast and Slow*, surveyed around 200 studies and found that 60% showed statistically-based algorithms produced more accurate predictions, with the rest of the studies showing algorithms to be as good as experts. There is a remarkable consistency in these research findings, unparalleled elsewhere in the social sciences, yet the results have been ignored for the most part, so that in practice confidence in the superiority of expert judgment remains largely undiminished.

What does this tell us about decision making? Decisions always involve prediction about uncertain future outcomes since we choose a course of action with no certainty over what will actually happen. We know the past but decide the future. We try to recruit players to improve future team performance using information on the player’s current and past performance levels. What decision science has found is that experts are very knowledgeable about the factors that will influence future outcomes but experts, like the rest of us, are no better, and indeed often worse, when it comes to making consistent comparisons between alternatives in a multivariate setting. Decision science shows that human beings tend to be very inconsistent, focusing attention on a small number of specific aspects of one alternative but then often focusing on different specific aspects of another alternative, and so on. Paradoxically, experts are particularly prone to inconsistency in the comparison of alternatives because of their depth of knowledge of each alternative. Statistically-based algorithms guarantee consistency: all alternatives are compared using the same metrics and the same weightings. The implication for player ratings is very clear. Use the expert judgment of coaches and scouts to identify the key performance metrics but rely on statistical analysis to construct an algorithm (i.e. a player rating system) to produce consistent comparisons between players.

So far so good, but this still does not resolve the statistical estimation problems involved in using regression analysis to determine the weightings to be used. However, decision science offers an important insight in this respect as well. Back in the 1970s Dawes undertook a comparison of the predictive accuracy of proper and improper linear models. By a proper linear model he meant a model in which the weights were estimated using statistical methods such as multiple regression. In contrast, improper linear models use weightings determined non-statistically, such as equal-weights models in which it is simply assumed that every factor has the same importance. Dawes traces the equal-weights approach back to Benjamin Franklin, who adopted a very simple method for deciding between different courses of action. Franklin’s “prudential algebra” was simply to count up the number of reasons for a particular course of action, subtract the number of reasons against, and then choose the course of action with the highest net score. It is very simple but consistent and transparent, with a crucial role for expert judgment in identifying the reasons for and against a particular course of action. Using 20,000 simulations, Dawes found that equal weightings performed better than statistically-based weightings (and even randomly generated weightings worked almost as well). The conclusion is that it is consistency that really matters, more so than the particular set of weightings used. As well as ensuring consistency, an equal-weights approach avoids all the statistical estimation problems. Equal weights are also more likely to provide a method of general application that avoids the problem of overfitting, i.e. weightings that are very specific to the sample and model formulation.
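Dawes’s point is easy to demonstrate with a toy simulation of my own (this is an illustration, not his actual study): generate an outcome driven by several standardised metrics with unequal true weights, estimate “proper” weights by least squares on a small training sample, and compare out-of-sample predictive accuracy against simple unit weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 standardised metrics, all positively related to the
# outcome, with made-up (unequal) true weights.
n_train, n_test, k = 50, 500, 5
true_w = np.array([0.9, 0.7, 0.5, 0.3, 0.1])

def simulate(n):
    X = rng.standard_normal((n, k))          # standardised metrics
    y = X @ true_w + rng.standard_normal(n)  # noisy outcome
    return X, y

X_tr, y_tr = simulate(n_train)
X_te, y_te = simulate(n_test)

# "Proper" linear model: weights estimated by least squares on training data
w_ols = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

r_ols = corr(X_te @ w_ols, y_te)        # fitted weights, out of sample
r_equal = corr(X_te.sum(axis=1), y_te)  # unit weights on standardised metrics
```

On runs like this the equal-weights predictor typically comes close to the regression-weighted one out of sample, even though the true weights are far from equal, which is the essence of Dawes’s finding.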

Applying these insights from decision science to the construction of player rating systems provides the justification for what I call a simple approach to player ratings. There are five steps:

- Identify an appropriate set of performance metrics involving the expert judgment of GMs, sporting directors, coaches and scouts
- Standardise the performance metrics to ensure a common measurement scale – my suggested standardisation is to calculate Z-scores
  - Z-scores have been very widely used to standardise performance metrics with very different scales of measurement e.g. Z-scores have been used in golf to convert very different types of metrics such as driving distance (yards), accuracy (%) and number of putts into comparable measures that can be added together.
- Allocate weights of +1 to positive contributions and -1 to negative contributions (i.e. Franklin’s prudential algebra)
- Calculate the total Z-score for every player
- Rescale the total Z-scores to make them easier to read and interpret. I usually advise avoiding negative ratings and reducing the dependency on decimal places to differentiate players.
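The five steps can be sketched in a few lines of code. The metric names and values below are made up for illustration; they are not taken from the Championship dataset, and the rescaling constants (mean 100, standard deviation 30) are just one possible presentational choice.

```python
import numpy as np

# Hypothetical per-player season totals for three metrics (illustrative only)
metrics = {
    "goals":               np.array([4, 1, 0, 2, 7]),
    "successful_passes":   np.array([310, 450, 520, 280, 390]),
    "unsuccessful_passes": np.array([60, 90, 40, 75, 55]),
}
# Franklin's prudential algebra: +1 for positive contributions, -1 for negative
signs = {"goals": +1, "successful_passes": +1, "unsuccessful_passes": -1}

def zscore(x):
    """Standardise a metric to mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std()

# Standardise each metric, apply the +/-1 weight, and sum across metrics
total_z = sum(signs[name] * zscore(vals) for name, vals in metrics.items())

# Rescale for presentation, e.g. to mean 100 and standard deviation 30
rating = 100 + 30 * (total_z - total_z.mean()) / total_z.std()
```

Because each Z-score has mean zero by construction, the rescaling only shifts and stretches the ratings; it preserves every player’s rank.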

I have applied the simple approach to produce player ratings for 535 outfield players in the English Championship covering the first 22 rounds of games in season 2015/16. I have used player totals for 16 metrics: goals scored, shots at goal, successful passes, unsuccessful passes, successful dribbles, unsuccessful dribbles, successful open-play crosses, unsuccessful open-play crosses, duels won, duels lost, blocks, interceptions, clearances, fouls conceded, yellow cards and red cards. The total Z-score for every player has been rescaled to yield a mean rating of 100 (and a range 5.1 – 234.2). Below I have reported the top 20 players.

| Player | Team | Player Rating |
| --- | --- | --- |
| Shackell, Jason | Derby County | 234.2 |
| Flint, Aden | Bristol City | 197.9 |
| Keogh, Richard | Derby County | 196.0 |
| Keane, Michael | Burnley | 195.7 |
| Morrison, Sean | Cardiff City | 193.8 |
| Duffy, Shane | Blackburn Rovers | 191.1 |
| Davies, Curtis | Hull City | 184.3 |
| Onuoha, Nedum | Queens Park Rangers | 183.2 |
| Morrison, Michael | Birmingham City | 179.1 |
| Duff, Michael | Burnley | 175.6 |
| Hanley, Grant | Blackburn Rovers | 175.2 |
| Tarkowski, James | Brentford | 171.1 |
| McShane, Paul | Reading | 169.8 |
| Collins, Danny | Rotherham United | 168.3 |
| Stephens, Dale | Brighton and Hove Albion | 167.4 |
| Lees, Tom | Sheffield Wednesday | 166.0 |
| Judge, Alan | Brentford | 164.4 |
| Blackman, Nick | Reading | 161.9 |
| Bamba, Sol | Leeds United | 160.1 |
| Dawson, Michael | Hull City | 159.7 |

I hasten to add that these player ratings are not intended to be definitive. As always, they are a starting point for evaluating the relative merits of players and should be considered alongside a detailed breakdown of the rating into its component metrics to identify the specific strengths and weaknesses of individual players. They should also be categorised by playing position and playing time, but those are discussions for future posts.

**Some Key Readings in Decision Science**

Meehl, P., *Clinical versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence*, Minneapolis: University of Minnesota Press, 1954.

Dawes, R. M., ‘The robust beauty of improper linear models in decision making’, *American Psychologist*, vol. 34 (1979), pp. 571–582.

Dawes, R. M., *Rational Choice in an Uncertain World*, San Diego: Harcourt Brace Jovanovich, 1988.

Kahneman, D., *Thinking, Fast and Slow*, London: Penguin Books, 2012.

I have a question about rescaling z-scores. Probably low level question. Do you just add or multiply (difference here is obviously important) all the total z-scores by the same number without losing anything? Also, can this be done to the individual z-scores for each statistic in order to make all the individual statistics positive scores for data visualization reasons?


1. Combining z-scores: once you have calculated the z-scores for the individual components of performance, just sum them together to get the total z-score. This implies giving the same weight to each component of performance. The evidence suggests that there is little to be gained from using a more complicated weighting system especially if you are using a number of individual components.

2. Visualising z-scores: I understand your concern with presenting and visualising raw z-scores as a performance metric. Negative scores often send the wrong message and, of course, in a normally distributed metric roughly 95% of the z-scores will be concentrated between -2.0 and +2.0. Hence I would only advise using z-scores for analytical purposes. When it comes to presenting and visualising performance metrics, I typically transform z-scores into a 0 – 100 scale. I do this, for example, in my ratings of team performances in Dutch soccer that I produce for AZ Alkmaar. I set the mean at 50 and then choose an appropriate scaling for the standard deviation so that 95%+ of the performance ratings lie in the interval 20 – 80.
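As a minimal sketch of that rescaling (the z-scores below are made up, and the choice of 15 for the standard deviation is just one way to get roughly 95% of a normally distributed metric inside 20 – 80):

```python
import numpy as np

# Illustrative raw z-scores (not from the AZ Alkmaar ratings)
z = np.array([-2.4, -1.1, 0.0, 0.6, 1.3, 2.1])

# Map to a 0-100 presentation scale: mean 50; a standard deviation of 15
# puts about 95% of a normal distribution inside 20-80 (i.e. 50 +/- 2*15)
scaled = 50 + 15 * z
scaled = np.clip(scaled, 0, 100)  # guard against rare extreme tail values
```

The same linear transformation can equally be applied to the z-score of each individual metric if you want all the component scores to be positive for visualisation; ranks and relative distances are unchanged.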

Thank you for your interest in my blog. I hope my reply to your query is helpful.
