Pythagorean Expected Wins Revisited

Executive Summary
• The Pythagorean expected-wins model provides a very simple predictor of team win% in baseball using squared scores
• Predictive accuracy can be improved by slightly adjusting the exponent (the Pythagorean parameter)
• The Pythagorean expected-wins model should be a key component in the strategic planning of the team roster and in evaluating the sporting “bottom line” of any player trade
• The Pythagorean expected-wins model can be applied to other team sports but typically requires different powers depending on the average number of scores and the average winning margin
• The Pythagorean expected-wins model can be applied in soccer with appropriate adjustments to allow for tied games, but goal difference remains the best simple score-based predictor of league performance
• Ultimately what is transferable from baseball across all pro team sports is not the specifics of the Pythagorean expected-wins model but rather the discipline of projecting the expected performance gains from any significant player recruitment decision

Pythagorean Expected Wins
Of all of Bill James’s many contributions to sabermetrics, probably the best known, particularly outside baseball, is his notion of Pythagorean win expectation. It is a very simple idea – league performance over a season will be closely associated with total scores made and total scores conceded. James’s innovative insight from his extensive study of baseball data was that team win% follows a power relationship between runs scored (RS) and runs allowed (RA):
Win% = RS^2 / (RS^2 + RA^2)

As an example, Pythagorean win expectation applied to the 2014 MLB regular season yields the following results:

Table 1: MLB Regular Season 2014

Blog 18.06 Graphic (Table 1)

Pythagorean win% predicts the actual win% pretty well, as I have tried to highlight using conditional formatting – the colour coding matches up closely, but with some important exceptions discussed below.

Using Expected Wins as a Strategic Tool
There are a couple of practical uses of Pythagorean expected wins. The first and by far the most important is that it provides a key relationship to be considered when planning changes to a team roster. Pythagorean expected wins can be used to project the likely impact on the team win% of a series of player trades. It is a great discipline for GMs and Personnel Directors to formalise exactly what they expect a new recruit to bring to the team. What is the sporting “bottom line” of any trade?
Suppose you were acting as an advisor to the Milwaukee Brewers at the end of the 2014 season. The 2014 Brewers were, statistically at least, the epitome of an average team. The MLB team average for total runs scored and allowed that season was 659; the Brewers scored 650 runs and allowed 657 runs, finishing with 82 wins and 80 losses to yield an actual win% of 0.506. The Pythagorean formula would have predicted 80 wins (i.e. Pythagorean win% = 0.495). If you had wanted to transform the 2014 Brewers into a team capable of competing for the World Series, you would have needed to target a regular-season win% of around 0.600, which represents 97 wins. To achieve this level of performance would have required an improvement in both hitting and pitching of approximately 10%, to 715 runs scored and 591 runs allowed.
Win% = 715^2 / (715^2 + 591^2) = 0.594 => 96 wins (= 0.594 x 162 games)
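For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the calculation. The function name and the coding of the equal-10%-improvement scenario are mine, purely for illustration.

```python
def pythag_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """James's Pythagorean win% using the classic squared-runs version."""
    return runs_scored ** exponent / (runs_scored ** exponent + runs_allowed ** exponent)

GAMES = 162

# 2014 Brewers: 650 runs scored, 657 runs allowed
baseline = pythag_win_pct(650, 657)
print(f"Baseline: win% = {baseline:.3f}, expected wins = {baseline * GAMES:.0f}")

# Scenario: 10% better hitting and pitching (roughly 715 scored, 591 allowed)
scenario = pythag_win_pct(650 * 1.10, 657 * 0.90)
print(f"Scenario: win% = {scenario:.3f}, expected wins = {scenario * GAMES:.0f}")
```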
Of course, an equal 10% improvement in hitting and pitching is just one scenario. If you look at the LA Angels, who had the best win% in the 2014 regular season at 0.605, this was achieved principally by their hitting strength with 773 runs scored (17.3% better than the MLB average). The Angels allowed 630 runs, which was only 4.4% better than the MLB average.
Achieving that magnitude of performance improvement is a tall order for any organisation and would require a strategic plan over a number of seasons involving player trading, draft picks, player development and financial planning. But the key point is that James’s formula helps formalise the task more precisely and provides a means of evaluating how alternative courses of action could contribute to the strategic goal of the organisation.

Benchmarking with Expected Wins: Altitude Effects in Denver, Loss of Form in Oakland
Another application of Pythagorean expected wins is as a useful benchmarking device to identify large anomalies between actual win% and predicted win%. Two such anomalies stand out in the 2014 data. The Colorado Rockies had an actual win% of only 0.407, the 2nd lowest, but their Pythagorean win% was significantly higher at 0.460. This deviation is largely due to the impact of the Rockies playing their home games at altitude in Denver. Games involving the Rockies that season averaged 19.1% more runs than the MLB average.
The other big anomaly that season was the Oakland A’s, which shows the effect of an extremely inconsistent season. The A’s had a league-best 59 wins and 36 losses at the All-Star break (win% = 0.621) and still led the Angels in the AL West in early August (9th Aug: 72 wins, 44 losses, win% = 0.621), but thereafter their season collapsed, with the A’s losing 30 of their last 46 games and only scraping into the post-season Wild Card game by winning their final regular-season game. But the A’s had been so good in the first two-thirds of the season that their season totals of runs scored and runs allowed still predicted that they should have had the best win% rather than finishing 10 games behind the Angels, their divisional rivals.

Do Squares Yield The Best Predictor?
James’s Pythagorean expected-wins model has stood the test of time as a very useful and accurate predictor, but its accuracy can be improved by adjusting the Pythagorean parameter (i.e. the power used). In the case of the 2014 regular season, predictive accuracy is improved by using a parameter of 1.810 (based on minimising total squared deviations). This is in line with various other studies such as Kaplan and Rich (2017)*, who found that the best-fit parameter in individual seasons from 2007 to 2016 varied from a minimum of 1.63 to a maximum of 1.96. (Kaplan and Rich’s method gives a Pythagorean parameter of 1.79 for the 2014 season, which is very close to my own findings.) All of which goes to confirm that James’s original insight back in the 1970s still remains a very good approximation to MLB reality 40 years later.
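For anyone wanting to replicate that fitting exercise, the sketch below shows one way of choosing the parameter by minimising the total squared deviation between actual and Pythagorean win% across teams. Only the two teams quoted above are included here as placeholders; you would need the full 30-team data for 2014 to reproduce the 1.810 figure.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Runs scored, runs allowed and actual win% by team
# (only the Angels and Brewers shown; substitute the full 30-team 2014 data)
rs = np.array([773, 650])
ra = np.array([630, 657])
actual_win_pct = np.array([0.605, 0.506])

def total_squared_deviation(exponent):
    predicted = rs ** exponent / (rs ** exponent + ra ** exponent)
    return np.sum((predicted - actual_win_pct) ** 2)

# Search for the exponent that minimises the total squared deviation
best = minimize_scalar(total_squared_deviation, bounds=(1.0, 3.0), method="bounded")
print(f"Best-fit Pythagorean parameter: {best.x:.3f}")
```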

Does the Pythagorean Expected-Wins Model Apply to Other Team Sports?
There have been attempts to apply James’s model to other team sports. Kaplan and Rich report results for 2007 – 2016 for the other major leagues in their study. They found that the NFL (American Football) and NHL (ice hockey) also work reasonably well using squared scores, although the expected-wins model works better in the NFL with a Pythagorean parameter of around 2.8. Predictive accuracy in the NBA (basketball) requires a parameter in the range 12 – 14. Kaplan and Rich show that the differences in the Pythagorean parameter across sports depend on the average score per game and the average winning margin.
And what about (association) football? The first complication is that football allows for tied games. It is not unusual for 20% – 25% of games to finish tied. This is further complicated by football’s 3-1-0 points system. I have found that the most useful way to apply the Pythagorean expected-wins model in European football is to treat tied games as “half-wins” for the purpose of calculating team win%. The alternative is to use the points percentage (i.e. total league points as a proportion of the maximum attainable). However, you typically find that, even with the adjustment for tied games, squared scores do not predict very well at the extremes. Table 2 provides a good example of the problem using the FA Premier League for 2013/14. Using squared goals massively over-predicts the win% of the top three teams and under-predicts the win% of the bottom three.

Table 2: FA Premier League, England, 2013/14

Blog 18.06 Graphic (Table 2)

Predictive accuracy in European football is greatly improved by using a Pythagorean parameter much closer to unity. Using the method of minimising total squared deviations, I find that 1.232 works best for the FA Premier League that season. A Pythagorean parameter close to unity in European football fits with the common finding that goal difference is the best simple score-based predictor of league performance. So there really is no need to complicate things; in football, just use goal difference to predict league performance.
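To illustrate the draws-as-half-wins adjustment and the effect of the smaller exponent, here is a short sketch using a hypothetical top-of-table record; the goals and results figures are invented for illustration, not taken from the 2013/14 table.

```python
def football_win_pct(wins, draws, games=38):
    """Win% with tied games counted as half-wins."""
    return (wins + 0.5 * draws) / games

def pythag_win_pct(goals_for, goals_against, exponent):
    return goals_for ** exponent / (goals_for ** exponent + goals_against ** exponent)

# Hypothetical top-three-style record: 25 wins, 8 draws, 80 goals for, 35 against
actual = football_win_pct(25, 8)
squared = pythag_win_pct(80, 35, exponent=2.0)    # classic squared-goals version
fitted = pythag_win_pct(80, 35, exponent=1.232)   # parameter fitted for 2013/14

print(f"Actual win% (draws as half-wins): {actual:.3f}")   # 0.763
print(f"Pythagorean win%, exponent 2:     {squared:.3f}")  # over-predicts
print(f"Pythagorean win%, exponent 1.232: {fitted:.3f}")   # much closer
```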

The Bottom Line
Ultimately what is transferable from MLB across all team sports is not the specifics of the Pythagorean expected-wins model per se. Rather it is the discipline of projecting the expected performance gains from any significant player recruitment decision. Given the size of the financial commitments involved in the salary, acquisition and development costs of elite players, it is only rational to try to project the expected benefits. Some will argue that sport is different and that the expected impact of a new player cannot be quantified. But as soon as you have signed the contract, you have quantified the value of that player financially, irrespective of whether or not you believe sport is different. Costs are costs in sport as in business. Where sport differs is in the resistance of some to subjecting their expectations of performance gains to due diligence. You would expect the financial director to subject any other major investment by your organisation to proper due diligence using project appraisal techniques. So why not apply the same logic to sporting investments? That, ultimately, was the whole point of the Moneyball story, which popularised the strategic possibilities of the Pythagorean expected-wins model.

*Edward H. Kaplan and Candler Rich, ‘Decomposing Pythagoras’, Journal of Quantitative Analysis in Sports, vol. 13, no. 4 (2017), pp. 141-149.

Excelling in Analytics

Executive Summary
• Effective data analysts need to have great Excel skills; expect 80% of your analytics to be conducted in Excel
• Too often data analysts do not receive enough training in Excel, especially at universities, partly because of “software snobbery” and partly because of a lack of appreciation of Excel’s extensive analytical functionality
• The six specific functions in Excel with which every analyst should be very familiar are:
o Pivot Tables
o Custom Sort and Filter
o Conditional Formatting
o Graphics
o Formulas Menu
o Add-Ins

I’ve been asked many times over the years for advice on how to be an effective data analyst. I always stress the importance of having really good relationships with the decision-makers for whom you are working. Always ensure before you start any data analysis that you are clear on how your analysis is going to be used – which decisions is it going to feed into? Analytics after all is analysis with real-world purpose.
The other practical advice I regularly give data analysts is to ensure that you have great Excel skills. I usually support this with a couple of anecdotes. One is that of an economics graduate with an excellent degree from a well-respected business school who was turned down for a post with a central bank because of a lack of Excel skills. The other is about a meeting I attended, hosted by the local branch of the OR Society a couple of years ago, at which four data analysts were invited to talk about their work experiences. During the Q&A they were asked what they wished they had done more of at university, and all four agreed that they wished they had been taught Excel. They estimated that 80% of their work was done in Excel and only 20% in more specialised software, yet all had only been taught how to use specialised statistical software at university. These analysts had gone to different universities and studied different courses, but their experiences were common to most data analysts.
In part the lack of Excel training at universities is “software snobbery” – it’s sexier to teach more advanced software such as R, SPSS and SAS. But I also think that it is because of a misperception that Excel has limited analytical functionality. Nothing could be wider of the mark. In my own case I use Microsoft Access to store and combine datasets; I use Excel to do the bulk of my data manipulation and analysis; and for multiple regression and other advanced statistical techniques I mostly use PC-Give (an econometrics package that I have used for years and which is particularly useful for investigating structural breaks in relationships).
There are many reasons why Excel is so widely used by data analysts. Microsoft Excel and Microsoft Office are so widely available that there are very few organisations that do not operate in a Microsoft environment. Excel is very convenient for basic data preparation. It is easy to import data into Excel, clean it up, transform it as required, and export it to other packages. But Excel also has extensive analytical functionality, as I will discuss further below, and can be supplemented by a wide range of add-ins provided by Microsoft and other sources. Excel is particularly suited to analysts who prefer the transparency and flexibility of programming within a spreadsheet environment to meet their specific needs as they arise. If, like me, you prefer to check the output of your programming instantly with a real example, then Excel is for you. And, crucially, given that most of your colleagues will have Excel on their PCs, laptops, tablets and mobile devices, it becomes a convenient communication tool. You might find it desirable at times to send your spreadsheets to colleagues, and it is quite common for organisations to track their KPIs graphically in dashboards constructed in Excel.
There are six specific functions in Excel with which I think every data analyst should be very familiar and which I use extensively in my own work in sports analytics. The list is by no means comprehensive, and I am sure others would consider other Excel functions to be just as important, if not more so, but these are six functions without which my data analysis for teams would be much more time-consuming and probably much less effective.

1. Pivot Tables
Pivot tables are a very efficient means of combining rows of data and calculating summary statistics. They are particularly useful when you are dealing with large datasets, where it could otherwise take considerable time to sort the data into sub-groups and then calculate summary statistics for each sub-group. I hate to think of the amount of time I wasted in my early days doing just that, unaware that the Pivot Table function would do it in seconds. The example below is taken from my season review of the 2014/15 Aviva Premiership. I had a summary row of KPIs for every team in every regular-season game. Only 264 rows of data (12 teams x 22 games), but it would still have taken several minutes to sort the data by team and calculate season averages for each team. The Pivot Table below did it all in seconds.

Figure 1: An Example of a Pivot Table

Blog 18.05 Graphic (Pivot Table)
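For readers who also script some of their checks outside Excel, the same grouping-and-summarising operation can be sketched in a few lines of Python with pandas; the team names, column names and values below are invented for illustration, not my actual KPI set.

```python
import pandas as pd

# One summary row of KPIs per team per game, mirroring the layout described above
# (team names, columns and values are illustrative only)
games = pd.DataFrame({
    "team":    ["Bath", "Bath", "Saracens", "Saracens"],
    "points":  [24, 31, 28, 19],
    "tackles": [141, 127, 156, 134],
})

# The Pivot Table equivalent: season averages of each KPI for each team
season_averages = games.pivot_table(index="team",
                                    values=["points", "tackles"],
                                    aggfunc="mean")
print(season_averages)
```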

2. Custom Sort and Filter
The Custom Sort and Filter icon in the Home menu is definitely one of my most clicked icons in Excel. The Custom Sort function allows you to order your rows by multiple criteria. I am forever using it to sort team and player performance data, which I load as it becomes available but will often want grouped by team/player and ordered by date. The Filter tool is also very useful. It allows you to select sub-samples of relevant data, again using multiple criteria. Only yesterday I used Filter to extract data on every possession in the Aviva Premiership this season. My full dataset for this season has around 140k rows of data, from which I extracted data on 6,550 possessions. It took two clicks to get the possession data using Filter. I copied and pasted it into another spreadsheet and was ready to roll on analysing these possessions. It is also worth remembering that there is a Filter option within Pivot Tables, which is particularly useful if you want to analyse different types of sub-samples. Instead of filtering the data to create a spreadsheet for each sub-sample and then producing pivot tables, you can work with the initial spreadsheet of data throughout and just filter when you are constructing your pivot tables.

3. Conditional Formatting
Conditional formatting is a very useful facility available in the Home menu to visualise differences in your data using colours, data bars or a variety of icons including traffic lights and flags. I make very extensive use of colour-coding of KPIs using Conditional Formatting. I have included below an extract from a recent analysis of the EFL Championship. It immediately highlights the tactical differences between teams as regards whether or not they play a possession-based style (e.g. Brentford compared to Burton Albion and Cardiff City), as well as highlighting that Birmingham City and Burton Albion rank poorly in some of the critical shooting KPIs.

Figure 2: Colour-Coding KPIs using Conditional Formatting

Blog 18.05 Graphic (Cond Format)

4. Graphics
One of the main complaints that I and other users had with the early versions of Excel was that the graphics facility was limited in its options and not user-friendly. I found that once I had created a graphic that did the job, I would go back and just copy and edit it whenever I needed that type of graphic again. But as data visualisation has become more and more important, Microsoft have massively improved the graphics facility in Excel, which can now generate a wide variety of graphics that are easily edited with pull-down menus. On my last count I found 53 basic graphical templates available in Excel. Excel’s graphics facility now also has smart functionality: if you highlight the data to be graphed and then click on the Recommended Charts icon in the Insert menu, Excel will identify the most useful graphic templates for your specific data.

5. Formulas Menu
The functionality in the Excel Formulas menu is often not fully appreciated. There is so much more available in Excel’s Function Library beyond the extensive array of data analysis functionality in Math & Trig, Statistical (found in More Functions) and Financial. The functionality in Logical and Text can be very powerful for editing textual data.

Figure 3: Excel’s Function Library

Blog 18.05 Graphic (Formulas)

6. Add-Ins
Excel comes with some optional functionality in the form of add-ins that need to be loaded via the File menu (select Options, then Add-Ins, then Manage Excel Add-Ins, then tick the boxes for the add-ins you want to load) when first used. There are two add-ins that I use frequently – Data Analysis and Solver. The Data Analysis add-in is a set of statistical macros that provides an efficient one-stop facility for undertaking a number of related statistical procedures. Options in Data Analysis include ANOVA (both single-factor and two-factor versions), Correlation, Regression and t-Tests (both for paired and independent samples), amongst many others. The Solver add-in is an optimisation facility which is very useful not only for undertaking standard OR problems but also for constructing a variety of more advanced statistical methods, such as those using Maximum Likelihood (e.g. logistic regression). There are also a variety of “unofficial” add-ins available for Excel. I use several add-ins provided by Conrad Carlberg to accompany his brilliant guides to doing analytics in Excel (see below).
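To see what “Maximum Likelihood via an optimiser” amounts to in the logistic regression case, here is a minimal sketch of the same idea outside Excel: the negative log-likelihood is written down explicitly and handed to a general-purpose optimiser, which is essentially the job Solver is given in the spreadsheet version. The data and starting values are made up purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data: a single predictor x and a binary outcome y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0,   0,   1,   0,   1,   1])

def neg_log_likelihood(params):
    """Negative log-likelihood of a simple logistic regression of y on x."""
    intercept, slope = params
    p = 1.0 / (1.0 + np.exp(-(intercept + slope * x)))
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Minimising the negative log-likelihood is Maximum Likelihood estimation,
# which is what Solver is asked to do in the Excel set-up described above
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
print("Estimated intercept and slope:", result.x)
```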

So my basic message to data analysts, both those just starting out and old hands, is to explore the functionality of Excel regularly. You will often be surprised at just how much Excel can do for you. And to educators, a plea: ensure that your students get a good introduction to Excel. It is a basic life skill if you are going to work anywhere that uses data, especially if you are going to have any responsibility for managing performance.

 

Some Further Reading on Excel for Analytics

There are a series of books by Conrad Carlberg providing excellent introductions to analytics using Excel:

Conrad Carlberg, Statistical Analysis: Microsoft Excel 2013, Que, Indianapolis, IN, 2014.

Conrad Carlberg, Decision Analytics: Microsoft Excel, Que, Indianapolis, IN, 2014.

Conrad Carlberg, Predictive Analytics: Microsoft Excel, Que, Indianapolis, IN, 2013.

For a comprehensive coverage of analytical methods using Excel (with excellent appendices on the basics of Excel and Access), try:

J. D. Camm, J. J. Cochran, M. J. Fry, J. W. Ohlmann, D. R. Anderson, D. J. Sweeney and T. A. Williams, Essentials of Business Analytics (2nd edn), Cengage Learning, Boston, MA, 2017.
Extensive but expensive!