Small is Beautiful: Big-Data Analytics and the Big-to-Small Translation Problem

Originally Written: January 2018

Happy New Year. And apologies for the lack of posts on Winning With Analytics over the last year. Put it down to my Indiana Jones-type existence: a university prof by day and a sports data analyst by night. This duality of roles became even more hectic in 2017 as I returned to rugby union to work again with Brendan Venter, now Technical Director at London Irish, as well as assisting South Africa and Italy. I have also continued my work with AZ Alkmaar in Dutch football. To some I might seem a bit of a dilettante, trying to work simultaneously at an elite level in two different sports. Far from it. Many of the insights on game tactics and analytical methods transfer readily between the two sports. The last 12 months have probably been one of my most productive periods in developing my understanding of how best to use data analytics as part of an evidence-based approach to coaching. I hope to share much of my latest thinking with you over the coming months in regular posts.

Executive Summary
• Data analytics is suffering from a fixation with big-data analytics.
• Big-data analytics can be a very powerful signal-extraction tool to discover regularities in the data.
• But big data exacerbates the big-to-small translation problem: context-generic, big-data statistical analysis must be translated into practical solutions to small-data (i.e. unique), context-specific decision problems.
• Sports analytics is most effective when the analyst understands the specific operational context of the coach, produces relevant data analysis and translates that analysis into practical recommendations.

The growth in data analytics has been closely associated with the emergence of big data. Originally “big data” referred to those really, really big databases, so big that they created significant hardware capacity problems and required clusters of computers working together. But these days the “big” in big data is, much like beauty, in the eye of the beholder. IBM categorise big-data analytics in terms of the four V’s: Volume (scale of data), Velocity (analysis of streaming data), Variety (different forms of data), and Veracity (uncertainty of data). The four V’s capture the core problem of big-data analytics: analysing large datasets that are growing exponentially, with data captured from multiple sources of varying quality and reliability. I always like to add a fifth V, Value. Big-data analytics must be relevant to the end-user, providing an evidential base to support the decision-making process.

Sports analytics, just like other applications of data analytics, seems to have been bitten by the big-data bug. In my presentation last November at the 4th Annual Sportdata & Performance Forum held in Zurich, I called it the “big-data analytics fixation”. I don’t work with particularly big datasets, certainly not big in the sense of exceeding the capacity of a reasonably powerful PC or laptop. The basic XML file produced by Opta for a single football match contains around 250k data points, so a database covering all matches in a football league for one season (several hundred matches) runs to around 100m data points. This is pretty small compared to some of the datasets used in business analytics, but sizeable enough to have totally transformed the type of data analysis I am now able to undertake. Even so, I would argue very strongly that the basic principles of sports analytics remain unchanged irrespective of the size of the dataset with which the analyst is working.
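To make that volume concrete, here is a minimal sketch of how one might count the data points in an event-style match file and scale up to a season. It assumes a generic XML layout in which each event element carries its data as attributes; the file name, structure, and the 380-match season (a 20-team league, home and away) are illustrative assumptions, not Opta’s actual schema.

```python
import xml.etree.ElementTree as ET

def count_data_points(path):
    """Count every attribute value in an event-style match XML file.

    A rough proxy for 'data points': each event element carries a handful
    of attributes (event type, player, pitch coordinates, timestamp, ...),
    and nested qualifier elements add attributes of their own.
    """
    tree = ET.parse(path)
    return sum(len(elem.attrib) for elem in tree.getroot().iter())

# Hypothetical usage: one match file, scaled up to a season.
# points_per_match = count_data_points("match_events.xml")  # ~250k per match
# season_total = points_per_match * 380                     # ~95m for a 20-team league
```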

Big-data analytics exacerbates what I call the big-to-small translation problem. Big-data analytics is a very powerful signal-extraction tool for discovering regularities in the data. Like all statistical modelling, it attempts to decompose observed data into systematic variation (signal) and random variation (noise). The systematic variation captures the context-generic factors common to all the observations in a dataset, while the random variation represents the context-specific factors unique to each individual observation. But while analytical modelling is context-generic, decisions are always unique and context-specific, so it is important to consider both the context-generic signal and the context-specific noise. This is the big-to-small translation problem. When making a decision in a specific context, understanding the noise can often be just as important as understanding the signal, if not more so. Noise is random variation relative to the dataset as a whole, but random does not necessarily mean inexplicable.
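A minimal numerical sketch of that decomposition, using invented shot-distance data (the variable names and numbers are illustrative only, not drawn from any real dataset): the fitted regression line is the context-generic signal, and each observation’s residual is its context-specific noise.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented example: shot distance (metres) vs. a goal-probability proxy.
distance = rng.uniform(5, 30, size=200)
quality = 0.9 - 0.025 * distance + rng.normal(0, 0.05, size=200)

# Signal: the systematic variation common to all observations.
slope, intercept = np.polyfit(distance, quality, 1)
fitted = intercept + slope * distance

# Noise: what the context-generic model cannot explain for each case.
residuals = quality - fitted

# For a specific decision, one case's residual is what matters: a shot
# that beats the model may reflect context (defensive positioning,
# shooter skill) that is explicable, just not captured in this dataset.
i = 0
print(f"fitted: {fitted[i]:.3f}, residual: {residuals[i]:+.3f}")
```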

I disagree profoundly with the rather grandiose end-of-theory and end-of-statistics claims made for big-data analytics. Chris Anderson, in an article on the Wired website back in 2008, claimed that the data deluge was making the scientific method obsolete. He argued that there was no longer any need for theory and models since, in the world of big data, correlation supersedes causation. Some have gone further and argued that big-data analytics represents the end of statistics: statistics is all about making inferences about a population from a sample, and sampling becomes irrelevant, so the argument goes, when we are working with population data rather than small samples. But evidence-based practice always requires an understanding of causation. Recommendations that do not take into account the specific operational context and the underlying behavioural causal processes are unlikely to carry much weight with decision-makers.
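A toy simulation (all numbers invented) makes the point about causation concrete: when a hidden confounder drives two otherwise unrelated variables, their correlation is strong and stable at any sample size, so working with “population-scale” data does nothing to reveal that the relationship is not causal.

```python
import numpy as np

rng = np.random.default_rng(7)

# A confounder drives both x and y; x has no causal effect on y.
# More data sharpens the estimate but never exposes the confounding.
for n in (1_000, 1_000_000):
    confounder = rng.normal(size=n)
    x = confounder + rng.normal(size=n)
    y = confounder + rng.normal(size=n)
    r = np.corrcoef(x, y)[0, 1]
    print(f"n={n:>9,}: correlation = {r:.3f}")  # ~0.5 at every n
```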

There is a growing awareness in sports analytics of the big-to-small translation problem. Indeed, the acceptance by coaches of data analytics as an important source of evidence, complementing video analysis and scouting, depends crucially on analysts being able to translate the results of their data analysis into context-specific recommendations: player recruitment targets, game tactics against specific opponents, or training session priorities. The translation problem was one of the themes to emerge from the presentations and discussions at the Sportdata & Performance Forum in November 2017 (yet again an excellent and very informative event organised by Edward Abankwa and his team at the Pinnacle Group). As one participant put it so well, “big data is irrelevant unless you can contextualise it”. In a similar vein, a representative of a company supplying wearable technologies commented that their objective is “making big data personally relevant”. Sports analytics is most effective when the analyst understands the specific operational context of the coach, produces relevant data analysis that provides an appropriate evidential base to support the specific decision, and translates that analysis into practical recommendations to the coach on the best course of action.