Analytics and Context

Executive Summary

  • Context is crucial in data analytics because the purpose of data analytics is always practical – to improve future performance
  • The context of a decision is the totality of the conditions that constitute the circumstances of the specific decision
  • The three key characteristics of the context of human behaviour in a social setting are (i) uniqueness; (ii) “infinitiveness”; and (iii) uncertainty
  • There are five inter-related implications for data analysts if they accept the critical importance of context:

Implication 1: The need to recognise that datasets and analytical models are always human-created “realisations” of the real world.

Implication 2: All datasets and analytical models are de-contextualised abstractions.

Implication 3: Data analytics should seek to generalise from a sample rather than testing the validity of universal hypotheses.

Implication 4: Given that every observation in a dataset is unique in its context, it is vital that exploratory data analysis investigates whether or not a dataset fulfils the similarity and variability requirements for valid analytical investigation.

Implication 5: It can be misleading to consider analytical models as comprising dependent and independent variables.

As discussed in a previous post, “What is data analytics?” (11th Sept 2023), data analytics is best defined as data analysis for practical purpose. The role of data analytics is to use data analysis to provide an evidential basis for managers to make evidence-based decisions on the most effective intervention to improve performance. Academics do not typically do data analytics since they are mostly using empirical analysis to pursue disciplinary, not practical, purposes. As soon as you move from disciplinary purpose to practical purpose, then context becomes crucial. In this post I want to explore the implications for data analytics of the importance of context.

              The principal role of management is to maintain and improve the performance levels of the people and resources for which they are responsible. Managers are constantly making decisions on how to intervene and take action to improve performance. To be effective, these decisions must be appropriate given the specific circumstances that prevail. This is what I call the “context” of the decision – the totality of the conditions that constitute the circumstances of the specific decision.

              In the case of human behaviour in a social setting, there are three key characteristics of the context:

1. Unique

Every context is unique. As Heraclitus famously remarked, “You can never step into the same river twice”. You as an individual will have changed by the time that you next step into the river, and the river itself will also have changed – you will not be stepping into the same water in exactly the same place. So too with any decision context; however similar to previous decision contexts, there will be some unique features, including of course that the decision-maker will have experience of the decision from the previous occasion. In life, change is the only constant. From this perspective, there can never be universality in the sense of prescriptions on what to do for any particular type of decision irrespective of the specifics of the particular context. A decision is always context-specific and the context is always unique.

2. “Infinitive”

By “infinitive” I mean that there are an infinite number of possible aspects of any given decision situation. There is no definitive set of descriptors that can capture fully the totality of the context of a specific decision.

3. Uncertainty

All human behaviour occurs in the context of uncertainty. We can never fully understand the past, which will always remain contestable to some extent with the possibility of alternative explanations and interpretations. And we can never know in advance the full consequences of our decisions and actions because the future is unknowable. Treating the past and future as certain or probabilistic disguises but does not remove uncertainty. Human knowledge is always partial and fallible.

              Many of the failings of data analytics derive from ignoring the uniqueness, “infinitiveness” and uncertainty of decision situations. I often describe it as the “Masters of the Universe” syndrome – the belief that because you know the numbers, you know with certainty, almost bordering on arrogance, what needs to be done, and all will be well with the world if only managers would do what the analysts tell them to do. This lack of humility on the part of analysts puts managers offside and typically leads to analytics being ignored. Managers are experts in context. Their experience has given them an understanding, often intuitive, of the impact of context. Analysts should respect this knowledge and tap into it. Ultimately the problem lies in treating social human beings who learn from experience as if they behave in a very deterministic manner similar to molecules. The methods that have been so successful in generating knowledge in the natural sciences are not easily transferable to the realm of human behaviour. Economics has sought to emulate the natural sciences in adopting a scientific approach to the empirical testing of economic theory. This has had an enormous impact, sometimes detrimental, on the mindset of data analysts given that a significant number of data analysts have a background in economics and econometrics (i.e. the application of statistical analysis to the study of economic data).

              So what are the implications if we as data analysts accept the critical importance of context? I would argue there are five inter-related implications:

Implication 1: The need to recognise that datasets and analytical models are always human-created “realisations” of the real world.

The “infinitiveness” of the decision context implies that datasets and analytical models are always partial and selective. There are no objective facts as such. Indeed the Latin root of the word “fact” is facere (“to make”). Facts are made. We frame the world, categorise it and measure it. Artists have always recognised that their art is a human interpretation of the world. The French Post-Impressionist painter, Paul Cézanne, described his paintings as “realisations” of the world. Scientists have tended to designate their models of the world as objective, which tends to obscure their interpretive nature. Scientists interpret the world just as artists do, albeit with very different tools and techniques. Datasets and analytical models are the realisations of the world by data analysts.

Implication 2: All datasets and analytical models are de-contextualised abstractions.

As realisations, datasets and analytical models are necessarily selective, capturing only part of the decision situation. As such they are always abstractions from reality. The observations recorded in a dataset are de-contextualised in the sense that they are abstracted from the totality of the decision context.

Implication 3: Data analytics should seek to generalise from a sample rather than testing the validity of universal hypotheses.

There are no universal truths valid across all contexts. The disciplinary mindset of economics is quite the opposite. Economic behaviour is modelled as constrained optimisation by rational economic agents. Theoretical results are derived formally by mathematical analysis and their validity in specific contexts investigated empirically, in much the same way as natural science uses theory to hypothesise outcomes in laboratory experiments. Recognising the unique, “infinitive” and uncertain nature of the decision context leads to a very different mindset, one based on intellectual humility and the fallibility of human knowledge. We try to generalise from similar previous contexts to unknown, yet to occur, future contexts. These generalisations are, by their very nature, uncertain and fallible.

Implication 4: Given that every observation in a dataset is unique in its context, it is vital that exploratory data analysis investigates whether or not a dataset fulfils the similarity and variability requirements for valid analytical investigation.

Every observation in a dataset is an abstraction from a unique decision context. One of the critical roles of the Exploration stage of the analytics process is to ensure that the decision contexts of each observation are sufficiently similar to be treated as a single collective (i.e. sample) to be analysed. The other side of the coin is checking the variability. There needs to be enough variability between the decision contexts so that the analyst can investigate which aspects of variability in the decision contexts are associated with the variability in the observed outcomes. But if the variability is excessive, this may call into question the degree of similarity and whether or not it is valid to assume that all of the observations have been generated by the same general behaviour process. Excessive variability (e.g. outliers) may represent different behavioural processes, requiring the dataset to be analysed as a set of sub-samples rather than as a single sample.

Implication 5: It can be misleading to consider analytical models as comprising dependent and independent variables.

Analytical models are typically described in statistics and econometrics as consisting of dependent and independent variables. This embodies a rather mechanistic view of the world in which the variation of observed outcomes (i.e. the dependent variable) is to be explained by the variation in the different aspects of the behavioural process as measured (or categorised) by the independent variables. But in reality these independent variables are never completely independent of each other. They share information (often known as “commonality”) to the extent that for each observation the so-called independent variables are extracted from the same context. I prefer to think of the variables in a dataset as situational variables – they attempt to capture the most relevant aspects of the unique real-world situations from which the data has been extracted but with no assumption that they are independent; indeed quite the opposite. And, given the specific practical purpose of the particular analytics project, one or more of these situational variables will be designated as outcome variables.
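
One simple way to see this “commonality” in practice is to measure how much information the situational variables share. Below is a minimal Python sketch, using synthetic data and hypothetical variable names, that computes pairwise correlations and variance inflation factors (VIFs); high values of either indicate that the so-called independent variables are anything but independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Three hypothetical situational variables that deliberately share information
# ("commonality"): "budget" and "team_size" both depend on a common factor.
common = rng.normal(size=n)
df = pd.DataFrame({
    "experience": rng.normal(size=n),
    "budget":     common + 0.5 * rng.normal(size=n),
    "team_size":  common + 0.5 * rng.normal(size=n),
})

# Pairwise correlations: large off-diagonal values signal shared information
print(df.corr().round(2))

# Variance inflation factor (VIF): how well each variable is explained by the others
for col in df.columns:
    others = df.drop(columns=col).to_numpy()
    X = np.column_stack([np.ones(n), others])
    y = df[col].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - ((y - X @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    print(f"{col}: VIF = {1 / (1 - r2):.2f}")
```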

Read Other Related Posts

What is Data Analytics? 11th Sept 2023

The Six Stages of the Analytics Process, 20th Sept 2023

The Problem with Outliers

Executive Summary

  • Outliers are unusually extreme observations that can potentially cause two problems:
    1. Invalidating the homogeneity assumption that all of the observations have been generated by the same behavioural processes; and
    2. Unduly influencing any estimated model of the performance outcomes
  • A crucial role of exploratory data analysis is to identify possible outliers (i.e. anomaly detection) to inform the modelling process
  • Three useful techniques for identifying outliers are exploratory data visualisation, descriptive statistics and Marsh & Elliott outlier thresholds
  • It is good practice to report estimated models including and excluding the outliers in order to understand their impact on the results

A key function of the Exploratory stage of the analytics process is to understand the distributional properties of the dataset to be analysed. Part of the exploratory data analysis is to ensure that the dataset meets both the similarity and variability requirements. There must be sufficient similarity in the data to make it valid to treat the dataset as homogeneous with all of the observed outcomes being generated by the same behavioural processes (i.e. structural stability). But there must also be enough variability in the dataset both in the performance outcomes and the situational variables potentially associated with the outcomes so that relationships between changes in the situational variables and changes in performance outcomes can be modelled and investigated.

Outliers are unusually extreme observations that call into question the homogeneity assumption as well as potentially having an undue influence on any estimated model. It may be that the outliers are just extreme values generated by the same underlying behavioural processes as the rest of the dataset. In this case the homogeneity assumption is valid and the outliers will not bias the estimated models of the performance outcomes. However, the outliers may be the result of very different behavioural processes, invalidating the homogeneity assumption and rendering the estimated results of limited value for actionable insights. The problem with outliers is that we just do not know whether or not the homogeneity assumption is invalidated. So it is crucial that the exploratory data analysis identifies possible outliers (what is often referred to as “anomaly detection”) to inform the modelling strategy.

The problem with outliers is illustrated graphically below. Case 1 is the baseline with no outliers. Note that the impact (i.e. slope) coefficient of the line of best fit is 1.657 and the goodness of fit is 62.9%.

Case 2 is what I have called “homogeneous outliers” in which a group of 8 observations have been included that have unusually high values but have been generated by the same behavioural process as the baseline observations. In other words, there is structural stability across the whole dataset and hence it is legitimate to estimate a single line of best fit. Note that the inclusion of the outliers slightly increases the estimated impact coefficient to 1.966  but the goodness of fit increases substantially to 99.6%, reflecting the massive increase in the variance of the observations “explained” by the regression line.

Case 3 is that of “heterogeneous outliers” in which the baseline dataset has now been expanded to include a group of 8 outliers generated by a very different behavioural process. The homogeneity assumption is no longer valid so it is inappropriate to model the dataset with a single line of best fit. If we do so, then we find that the outliers have an undue influence with the impact coefficient now estimated to be 5.279, more than double the size of the estimated impact coefficient for the baseline dataset excluding the outliers. Note that there is a slight decline in the goodness of fit to 97.8% in Case 3 compared to Case 2, partly due to the greater variability of the outliers as well as the slightly poorer fit for the baseline observations of the estimated regression line.

Of course, in this artificially generated example, it is known from the outset that the outliers have been generated by the same behavioural process as the baseline dataset in Case 2 but not in Case 3. The problem we face in real-world situations is that we do not know if we are dealing with Case 2-type outliers or Case 3-type outliers. We need to explore the dataset to determine which is more likely in any given situation.
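
For readers who want to reproduce the flavour of this illustration, here is a minimal Python sketch along the same lines. The data are synthetic and the coefficients will not match the figures quoted above; the point is simply that outliers generated by the same process leave the slope largely intact, while outliers from a different process drag it well away from the baseline value.

```python
import numpy as np

rng = np.random.default_rng(1)

def slope_and_r2(x, y):
    """Line of best fit (simple OLS) and goodness of fit."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = slope * x + intercept
    r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return slope, r2

# Case 1: baseline observations generated by a single behavioural process
x_base = rng.uniform(0, 10, 40)
y_base = 1.5 * x_base + rng.normal(0, 3, 40)

# Case 2: "homogeneous outliers" - extreme values from the SAME process
x_homo = rng.uniform(30, 40, 8)
y_homo = 1.5 * x_homo + rng.normal(0, 3, 8)

# Case 3: "heterogeneous outliers" - extreme values from a DIFFERENT process
x_het = rng.uniform(30, 40, 8)
y_het = 6.0 * x_het + rng.normal(0, 3, 8)

for label, (x, y) in {
    "Case 1 (baseline)": (x_base, y_base),
    "Case 2 (+ homogeneous outliers)": (np.r_[x_base, x_homo], np.r_[y_base, y_homo]),
    "Case 3 (+ heterogeneous outliers)": (np.r_[x_base, x_het], np.r_[y_base, y_het]),
}.items():
    slope, r2 = slope_and_r2(x, y)
    print(f"{label}: slope = {slope:.3f}, R² = {r2:.1%}")
```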

There are a number of very simple techniques that can be used to identify possible outliers. Three of the most useful are:

  1. Exploratory data visualisation
  2. Summary statistics
  3. Marsh & Elliott outlier thresholds

1. Exploratory data visualisation

Histograms and scatterplots should always be the first step in any exploratory data analysis, used to “eyeball” the data and get a sense of the distributional properties of the data and the pairwise relationships between all of the measured variables.
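
As a minimal illustration (using a synthetic dataset with hypothetical variable names), pandas and matplotlib make this first “eyeballing” step a couple of lines of code:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a real dataset: one outcome and two situational variables
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "situational_1": rng.normal(50, 10, 200),
    "situational_2": rng.uniform(0, 1, 200),
})
df["outcome"] = 2 * df["situational_1"] + 30 * df["situational_2"] + rng.normal(0, 10, 200)

# Histograms of every variable: a first look at the shape of each distribution
df.hist(bins=20, figsize=(9, 3))

# Scatterplot matrix: pairwise relationships between all measured variables
pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.show()
```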

2. Summary statistics

Descriptive statistics provide a formalised summary of the distributional properties of variables. Outliers at one tail of the distribution will produce skewness that will result in a gap between the mean and median. If there are outliers in the upper tail, this will tend to inflate the mean relative to the median (and the reverse if the outliers are in the lower tail). It is also useful to compare the relative dispersion of the variables. I always include the coefficient of variation (CoV) in the reported descriptive statistics.

CoV = Standard Deviation/Mean

CoV uses the mean to standardise the standard deviation for differences in measurement scales so that the dispersion of variables can be compared on a common basis. Outliers in any particular variable will tend to increase CoV relative to other variables.
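
A minimal sketch of such a summary in Python, using a synthetic variable with a few upper-tail outliers, might look as follows; the exact statistics reported will of course depend on your own dataset and tooling.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# A variable with a handful of outliers in the upper tail
values = pd.Series(np.r_[rng.normal(100, 10, 95), [180, 195, 210, 220, 240]])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "std": values.std(),
    "CoV": values.std() / values.mean(),   # coefficient of variation
    "skewness": values.skew(),
}
for name, stat in summary.items():
    print(f"{name:>8}: {stat:.3f}")
# Upper-tail outliers inflate the mean relative to the median and increase CoV.
```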

3. Marsh & Elliott outlier thresholds

Marsh & Elliott define outliers as any observation that lies more than 150% of the interquartile range beyond either the first quartile (Q1) or the third quartile (Q3).

Lower outlier threshold: Q1 – [1.5 × (Q3 – Q1)]

Upper outlier threshold: Q3 + [1.5 × (Q3 – Q1)]

I have found these thresholds to be useful rules of thumb to identify possible outliers.
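
A simple sketch of these thresholds in Python is shown below (the data are the same synthetic values as in the earlier summary-statistics example, and the function name is my own):

```python
import numpy as np
import pandas as pd

def outlier_thresholds(values: pd.Series) -> tuple[float, float]:
    """Lower and upper outlier thresholds at 1.5 x IQR beyond Q1 and Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

rng = np.random.default_rng(7)
values = pd.Series(np.r_[rng.normal(100, 10, 95), [180, 195, 210, 220, 240]])

lower, upper = outlier_thresholds(values)
flagged = values[(values < lower) | (values > upper)]
print(f"thresholds: [{lower:.1f}, {upper:.1f}], flagged {len(flagged)} possible outliers")
```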

Another very useful technique for identifying outliers is cluster analysis which will be the subject of a later post.

So what should you do if the exploratory data analysis indicates the possibility of outliers in your dataset? As the artificial example illustrated, outliers (just like multicollinearity) need not necessarily create a problem for modelling a dataset. The key point is that exploratory data analysis should alert you to the possibility of problems so that you are aware that you may need to take remedial actions when investigating the multivariate relationships between outcome and situational variables at the Modelling stage. It is good practice to report estimated models including and excluding the outliers in order to understand their impact on the results. If there appears to be a sizeable difference in one or more of the estimated coefficients when the outliers are included/excluded, then you should formally test for structural instability using F-tests (often called Chow tests). Testing for structural stability in both cross-sectional and longitudinal/time-series data will be discussed in more detail in a future post. Some argue for dropping outliers from the dataset, but personally I am loath to discard any data which may contain useful information. Knowing the impact of the outliers on the estimated coefficients can be useful information and, indeed, it may be that further investigation into the specific conditions of the outliers could prove to be of real practical value.
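
Below is a hedged sketch of this practice in Python using statsmodels and synthetic data: fit the model with and without the suspected outliers, compare the coefficients, and then compute the standard Chow F-statistic for structural instability. The data and variable names are hypothetical; in a real project you would use your own dataset and model specification.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)

# Baseline observations and a group of suspected outliers (hypothetical data)
x1 = rng.uniform(0, 10, 40);  y1 = 1.5 * x1 + rng.normal(0, 3, 40)
x2 = rng.uniform(30, 40, 8);  y2 = 6.0 * x2 + rng.normal(0, 3, 8)

def fit(x, y):
    return sm.OLS(y, sm.add_constant(x)).fit()

base   = fit(x1, y1)                                  # outliers excluded
pooled = fit(np.r_[x1, x2], np.r_[y1, y2])            # outliers included
print("slope excluding outliers:", round(base.params[1], 3))
print("slope including outliers:", round(pooled.params[1], 3))

# Chow test for structural instability: do the two groups share the same coefficients?
outl = fit(x2, y2)
k = 2                                                 # parameters per model (intercept + slope)
rss_unrestricted = base.ssr + outl.ssr
n = len(x1) + len(x2)
F = ((pooled.ssr - rss_unrestricted) / k) / (rss_unrestricted / (n - 2 * k))
p = stats.f.sf(F, k, n - 2 * k)
print(f"Chow F = {F:.2f}, p-value = {p:.4f}")
```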

The two main takeaway points are that (1) a key component of exploratory data analysis should always be checking for the possibility of outliers; and (2) if there are outliers in the dataset, ensure that you investigate their impact on the estimated models you report. You must avoid providing actionable insights that have been unduly influenced by outliers that are not representative of the actual situation with which you are dealing.

The Reep Fallacy

Executive Summary

  • Charles Reep was the pioneer of soccer analytics, using statistical analysis to support the effectiveness of the long-ball game
  • Reep’s principal finding was that most goals are scored from passing sequences with fewer than five passes
  • Hughes and Franks have shown that Reep’s interpretation of the relationship between the length of passing sequences and goals scored is flawed – the “Reep fallacy” of analysing only successful outcomes
  • Reep’s legacy for soccer analytics is mixed; partly negative because of its association with a formulaic approach to tactics but also positive in developing a notational system, demonstrating the possibilities for statistical analysis in football and having a significant impact on practitioners

There have been long-standing “artisan-vs-artist” debates over how “the beautiful game” (i.e. football/soccer) should be played. In his history of tactics in football, Wilson (Inverting the Pyramid, 2008) characterised tactical debates as involving two interlinked tensions – aesthetics vs results and technique vs physique. Tactical debates in football have often focused on the relative merits of direct play and possession play. And the early developments in soccer analytics pioneered by Charles Reep were closely aligned with support for direct play (i.e. “the long-ball game”).

Charles Reep (1904 – 2002) trained as an accountant and joined the RAF, reaching the rank of Wing Commander. He said that his interest in football tactics began after attending a talk in 1933 by Arsenal’s captain, Charlie Jones. Reep developed his own notational system for football in the early 1950s. His first direct involvement with a football club was as part-time advisor to Brentford in spring 1951, helping them to avoid relegation from Division 1. (And, of course, these days Brentford are still pioneering the use of data analytics to thrive in the English Premier League on a relatively small budget.) Reep’s key finding was that most goals are scored from fewer than three passes. His work subsequently attracted the interest of Stan Cullis, manager in the 1950s of a very successful Wolves team. Reep published a paper (jointly authored with Benjamin) on the statistical analysis of passing and goals scored in 1968. He analysed nearly 2,500 games during his lifetime.

In their 1968 paper, Reep and Benjamin analysed 578 matches, mainly in Football League Division 1 and World Cup Finals between 1953 and 1967. They reported five key findings:

  • 91.5% of passing sequences have 3 completed passes or less
  • 50% of goals come from moves starting in the shooting area
  • 50% of shooting-area origin attacks come from regained possessions
  • 50% of goals conceded come from own-half breakdowns
  • On average, one goal is scored for every 10 shots at goal

Reep published another paper in 1971 on the relationship between shots, goals and passing sequences that excluded shots and goals that were not generated from a passing sequence. These results confirmed his earlier analysis, with passing sequences of 1–4 passes accounting for 87.6% of shots and 87.0% of goals scored. The tactical implications of Reep’s analysis seemed very clear – direct play with few passes is the most efficient way of scoring goals. Reep’s analysis was very influential. It was taken up by Charles Hughes, FA Director of Coaching and Education, who later conducted similar data analysis to that of Reep with similar results (but never acknowledged his intellectual debt to Reep). On the basis of his analysis, Hughes advocated sustained direct play to create an increased number of shooting opportunities.

Reep’s analysis was re-examined by two leading professors of performance analysis, Mike Hughes and Ian Franks, in a paper published in 2005. Hughes and Franks analysed 116 matches from the 1990 and 1994 World Cup Finals. They accepted Reep’s findings that around 80% of goals scored result from passing sequences of three passes or less. However, they disagreed with Reep’s interpretation of this empirical regularity as support for the efficacy of a direct style of play. They argued that it is important to take account of the frequency of different lengths of passing sequences as well as the frequency of goals scored from different lengths of passing sequences. Quite simply, since most passing sequences have fewer than five passes, it is no surprise that most goals are scored from passing sequences with fewer than five passes. I call this the “Reep fallacy” of only considering successful outcomes and ignoring unsuccessful outcomes. It is surprising how often in different walks of life people commit a similar fallacy by drawing conclusions from evidence of successful outcomes while ignoring the evidence of unsuccessful outcomes. Common sense should tell us that there is a real possibility of biased conclusions when you consider only biased evidence. Indeed Hughes and Franks found a tendency for scoring rates to increase as passing sequences get longer with the highest scoring rate (measured as goals per 1,000 possessions) occurring in passing sequences with six passes. Hughes and Franks also found that longer passing sequences (i.e. possession play) tend to produce more shots at goal but conversion rates (shots-goals ratio) are better for shorter passing sequences (i.e. direct play). However, the more successful teams are better able to retain possession with more longer passing sequences and better-than-average conversion rates.
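
The arithmetic behind the fallacy is easy to demonstrate. The Python sketch below uses purely hypothetical frequencies (not Reep’s or Hughes and Franks’ figures) to show how short passing sequences can account for around 80% of goals simply because they account for most possessions, even when the scoring rate per possession rises with sequence length.

```python
# Purely hypothetical frequencies, chosen only to illustrate the base-rate point.
possessions = {"0-2 passes": 60_000, "3-4 passes": 25_000, "5-6 passes": 10_000, "7+ passes": 5_000}
goals       = {"0-2 passes": 600,    "3-4 passes": 300,    "5-6 passes": 150,    "7+ passes": 75}

total_goals = sum(goals.values())
print(f"{'sequence':<12}{'share of goals':>16}{'goals per 1,000 possessions':>30}")
for length in possessions:
    share = goals[length] / total_goals          # conditioning only on goals scored
    rate = 1000 * goals[length] / possessions[length]   # scoring rate per possession
    print(f"{length:<12}{share:>16.0%}{rate:>30.1f}")

# Short sequences account for most goals simply because they account for most
# possessions; the scoring rate per possession can be flat or even rise with length.
```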

Reep remains a controversial figure in tactical analysis because of his advocacy of long-ball tactics. His interpretation of the relationship between the length of passing sequences and goals scored has been shown to be flawed, what I call the Reep fallacy of analysing only successful outcomes. Reep’s legacy to sports analytics is partly negative because of its association with a very formulaic approach to tactics. But Reep’s legacy is also positive. He was the first to develop a notational system for football and to demonstrate the possibilities for statistical analysis in football. And, crucially, Reep showed how analytics could be successfully employed by teams to improve sporting performance.

Competing on Analytics

Executive Summary

  • Tom Davenport, the management guru on data analytics, defines analytics competitors as organisations committed to quantitative, fact-based analysis
  • Davenport identifies five stages in becoming an analytical competitor: Stage 1: Analytically impaired; Stage 2: Localised analytics; Stage 3: Analytical aspirations; Stage 4: Analytical companies; Stage 5: Analytical competitors
  • In Competing on Analytics: The New Science of Winning, Davenport and Harris identify four pillars of analytical competition: distinctive capability; enterprise-wide analytics; senior management commitment; and large-scale ambition
  • The initial actionable insight that data analytics can help diagnose why an organisation is currently underperforming and prescribe how its future performance can be improved is the starting point of the analytical journey

Over the last 20 years, probably the leading guru on the management of data analytics in organisations has been Tom Davenport. He came to prominence with his article “Competing on Analytics” (Harvard Business Review, 2006) followed up in 2007 by the book, Competing on Analytics: The New Science of Winning (co-authored with Jeanne Harris). Davenport’s initial study focused on 32 organisations that had committed to quantitative, fact-based analysis, 11 of which he designated as “full-bore analytics competitors”. He identified three key attributes of analytics competitors:

  • Widespread use of modelling and optimisation
  • An enterprise approach
  • Senior executive advocates

Davenport found that analytics competitors had four sources of strength – the right focus, the right culture, the right people and the right technology. In the book, he distilled these characteristics of analytic competitors into the four pillars of analytical competition:

  • Distinctive capability
  • Enterprise-wide analytics
  • Senior management commitment
  • Large-scale ambition

Davenport identifies five stages in becoming an analytical competitor:

  • Stage 1: Analytically impaired
  • Stage 2: Localised analytics
  • Stage 3: Analytical aspirations
  • Stage 4: Analytical companies
  • Stage 5: Analytical competitors

Davenport’s five stages of analytical competition

Stage 1: Analytically Impaired

At Stage 1 organisations make negligible use of data analytics. They are not guided by any performance metrics and are essentially “flying blind”. What data they have are of poor quality, poorly defined and unintegrated. Their analytical journey starts with the question of what is happening in their organisation, which provides the driver to get more accurate data to improve their operations. At this stage, the organisational culture is “knowledge-allergic”, with decisions driven more by gut-feeling and past experience than by evidence.

Stage 2: Localised Analytics

Stage 2 sees analytics being pioneered in organisations by isolated individuals concerned with improving performance in those local aspects of the organisation’s operations with which they are most involved. There is no alignment of these initial analytics projects with overall organisational performance. The analysts start to produce actionable insights that are successful in improving performance. These local successes begin to attract attention elsewhere in the organisation. Data silos emerge with individuals creating datasets for specific activities and stored in spreadsheets. There is no senior leadership recognition at this stage of the potential organisation-wide gains from analytics.

Stage 3: Analytical Aspirations

Stage 3 in many ways marks the “big leap forward” with organisations beginning to recognise at a senior leadership level that there are big gains to be made from employing analytics across all of the organisation’s operations. But there is considerable resistance from managers with no analytics skills and experience who see their position as threatened. With some senior leadership support there is an effort to create more integrated data systems and analytics processes. Moves begin towards a centralised data warehouse managed by data engineers.

Stage 4: Analytical Companies

By Stage 4 organisations are establishing a fact-based culture with broad senior leadership support. The value of data analytics in these organisations is now generally accepted. Analytics processes are becoming embedded in everyday operations and seen as an essential part of “how we do things around here”. Specialist teams of data analysts are being recruited and managers are becoming familiar with how to utilise the results of analytics to support their decision making. There is a clear strategy on the collection and storage of high-quality data centrally with clear data governance principles in place.

Stage 5: Analytical Competitors

At Stage 5 organisations are now what Davenport calls “full-bore analytical competitors” using analytics not only to improve current performance of all of the organisation’s operations but also to identify new opportunities to create new sustainable competitive advantages. Analytics is seen as a primary driver of organisational performance and value. The organisational culture is fact-based and committed to using analytics to test and develop new ways of doing things.

To quote an old Chinese proverb, “a thousand-mile journey starts with a single step”. The analytics journey for any organisation starts with an awareness that the organisation is underperforming and data analytics has an important role in facilitating an improvement in organisational performance. The initial actionable insight that data analytics can help diagnose why an organisation is currently underperforming and prescribe how its performance can be improved in the future is the starting point of the analytical journey.

The Keys to Success in Data Analytics

Executive Summary

  • Data analytics is a very useful servant but a poor leader
  • There are seven keys to using data analytics effectively in any organisation:
  1. A culture of evidence-based practice
  2. Leadership buy-in
  3. Decision-driven analysis
  4. Recognition of analytics as a source of marginal gains
  5. Realisation that analytics is more than reporting outcomes
  6. Soft skills are crucial
  7. Integration of data silos
  • Effective analysts are not just good statisticians
  • Analysts must be able to engage with decision-makers and “speak their language”

Earlier this year, I gave a presentation to a group of data analysts in a large organisation. My remit was to discuss how data analytics can be used to enhance performance. They were particularly interested in the insights I had gained from my own experience both in business (my career started as an analyst in Unilever’s Economics Department in the mid-80s) and in elite team sports. I started off with my basic philosophy that “data analytics is a very useful servant but a poor leader” and then summarised the lessons I had learnt as seven keys to success in data analytics. Here are those seven keys to success.

1. A culture of evidence-based practice

Data analytics can only be effective in organisations committed to evidence-based practice. Using evidence to inform management decisions to enhance performance must be part of the corporate culture, the organisation’s way of doing things. The culture must be a process culture, by which I mean a deep commitment to doing things the right way. In a world of uncertainty we can never be sure that what we do will lead to the future outcomes we want and expect. We can never fully control future outcomes. Getting the process right, in the sense of using data analytics to make effective use of all the available evidence, will maximise the likelihood of an organisation achieving better performance outcomes.

2. Leadership buy-in

A culture of evidence-based practice can only thrive when supported and encouraged by the organisation’s leadership. A “don’t do as I do, do as I say” approach seldom works. Leaders must lead by example and continually demonstrate and extol the virtues of evidence-based practice. If a leader adopts the attitude that “I don’t need to know the numbers to know what the right thing is to do” then this scepticism about the usefulness of data analytics will spread throughout the organisation and fatally undermine the analytics function.

3. Decision-driven analysis

Data analytics is data analysis for practical purpose. The purpose of management one way or another is to improve performance. Every data analytics project must start with the basic question “what managerial decision will be impacted by the data analysis?”. The answer to the question gives the analytics project its direction and ensures its relevance. The analyst’s function is not to find out things that they think would be interesting to know but rather things that the manager needs to know to improve performance.

4. Recognition of analytics as a source of marginal gains

The marginal gains philosophy, which emerged in elite cycling, is the idea that making a large improvement in performance is often achieved as the cumulative effect of lots of small changes. The overall performance of an organisation involves a myriad of decisions and actions. Data analytics can provide a structured approach to analysing organisational performance, decomposing it into its constituent micro components, benchmarking these micro performances against past performance levels and the performance levels of other similar entities, and identifying the performance drivers. Continually searching for marginal gains fosters a culture of wanting to do better and prevents organisational complacency.

5. Realisation that analytics is more than reporting outcomes

In some organisations data analytics is considered mainly as a monitoring process, tasked with tracking key performance indicators (KPIs) and reporting outcomes often visually with performance dashboards. This is an important function in any organisation but data analytics is much more than just monitoring performance. Data analytics should be diagnostic, investigating fluctuations in performance and providing actionable insights on possible managerial interventions to improve performance.

6. Soft skills are crucial

Effective analysts must have the “hard” skills of being good statisticians, able to apply appropriate analytical techniques correctly. But crucially effective analysts must also have the “soft” skills of being able to engage with managers and speak their language. Analysts must understand the managerial decisions that they are expected to inform, and they must be able to tap into the detailed knowledge of managers. Analysts must avoid being seen as the “Masters of the Universe”. They must respect the managers, work for them and work with them. Analysts should be humble. They must know what they bring to the table (i.e. the ability to forensically explore data) and what they don’t (i.e. experience and expertise of the specific decision context). Effective analytics is always a team effort.

7. Integration of data silos

Last but not least, once data analytics has progressed in an organisation beyond a few individuals working in isolation and storing the data they need in their own spreadsheets, there needs to be a centralised data warehouse managed by experts in data management. Integrating data silos opens up new possibilities for insights. This is a crucial part of an organisation developing the capabilities of an “analytical competitor” which I will explore in my next Methods post.

Moneyball: Twenty Years On – Part Three

Executive Summary

  • Moneyball is principally a baseball story of using data analytics to support player recruitment
  • But the message is much more general on how to use data analytics as an evidence-based approach to managing sporting performance as part of a David strategy to compete effectively against teams with much greater economic power
  • The last twenty years have seen the generalisation of Moneyball both in its transferability to other team sports and its applicability beyond player recruitment to all other aspects of the coaching function particularly tactical analysis
  • There are two key requirements for the effective use of data analytics to manage sporting performance: (1) there must be buy-in to the usefulness of data analytics at all levels; and (2) the analyst must be able to understand the coaching problem from the perspective of the coaches, translate that into an analytical problem, and then translate the results of the data analysis into actionable insights for the coaches

Moneyball is principally a baseball story of using data analytics to support player recruitment. But the message is much more general on how to use data analytics as an evidence-based approach to managing sporting performance as part of a David strategy to compete effectively against teams with much greater economic power. My interest has been in generalising Moneyball both in its transferability to other team sports and its applicability beyond player recruitment to all other aspects of the coaching function particularly tactical analysis.

              The most obvious transferability of Moneyball is to other striking-and-fielding sports, particularly cricket. And indeed cricket is experiencing an analytics revolution akin to that in baseball stimulated in part by the explosive growth of the T20 format in the last 20 years especially the formation of the Indian Premier League (IPL). Intriguingly, Billy Beane himself is now involved with the Rajasthan Royals in the IPL. Cricket analytics is an area in which I am now taking an active interest and on which I intend to post regularly in the coming months after my visit to the Jio Institute in Mumbai.

              My primary interest in the transferability and applicability of Moneyball has been with what I call the “invasion-territorial” team sports that in one way or another seek to emulate the battlefield where the aim is to invade enemy territory to score by crossing a defended line or getting the ball into a defended net. The various codes of football – soccer, rugby, gridiron and Aussie Rules – as well as basketball and hockey are all invasion-territorial team sports. (Note: hereafter I will use “football” to refer to “soccer” and add the appropriate additional descriptor when discussing other codes of football.) Unlike the striking-and-fielding sports where the essence of the sport is the one-on-one contest between the batter and pitcher/bowler, the invasion-territorial team sports involve the tactical coordination of players undertaking a multitude of different skills. So whereas the initial sabermetric revolution at its core was the search for better batting and pitching metrics, in the invasion-territorial team sports the starting point is to develop an appropriate analytical model to capture the complex structure of the tactical contest involving multiple players and multiple skills. The focus is on multivariate player and team performance rating systems. And that requires detailed data on on-the-field performance in these sports that only became available from the late 1990s onwards.

              When I started to model the transfer values of football players in the mid-90s, the only generally available performance metrics were appearances, scoring and disciplinary records. These worked pretty well in capturing the performance drivers of player valuations and the statistical models achieved goodness of fit of around 80%. I was only able to start developing a player and team performance rating system for football in the early 2000s after Opta published yearbooks covering the English Premier League (EPL) with season totals for over 30 metrics for every player who had appeared in the EPL in the four seasons, 1998/99 – 2001/02. It was this work that I was presenting at the University of Michigan in September 2003 when I first read Moneyball.

              My player valuation work had got me into the boardrooms and I had used the same basic approach to develop a wage benchmarking system for the Scottish Premier League. But getting into the inner sanctum of the football operation in clubs proved much more difficult. My first success was to be invited to an away day for the coaching and support staff at Bolton Wanderers in October 2004 where I gave a presentation on the implications of Moneyball for football. Bolton under their head coach Sam Allardyce had developed their own David strategy – a holistic approach to player management based on extensive use of sport science. I proposed an e-screening system of players as a first stage of the scouting process to allow a more targeted approach to the allocation of Bolton’s scarce scouting resources. Pleasingly, Bolton’s Performance Director thought it was a great concept; disappointingly he wanted it to be done internally. It was a story repeated several times with both EPL teams and sport data providers – interest in the ideas but no real engagement. I was asked to provide tactical analysis for one club on the reasons behind the decline in their away performances but I wasn’t invited to present and participate in the discussion of my findings. I was emailed later that my report had generated a useful discussion but I needed more specific feedback to be able to develop the work. It was a similar story with another EPL club interested in developing their player rating system. Again the intermediaries presented my findings and the feedback was positive on the concept but then set out the limitations which I had listed in my report, all related to the need to use more detailed data than that with which I had been provided. Analytics can only be effective when there is meaningful engagement between the analyst and the decision-maker.

              The breakthrough in football came from a totally unexpected source – Billy Beane himself. Billy had developed a passion for football (soccer) and the Oakland A’s ownership group had acquired the Earthquakes franchise in Major League Soccer (MLS). Billy had found out about my work in football via an Australian professor at Stanford, George Foster, a passionate follower of sport particularly rugby league. Billy invited me to visit Oakland and we struck up a friendship that lasts to this day. As an owner of a MLS franchise, Oakland had access to performance data on every MLS game and, to cut a long story short, Billy wanted to see if the Moneyball concept could be transferred to football. Over the period 2007-10 I produced over 80 reports analysing player and team performance, investigating the critical success factors (CSFs) for football, and developing a Value-for-Money metric to identify undervalued players. We established proof of concept but at that point the MLS was too small financially to offer sufficient returns to sustain the investment needed to develop analytics in a team. I turned again to the EPL but with the same lack of interest as I had encountered earlier. The interest in my work now came from outside football entirely – rugby league and rugby union.

               The first coach to take my work seriously enough to actually engage with me directly was Brian Smith, an Australian rugby league coach. I spent the summer of 2005 in Sydney as a visiting academic at UTS. I ran a one-day workshop for head coaches and CEOs from a number of leading teams mainly in rugby league and Aussie Rules football. One of the topics covered was Moneyball. Brian Smith was head coach of Parramatta Eels and had developed his own system for tracking player performance. Not surprisingly, he was also a Moneyball fan. Brian gave me access to his data and we had a very full debrief on the results when Brian and his coaching staff visited Leeds later that year. It was again rugby league that showed real interest in my work after I finished my collaboration with Billy Beane. I met with Phil Clarke and his brother, Andrew, who ran a sport data management company, The Sports Office. Phil was a retired international rugby league player who had played most of his career with his hometown team, Wigan. As well as The Sports Office, Phil’s other major involvement was with Sky Sports as one of the main presenters of their rugby league coverage. I worked with Phil in analysing a dataset he had compiled on every try scored in Super League in the 2009 season and we presented these results to an industry audience. Subsequently, I worked with Phil in developing the statistical analysis to support the Sky Sports coverage of rugby league including an in-game performance gauge that included a traffic-lights system for three KPIs – metres gained, line breaks and tackle success – as well as predicting what the points margin should be based on the KPIs.

              But Phil’s most important contribution to my development of analytics with teams was the introduction in March 2010 to Brendan Venter at Saracens in rugby union. Brendan was a retired South African international who had appeared as a replacement in the famous Mandela World Cup Final in 1995. He had taken over as the Director of Rugby at Saracens at the start of the 2009/10 season and instituted a far-reaching cultural change at the club, central to which was a more holistic approach to player welfare and a thorough-going evidence-based approach to coaching. Each of the coaches had developed a systematic performance review process for their own areas of responsibility and the metrics generated had become a key component of the match review process with the players. My initial role was to develop the review process so that team and player performance could be benchmarked against previous performances. A full set of KPIs were identified with a traffic-lights system to indicate excellent, satisfactory and poor performance levels.  This augmented match review process was introduced at the start of the 2010/11 season and coincided with Saracens winning the league title for the first time in their history. The following season I was asked by the coaches to extend the analytics approach to opposition analysis, and the sophistication of the systems continued to evolve over the five seasons that I spent at Saracens.

              I finished at Saracens at the end of the 2014/15 season although I have continued to collaborate with Brendan Venter on various projects in rugby union over the years. But just as my time with Saracens was ending, a new opportunity opened up to move back to football, again courtesy of Billy Beane. Billy had been contacted by Robert Eenhoorn, a former MLB player from the Netherlands, who is now the CEO of AZ Alkmaar in the Dutch Eredivisie. Billy had become an advisor to AZ Alkmaar and had suggested to Robert to get me involved in the development of AZ’s use of data analytics. AZ Alkmaar are a relatively small-town team that seek to compete with the Big Three in Dutch football (Ajax Amsterdam, PSV Eindhoven and Feyenoord) in a sustainable, financially prudent way. Like Billy, Robert understands sport as a contest and sport as a business. AZ has a history of being innovative, particularly in youth development with a high proportion of their first-team squad coming from their academy. I developed similar systems as I had at Saracens to support the first team with performance reviews and opposition analysis. It was a very successful collaboration which ended in the summer of 2019 with data analytics well integrated into AZ’s way of doing things.

              Twenty years on, the impact of Moneyball has been truly revolutionary. Data analytics is now an accepted part of the coaching function in most elite team sports. But teams vary in the effectiveness with which they employ data analytics, particularly in how well it is integrated into the scouting and coaching functions. There are still misperceptions about Moneyball, especially in regard to the extent to which data analytics is seen as a substitute for traditional scouting methods rather than being complementary. Ultimately an evidence-based approach is about using all available evidence effectively, not just quantitative data but also the qualitative expert evaluations of coaches and scouts. Data analytics is a process of interrogating all of the data.

So what are the lessons from my own experience of the transferability and applicability of Moneyball? I think that there are two key lessons. First, it is crucial that there is buy-in to the usefulness of data analytics at all levels. It is not just leadership buy-in. Yes, the head coach and performance director must promote an evidence-based culture but the coaches must also buy-in to the analytics approach for any meaningful impact on the way things actually get done. And, of course, players must buy-in to the credibility of the analysis if it is to influence their behaviour. Second, the analyst must be able to understand the coaching problem from the perspective of the coaches, translate that into an analytical problem, and then translate the results of the data analysis into actionable insights for the coaches. There will be little buy-in from the coaches if the analyst does not speak their language and does not respect their expertise and experience.

The Six Stages of the Analytics Process

Executive Summary

  • The analytics process can be broken down further into six distinct stages:  (1) Discovery; (2) Exploration; (3) Modelling; (4) Projection; (5) Actionable Insight; and (6) Monitoring
  • Always start the analytics process with the question: “What is the decision that will be impacted by the analysis?”
  • There are three principal pitfalls in deriving actionable insights from analytical models – generalisability, excluded-variable bias, and misinterpreting causation

The analytics process can be broken down further into six distinct stages:

  1. Discovery
  2. Exploration
  3. Modelling
  4. Projection
  5. Actionable Insight
  6. Monitoring

Figure 1: The Six Stages of the Analytics Process

Stage 1: Discovery

The discovery stage starts with a dialogue between the analyst and decision maker to ensure that the analyst understands the purpose of the project. Particular attention is paid to the specific decisions for which the project is intended to provide an evidential basis to support management decision making.

The starting point for all analytics projects is discovery. The Discovery stage involves a dialogue with the project sponsor to understand both Purpose (i.e. what is expected from the project?) and Context (i.e. what is already known?). The outcome of discovery is Framing the practical management problem facing the decision-maker as an analytical problem amenable to data analysis. It is crucial to ensure that the analytical problem is feasible given the available data.

Stage 2: Exploration

The exploration stage involves data preparation particularly checking the quality of the data and transforming the data if necessary. A key part of this exploration stage is the preliminary assessment of the basic properties of the data to decide on the appropriate analytical methods to be used in the modelling stage.

Having determined the purpose of the analytics project and sourced the relevant data in the initial Discovery stage, there is a need to gain a basic understanding of the properties of the data. This exploratory data analysis serves a number of ends:

  • It will help identify any problems in the quality of the data such as missing and suspect values.
  • It will provide an insight into the amount of information contained in the dataset (this will ultimately depend on the similarity and variability of the data).
  • If done effectively, exploratory data analysis will give clear guidance on how to proceed in the third Modelling stage.
  • It may provide advance warning of any potential statistical difficulties.

A dataset contains multiple observations of performance outcome and associated situational variables that attempt to capture information about the context of the performance. For the analysis of the dataset to produce actionable insights, there is both a similarity requirement and a variability requirement. The similarity requirement is that the dataset is structurally stable in the sense that it contains data on performance outcomes produced by a similar behaviour process across different entities (i.e. cross-sectional data) or across time (i.e. longitudinal data). The similarity requirement also requires that there is consistent measurement and categorisation of the outcome and situational variables. The variability requirement is that the dataset contains sufficient variability to allow analysis of changes in performance but without excessive variability that would raise doubts about the validity of treating the dataset as structurally stable.

Stage 3: Modelling

The modelling stage involves the construction of a simplified, purpose-led, data-based representation of the specific aspect of real-world behaviour on which the analytics project will focus.

The Modelling stage involves the use of statistical analysis to construct an analytical model of the specific aspect of real-world behaviour with which the analytics project is concerned. The analytical model is a simplified, purpose-led, data-based representation of the real-world problem situation.

  • Purpose-led: model design and choice of modelling techniques are driven by the analytical purpose (i.e. the management decision to be impacted by the analysis)
  • Simplified representation: models necessarily involve abstraction with only relevant, systematic features of the real-world decision situation included in the model
  • Data-based: modelling is the search for congruent models that best fit the available data and capture all of the systematic aspects of performance

The very nature of an analytical model creates a number of potential pitfalls which can lead to: (i) misinterpretation of the results of the data analysis; and (ii) misleading inferences as regards action recommendations. There are three principal pitfalls:

  • Generalisability: analytical models are based on a limited sample of data but actionable insights require that the results of the data analysis are generalisable to other similar contexts
  • Excluded-variable bias: analytical models are simplifications of reality that only focus on a limited number of variables but the reliability of the actionable insights demands that all relevant, systematic drivers of the performance outcomes are included otherwise the results may be statistically biased and misleading
  • Misinterpreting causation: analytical models are purpose-led so there is a necessity that the model captures causal relationships that allow for interventions to resolve practical problems and improve performance but statistical analysis can only identify associations; causation is ultimately a matter of interpretation

It is important to undertake diagnostic testing to try to avoid these pitfalls.
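
Excluded-variable bias in particular is easy to demonstrate with a small simulation. The sketch below (synthetic data, hypothetical variable names) generates an outcome driven by two correlated drivers and shows how omitting one of them distorts the estimated coefficient on the other.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 500

# Two correlated performance drivers; the true effect of each on the outcome is 1.0
driver_a = rng.normal(size=n)
driver_b = 0.7 * driver_a + rng.normal(scale=0.7, size=n)   # shares information with driver_a
outcome = 1.0 * driver_a + 1.0 * driver_b + rng.normal(size=n)

full = sm.OLS(outcome, sm.add_constant(np.column_stack([driver_a, driver_b]))).fit()
omit = sm.OLS(outcome, sm.add_constant(driver_a)).fit()      # driver_b excluded

print("coefficient on driver_a, both drivers included:", round(full.params[1], 2))
print("coefficient on driver_a, driver_b excluded:     ", round(omit.params[1], 2))
# Excluding a relevant, correlated variable biases the remaining coefficient
# (here it absorbs part of driver_b's effect), which can mislead any intervention.
```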

Stage 4: Projection

The projection stage involves using the estimated models developed in the modelling stage to answer what-if questions regarding the possible consequences of alternative interventions under different scenarios. It also involves forecasting future outcomes based on current trends.

Having constructed a simplified, purpose-led model of the business problem in the Modelling stage, the Projection stage involves using this model to answer what-if questions regarding the possible consequences of alternative interventions under different scenarios. The use of forecasting techniques to project future outcomes based on current trends is a key aspect of the Projection stage.

There are two broad types of forecasting methods:

  • Quantitative (or statistical) methods of forecasting e.g. univariate time-series models; causal models; Monte Carlo simulations
  • Qualitative methods e.g. Delphi method of asking a panel of experts; market research; opinion polls
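
As a concrete illustration of the quantitative route listed above, the sketch below runs a simple Monte Carlo projection: it takes a hypothetical estimated model from the Modelling stage and simulates the distribution of outcomes under alternative what-if intervention levels. All numbers are illustrative assumptions, not outputs of any real model.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical estimated model from the Modelling stage: outcome = a + b*x + noise
intercept, slope, resid_sd = 20.0, 1.8, 4.0

def simulate_outcomes(x_intervention: float, n_sims: int = 10_000) -> np.ndarray:
    """Monte Carlo projection of outcomes for a given intervention level."""
    return intercept + slope * x_intervention + rng.normal(0, resid_sd, n_sims)

for x in (10, 12, 15):     # alternative what-if intervention levels
    sims = simulate_outcomes(x)
    lo, hi = np.percentile(sims, [5, 95])
    print(f"x = {x}: expected outcome = {sims.mean():.1f}, 90% range = [{lo:.1f}, {hi:.1f}]")
```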

Stage 5: Actionable insight

During this stage the analyst presents an evaluation of the alternative possible interventions and makes recommendations to the decision maker.

Presentations and business reports should be designed to be appropriate for the specific audience for which they are intended. A business report is typically structured into six main parts: Executive Summary; Introduction; Main Report; Conclusions; Recommendations; Appendices. Data visualisation can be a very effective communication tool in presentations and business reports and is likely to be much more engaging than a set of bullet points, but care should be taken to avoid distorting or obfuscating the patterns in the data. Effective presentations must have a clear purpose and be well planned and well rehearsed.

Stage 6: Monitoring

The Monitoring stage involves tracking the project Key Performance Indicators (KPIs) during and after implementation.

The implementation plans for projects should, if possible, have decision points built into them. These decision points provide the option to alter the planned intervention if there is any indication that there have been structural changes in the situation subsequent to the original decision. Hence it is important to track the project KPIs during and after implementation to ensure that targeted improvements in performance are achieved and continue to be achieved. Remember data analytics does not end with recommendations for action. Actionable insights should always include recommendations on how the impact of the intervention on performance will be monitored going forward. Dashboards can be a very effective visualisation for monitoring performance.
