Too much data and not enough useful information is one of the great paradoxes of the era of large-scale, pervasive computing. Improving that balance is a key challenge for the modern-day analyst, and it requires becoming familiar with the most appropriate tools for generating meaningful insights. Some of those tools are fairly basic yet extremely powerful, and you need to know about them.
This next chapter of the Practical Analysis blog series, beginning with this post, is dedicated to endowing you with that knowledge.
The rise of exploratory data analysis
Powerful software tools are a double-edged sword: it’s so easy to apply advanced analytical techniques that we may not give adequate thought to whether we’re using the ones that are best suited to a particular problem or data set.
The field of Exploratory Data Analysis (we’ll refer to it variously as “exploratory analysis” or “EDA” here) arose during the mid-20th century, before the emergence of large-scale computing, to address that problem. Much progress had been made to that point in the area of traditional or “confirmatory” statistics, leading to the popularity of techniques like multivariate regression and analysis of variance.
But there were a few problems. These approaches only function properly when data conforms to fairly stringent requirements; one prominent example is the assumption of a normal distribution. Furthermore, those techniques had traditionally been applied to what we would today consider smaller datasets that had been manually assembled from observations, in many cases to support hypotheses in traditional research based on the scientific method. As larger volumes of data became available from electronically generated sources, the temptation was to “throw” these techniques at the data without fully understanding the consequences. The results were often unreliable and misleading.
Improving data analysis
Correcting this misapplication of traditional statistics became something of an obsession for John Tukey and a group of colleagues in academia in the 1970s. Tukey is remembered as one of the most brilliant mathematicians of the 20th century. Not only did he found and chair the statistics department at Princeton, but he was also a highly respected researcher at Bell Labs, at the time among the leading research institutions in the world.
No ivory tower academic, Tukey was concerned about practically applying statistics in the real world. And from that point of view arose the field of exploratory analysis. Tukey observed, “It is important to understand what you ‘CAN DO’ before you learn to measure how ‘WELL’ you seem to have done it.” (His own emphasis, and somewhat quirky syntax!) In both cases, data provides the means. Confirmatory statistics supply techniques for “measuring well,” but they assume that you know what you “can do.” EDA helps with comprehending the “can do” part.
Which data is worth exploring?
Ever-evolving technology generates and collects increasingly vast volumes of data, from observations in nature (think of all the variables describing the weather) to automated sensors in just about everything imaginable, which has come to be known as the “internet of things.” These types of data provide opportunities to generate new hypotheses. However, in its most “raw” state, that data often doesn’t meet the criteria required by the more traditional confirmatory techniques. For that matter, it may not even be clear which data elements from potentially thousands of candidates will provide key insights. Enter EDA.
The premise of exploratory analysis is that raw data represents a starting point. But it will often not be particularly cooperative. Refusal to conform to a normal distribution is one problem, but there are others. The presence of outliers, in some cases extreme ones, may accurately reflect the nature of what’s being observed or, conversely, may be the result of sampling error or sensor malfunction. In either case, outliers may “pollute” the data, at least in terms of what is expected by traditional techniques.
But these are realities of real-world data. EDA introduced the concept of robust measures, which are not unduly influenced by these anomalies. Among the clearest examples of a robust measure is the median, which, unlike its close relative the mean, is resistant to extreme outliers, either high or low. If a data set contains even a single anomalous observation that is, say, orders of magnitude beyond what is reasonable, it will pull the mean in that direction and possibly lead to incorrect interpretation. The median, on the other hand, likely won’t budge. And that in turn yields a more sensible perspective on the data for an analyst trying to determine where to begin.
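To make the contrast between the mean and the median concrete, here is a minimal sketch in Python using only the standard library. The readings are invented for illustration (they are not from any real data set), and the helper name `summarize` is hypothetical.

```python
from statistics import mean, median

# Hypothetical sensor readings, made up purely for illustration.
readings = [12, 14, 15, 13, 16, 14, 15]

# The same readings plus a single anomalous value that is
# orders of magnitude larger than the rest.
readings_with_outlier = readings + [15000]


def summarize(values):
    """Return the (mean, median) pair for a list of numbers."""
    return mean(values), median(values)


clean_mean, clean_median = summarize(readings)
dirty_mean, dirty_median = summarize(readings_with_outlier)

print(f"without outlier: mean={clean_mean:.2f}, median={clean_median}")
print(f"with outlier:    mean={dirty_mean:.2f}, median={dirty_median}")
```

With these made-up numbers, the single outlier drags the mean from roughly 14.1 to roughly 1887.4, while the median only moves from 14 to 14.5.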
In the next post I’ll get into some of the more technical aspects of Exploratory Data Analysis.