The original idea for the “Practical Analysis” blog series came from a seemingly simple question: “What is analysis?” Answering that question took me on a fascinating journey from Florence Nightingale’s work to improve public health in the mid-19th century to the most recent developments in the fields of machine learning and artificial intelligence.

Though I’d found some interesting anecdotes and instructive examples, something still felt missing: a real answer to the original question. Maybe I needed to ask different questions that would yield more useful answers, such as: What are the essential tools that 21st-century analysts should have in their toolbox? Where did they come from? What was the thinking behind them? And ultimately, why do they matter today?

So it was back to the library! And after some digging, I settled on three distinct, though interrelated, areas that I believe represent the foundation of knowledge upon which much of what we do as analysts rests:

  • Exploratory Data Analysis
  • Visualization
  • Variation over time

Each of these areas has evolved over decades and withstood the test of time to make its way into common use in the era of pervasive computing. My goal in this post is to provide an overall summary and show how these ideas fit together in practice. Then in subsequent posts, I’ll delve into more detail and provide some useful references so you can explore further yourself.

Exploratory Data Analysis

Exploratory Data Analysis was the brainchild of John Tukey – one of the most brilliant mathematicians of the 20th century. Both a prominent researcher at Bell Labs and a professor at Princeton, Tukey clearly knew his way around both theoretical mathematics and its practical application in statistics. He came to realize that it was all too easy to apply the wrong tools to a problem, or to draw erroneous conclusions from statistical results.

Up to the 1970s, most of the emphasis in the field of statistics had been on “confirmatory statistics” – techniques used to prove a hypothesis based on observational data that had been collected specifically for that purpose. But with the advent of computer technology an abundance of data was becoming available that could be used to discover new, potential hypotheses.

The challenge was that you needed a starting place. For example, you’d need to answer questions like: What is the shape of the data in a new data set? Does it conform to the assumptions required by confirmatory techniques, like a normal distribution? What is the impact of outliers? Answering those questions led Tukey to a set of theories and techniques that help in understanding data sets well enough to appropriately apply the tools of confirmatory data analysis.
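Two of the best-known tools to come out of this work are the five-number summary and Tukey’s “fences” for flagging outliers. As a minimal sketch of that kind of exploratory first pass (the function names here are mine, chosen for illustration, not Tukey’s):

```python
import statistics

def five_number_summary(data):
    """Tukey's five-number summary: min, Q1, median, Q3, max."""
    q1, med, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, med, q3, max(data)

def tukey_outliers(data, k=1.5):
    """Flag points beyond k * IQR outside the quartiles (Tukey's fences)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

data = [2, 3, 4, 5, 6, 7, 8, 9, 100]
print(five_number_summary(data))
print(tukey_outliers(data))  # the 100 sits far outside the fences
```

A glance at these few numbers already answers the questions above: the large gap between Q3 and the maximum shows the data is skewed by an outlier, which would distort any confirmatory technique that assumes a normal distribution.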

Visualization

One thing Exploratory Data Analysis made clear was that being able to visualize data aided greatly in understanding it, and Tukey emphasized this in his own work. At Princeton, Tukey co-taught a course with Edward Tufte, who had also become interested in visualization. Tufte went on to write what is widely considered the seminal text on harnessing the power of pictures to communicate information: The Visual Display of Quantitative Information, which he published himself to allow for more creative freedom – and lots of illustrations. It explains, in detail, the psychology and physiology behind why visual representations work so well for us humans.

In his book, Tufte draws on examples of time-proven techniques and also sets the stage for how computers could advance the interpretation of information in the future. The first edition was published right around the time that computing was becoming more accessible, with mini-computers and graphical workstations in the commercial space, and personal computers for the masses. Automated forms of visualization were all of a sudden something that was not only possible, but also practical and accessible. And much of what was to come was inspired by Tufte’s work.

Variation

Finally, but no less important, is the topic of variation. Many things in nature are fairly predictable, but always with some degree of uncertainty. Understanding the nature of that uncertainty yields powerful insights into what is “expected” and what is “unusual”.

While working in a Western Electric telephone manufacturing plant in the 1920s, Walter Shewhart recognized the utility of measuring variation. By observing patterns in variation over time, he was able to derive a set of rules that helped to determine when manufacturing processes were beginning to exceed acceptable tolerances. His work evolved into the field of industrial engineering with the specific set of practices coming to be known as statistical process control.
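The core of Shewhart’s approach can be sketched in a few lines: estimate the process mean and standard deviation from in-control measurements, set control limits three standard deviations either side of the mean, and flag any new measurement that falls outside them. This is a simplified illustration of the idea (real control charts use rational subgroups and additional run rules), with function names of my own choosing:

```python
import statistics

def control_limits(samples):
    """Shewhart-style limits: process mean plus/minus three standard deviations."""
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)
    return mean - 3 * sd, mean + 3 * sd

def out_of_control(samples, new_points):
    """Flag new measurements that fall outside the control limits."""
    lo, hi = control_limits(samples)
    return [x for x in new_points if x < lo or x > hi]

# Baseline measurements from a process known to be in control
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.2, 9.8]
print(control_limits(baseline))
print(out_of_control(baseline, [10.1, 11.0, 9.0]))
```

Points inside the limits reflect ordinary, expected variation; points outside signal that something in the process has likely changed and is worth investigating.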

W. Edwards Deming advanced Shewhart’s work by applying it on a larger scale to manufacturing processes in post-World War II Japan. The power of statistical process control became clear as the quality of Japanese-produced goods improved steadily, ultimately challenging the U.S. and the rest of the world. In fact, the U.S. auto manufacturing industry, which had dismissed Deming’s ideas, came close to wholesale collapse, largely as a result of Japanese competition, requiring a “bail out” from the U.S. government in the late 1970s.

With hard-learned lessons behind them, U.S. manufacturers ultimately embraced programs such as Six Sigma and Lean, which have their roots in statistical process control. This disciplined approach continues today and is considered fundamental to ensuring quality in manufacturing the world over.

And the Journey Continues

The combination of techniques emerging from exploratory data analysis, visualization, and the measurement of variation provides much of the basis for what is done today across just about every industry and endeavor that involves data. And though these techniques may be deeply embedded in computer systems and applications, knowing how to apply them to new problems to facilitate data-driven decisions will be essential to the value analysts provide as technology continues to advance.

Next up: the fascinating story behind Exploratory Data Analysis!
