The “Practical Analysis” blog series is dedicated to answering the questions: “So what is analysis anyway?” and “How can I apply it in my world?” The first installments focused on the fundamentals from which modern computer-aided analysis and decision support have evolved. They featured topics like visualization, exploratory data analysis, and statistical process control.

In the next posts, we’ll explore how what we now know as “data science” emerged from those beginnings.

Tukey’s impact on data science

My research into analysis kept pointing me in one direction, and I think that is an appropriate place to start this chapter. In 1962 John Tukey, one of the most prolific mathematicians and statisticians of the 20th century, wrote a provocative article in The Annals of Mathematical Statistics titled “The Future of Data Analysis.”

Tukey’s intent was to take the mathematical statistics community, of which he was a member, to task for their nearly exclusive focus on a highly theoretical approach that offered limited practical value. Much of the academic research in statistics to that point had been consumed with creating inferential models from relatively small data sets and understanding the limitations of their reliability.

Meanwhile, early computing technology was quickly evolving in ways that made ever larger datasets both possible and practical. For example, the introduction of punch card machines helped to automate the U.S. decennial census, leading to one of the first large, well-curated, and reliable datasets.

Part of Tukey’s argument was that data itself held the promise of illuminating many additional theoretical possibilities. Furthermore, this approach could spawn practical applications significantly beyond those that had arisen from the obsession with small data.

Tukey’s initial training as a chemist equipped him with the dual perspectives of rigorous theory and an evidence-based approach to applying theory to practice. In The Future of Data Analysis, he drew on the physical sciences for examples of how data analysis could evolve as a science of its own. Understanding the properties and patterns of data in general could help to facilitate discoveries across disciplines. He even went as far as to claim that data as a science could be among the most complex and challenging areas of research.

It’s interesting that, although Tukey has been credited with coining the computer-age terms “bit” and “software,” it would take more than a decade for “data science” to enter the technical vocabulary, even though the field quickly took shape more or less as he had predicted.

Innovations lead to modern data science

If you’d like to understand the emergence of data science from its roots in The Future of Data Analysis, I’d suggest reading the paper “50 Years of Data Science” by David Donoho. Donoho, a professor of statistics at Stanford and one of John Tukey’s many proteges, gave the lecture on which the paper is based during a conference celebrating the centennial of Tukey’s birth and his contributions to math, science, and society.

Donoho’s paper chronicles the innovations in both the academic and industrial realms that have led to today’s data science. One dynamic that likely accelerated these developments was that Tukey had a foot in both camps: he was the founding chairman of the statistics department at Princeton University as well as a senior researcher at Bell Labs, one of the foremost research institutions in the world at the time.

On the academic side, Tukey and his collaborators pioneered the field of exploratory data analysis while also refining existing statistical methods such as regression and analysis of variance. Simultaneously, his associates at Bell Labs were applying those theories in domains such as telecommunications networks and semiconductors while also developing the software technology that would serve as the foundation of modern machine learning. The S statistical language, a predecessor to the open source R language, was developed at Bell Labs by Tukey associate John Chambers. It has played an important role in the progression of software tools that have paved the way for applying advanced statistics to ever larger datasets.

A recurring theme in the work that Tukey undertook during his career is the impact of data and evidence-based thinking on issues that affect everyday life. These ranged from his role in investigating the methods used by Alfred Kinsey in his research into human sexuality, to the effects of pesticides on the environment, to the accuracy of Census Bureau population estimates and the reliability of predicting election results based on partial and rapidly changing information. These are all inspiring and instructive examples of how data-driven decisions can and do shape our world.

Inferential and predictive approaches to statistical modeling

Among the ongoing debates that 50 Years of Data Science highlights is the tug of war between the inferential and predictive approaches to statistical modeling. The goal of the inferential approach is to develop generalized, or generative, models that explain the transformation between a set of inputs and the results that reflect a set of outcomes.

Prediction, on the other hand, harnesses patterns in the actual data to project the likelihood of future results in an evolving series of events. While both techniques can be effectively applied to a wide range of problems, the predictive field has played a particularly influential role in the development of modern data science, providing many of the mechanisms that underlie today’s highly scalable machine learning and artificial intelligence systems.
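To make the distinction a little more concrete, here is a minimal sketch in Python. It is not taken from Donoho's paper; the synthetic data, the straight-line model, and the helper names `fit_line` and `mean_squared_error` are all assumptions chosen for illustration. The same fitted model is read in two ways: once for what its parameters say about the process that produced the data, and once for how well it predicts observations it has not seen.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(model, xs, ys):
    """Average squared gap between the model's predictions and the observed values."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): y = 2 + 3x plus random noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Hold back the last 50 observations so they play no part in fitting.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

model = fit_line(train_x, train_y)

# Inferential reading: what do the fitted parameters say about the process?
intercept, slope = model
print(f"estimated intercept={intercept:.2f}, slope={slope:.2f}")

# Predictive reading: how well does the model do on data it has not seen?
holdout_mse = mean_squared_error(model, test_x, test_y)
print(f"held-out mean squared error={holdout_mse:.2f}")
```

In this toy case the two readings agree because the data really do come from a straight line. On real data they can pull apart: a model that predicts well may be hard to interpret, and an interpretable model may predict poorly.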

And it is prediction that is really at the heart of what John Tukey envisioned in The Future of Data Analysis: using data to understand the phenomena that they represent and better realize the opportunities they embody.

50 Years of Data Science concludes with an examination of the Common Task Framework, a community-based, open source approach to developing progressively more effective predictive models through the use of shared datasets, well-defined goals, and objective evaluation of results. The Common Task Framework is yet another step toward realizing Tukey’s vision of treating data and analysis as legitimate and compelling science.
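The three ingredients named above (a shared dataset, a well-defined goal, and objective evaluation) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of any real competition: the synthetic dataset, the two entries, and the names `score`, `always_one`, and `make_threshold_entry` are all hypothetical.

```python
import random

# Shared dataset (synthetic, assumed for illustration): the label is 1 when
# x > 5, with roughly 10% of labels flipped to mimic noise.
rng = random.Random(42)


def make_example():
    x = rng.uniform(0, 10)
    label = int(x > 5)
    if rng.random() < 0.1:
        label = 1 - label
    return x, label


dataset = [make_example() for _ in range(1000)]
# Entrants see only `train`; `holdout` stays with the referee.
train, holdout = dataset[:800], dataset[800:]


def score(predict, examples):
    """Referee: the fraction of labels that a prediction rule gets right."""
    correct = sum(predict(x) == label for x, label in examples)
    return correct / len(examples)


# Hypothetical entry 1: ignore the input and always predict 1.
def always_one(x):
    return 1


# Hypothetical entry 2: choose the cut point with the best training accuracy.
def make_threshold_entry(training):
    candidates = [c / 2 for c in range(0, 21)]  # 0.0, 0.5, ..., 10.0
    best = max(candidates, key=lambda c: score(lambda x: int(x > c), training))
    return lambda x: int(x > best)


entries = {
    "always_one": always_one,
    "threshold": make_threshold_entry(train),
}

# Every entry is scored on the same held-out data with the same metric.
leaderboard = sorted(
    ((score(rule, holdout), name) for name, rule in entries.items()),
    reverse=True,
)
for accuracy, name in leaderboard:
    print(f"{name}: {accuracy:.3f}")
```

The point of the design is that every entry is judged on identical held-out data with an identical metric, so improvements can be compared directly rather than argued about.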

In subsequent posts I’ll explore the algorithms, including both inferential and predictive techniques, that trace the evolution of data science and its role in fostering increasingly pervasive and effective data-driven decision-making.