A few weeks ago, I attended the Real Business Intelligence Conference, which was hosted by Dresner Advisory Services on the campus of MIT. The show was packed with great presentations by speakers such as Dr. Theresa Johnson of Airbnb, Professor Thomas Malone of the MIT Sloan School of Management, and David Dadoun of the Aldo Group.
One of the more thought-provoking presentations was given by Cathy O’Neil, author of “Weapons of Math Destruction,” in which she argues that mathematical models and algorithms give the illusion of impartiality, but in actuality, many of them end up perpetuating stereotypes and inequality. Following Cathy’s presentation, I sat down to read her book. Here are some of my thoughts.
The dark side of big data
In her book, Cathy discusses several examples of models gone wrong. These models are being used in fields such as:
- Education: Some school districts now evaluate teacher performance with algorithms that compare student test scores against expected scores, and they weight these models over other measures such as in-person observations. When teachers question the models, they are denied access to the factors that go into them.
- Lending: Many loan decisions are being made based on models that factor in variables such as ZIP code. Failure to get a loan can result in lost access to education and resources, thereby perpetuating the cycle of poverty.
- Hiring: Some companies now use algorithms that predict work performance from personality traits such as extraversion, agreeableness, conscientiousness, neuroticism, and openness to ideas. There are several problems with this approach. Among them: 1) the models never incorporate actual job performance, so they get no feedback on their accuracy; and 2) research shows that personality tests are poor predictors of performance.
These are just a few of the examples Cathy uses in her book, but they illustrate some of the downsides of big data. Organizations typically don’t intend to use data in bad ways; however, it can often be difficult to distinguish between “good uses” and “bad uses” of data.
So what do you do when you are sitting down and trying to make use of the treasure trove of data you have in front of you? Here are some key things I think we all should consider.
1. Be aware of what data is going into your model
You might have a whole lot of data at your disposal, but is it right to use all of it? In my view, it’s important to consider your ultimate outcome and work backwards from there. Do you want certain pieces of data to factor into that decision? For example, if you are a student loan provider, do you want ZIP code to factor heavily into your model? If your ultimate goal is profit, perhaps you’d say yes. But if your goal is to provide access to education, your answer might be different. There’s some data out there that, even if I had access to it, I’d feel icky about using in most models.
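Working backwards from the outcome can be as simple as an explicit exclusion list that runs before any modeling. Here's a minimal sketch of that idea; the field names and records are hypothetical, not from any real lending system:

```python
# Hypothetical sketch: fields you've decided shouldn't drive the decision are
# stripped before they ever reach the model. Field names are made up.

SENSITIVE_FIELDS = {"zip_code", "neighborhood"}  # chosen based on your stated goal

def scrub(record: dict) -> dict:
    """Return a copy of the record with sensitive fields removed."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

applicant = {"income": 42000, "gpa": 3.4, "zip_code": "02139"}
features = scrub(applicant)
# The model now trains and scores only on the fields you deliberately kept.
print(sorted(features))  # ['gpa', 'income']
```

The point of making the list explicit is that it becomes a documented, reviewable decision rather than an accident of whatever columns happened to be in the dataset.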
2. Gain feedback on your model
One of the issues Cathy raises with the personality tests used in hiring is that the companies relying on them never gain feedback on the model. Were the people they hired actually better than those they passed over? There’s really no way for them to know.
Some models are bad. Some get stale. Just because you have data doesn’t mean you have to use it. Make sure you measure the outcomes your model produces, and incorporate that feedback to refine the model going forward.
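Closing the loop can start very simply: once real outcomes are known, score the model's past predictions against them and decide whether the model still earns its keep. A sketch, with illustrative names, numbers, and threshold:

```python
# Hypothetical sketch of a feedback check: compare what the model predicted
# with what actually happened. All values here are made up for illustration.

def outcome_accuracy(predictions, outcomes):
    """Fraction of past predictions that matched the observed outcome."""
    pairs = list(zip(predictions, outcomes))
    return sum(p == o for p, o in pairs) / len(pairs)

predicted_good_hire = [True, True, False, True]    # what the model said
actually_performed  = [True, False, False, False]  # what later reviews showed

acc = outcome_accuracy(predicted_good_hire, actually_performed)
print(f"accuracy: {acc:.0%}")  # accuracy: 50%
if acc < 0.7:  # the retrain-or-retire threshold is a judgment call
    print("model needs revisiting")
```

Even a crude check like this answers the question the hiring-test vendors never ask: were the people the model favored actually better?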
3. Be transparent
If you’re fired from your job, you naturally want to know why. Yet as Cathy writes, teachers who were let go because they performed poorly according to an algorithm never got an answer. When the teachers she spoke to asked which factors went into the model they were being judged by, their school districts didn’t know, and the company that developed the model would not divulge that information.
From my perspective, if you’re going to use the data, be upfront about it. “Here is what I am judging you on and this is why.” Then perhaps you can have an actual discussion about the results. That leads to my last point…
4. Use the data as a discussion point, not necessarily as a decision-maker
It’s true that good data leads to better decisions. But I would argue that the data should guide your decision, not be the ultimate arbiter. Take, for example, a scenario that many of our healthcare customers face. These customers use our software to measure physician performance by examining data such as DRG (diagnosis-related group), cost, average length of stay, mortality, readmissions, and much more, and then comparing the numbers to benchmarks.
When I have talked with customers about how they use the data, the best use cases involve hospitals treating the data as a starting point, sparking conversations with physicians to understand why certain anomalies are occurring. The data ends up not being a judgment but a reference point. From there, physicians and administrators can have productive discussions rather than pointing fingers.
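One way to operationalize "reference point, not judgment" is to have the software emit questions rather than verdicts: flag deviations from the benchmark and phrase each flag as something to discuss. A minimal sketch, with invented metrics, values, and tolerance:

```python
# Hypothetical sketch: surface deviations from a benchmark as discussion
# points, not as scores. Metric names and numbers are made up.

BENCHMARK = {"avg_length_of_stay": 4.2, "readmit_rate": 0.12}

def discussion_points(physician_stats, tolerance=0.25):
    """Flag metrics more than `tolerance` (25%) away from the benchmark."""
    flags = []
    for metric, bench in BENCHMARK.items():
        value = physician_stats[metric]
        if abs(value - bench) / bench > tolerance:
            flags.append(f"{metric}: {value} vs benchmark {bench} - why?")
    return flags

stats = {"avg_length_of_stay": 6.1, "readmit_rate": 0.11}
for question in discussion_points(stats):
    print(question)
```

Here only the length-of-stay figure gets flagged, and it arrives phrased as a question, which is exactly the framing that turns an anomaly into a conversation instead of an accusation.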
These customers also use the “diving” capability of our software (hence, the Diver name) to explore the data and answer questions that arise. Here, the data provided by the model serves more as a question than as the actual answer.
A big thanks to Cathy O’Neil for her interesting presentation and book – it gave me a lot to think about! What am I missing here? Any other thoughts on how we as data experts can combat bad uses of data? Let me know in the comments below.