story | Ryan Ma, Contributing Reporter
photo | Ryan Ma
“Why is a data science book like a jazz piece written by Bill Clinton’s Vice President? Both are filled with Al Gore Rhythms.” With a bad dad joke, Professor Richard D. De Veaux from Williams College kickstarted a lighthearted but informative Rector’s Tea on August 21, where he shared lessons on statistics that he had learned in his personal, professional, and academic life.
The talk, aptly titled “The Seven Deadly Sins of Data Science – and How to Avoid Them,” was held in the evening at Cendana Rector’s Commons. Among first-year students taking the Common Curriculum, Prof. De Veaux is known as an author of their Quantitative Reasoning (QR) textbook, “Stats: Data and Models.” As a statistician, his illustrious career has brought him to Stanford, Princeton, and the consulting industry. He has also patented several statistical methods, and he is currently the Vice-President of the American Statistical Association.
At the talk, Prof. De Veaux explored a crucial yet unsettling question – what could go wrong in statistical research? Given the ubiquity of big data in academia and industry, this question concerns all of us regardless of which field we are pursuing.
There is a difference between data scientists and statisticians, he said. The problem with data scientists is that they do not think enough about the problem they are solving. “What do statisticians bring to data science? They ask questions,” he said.
In other words, researchers should clearly understand the problem at hand before analyzing their data. Prof. De Veaux illustrated this using a case he encountered in his consulting career. A company had a high failure rate of 25% on its products and tried to investigate it using six months of production data. After a long period of laborious work, the company’s researchers seemed to have struck gold – three groups of well-structured data. When they eagerly presented this data to engineers, however, they were told that the company makes three different products. The analysis had merely rediscovered the product lines.
Moral of the story – always define and understand your problem before analyzing it quantitatively. As Prof. De Veaux wrote in the QR textbook, statistics questions are answered with sentences, not numbers.
Another pitfall raised by Prof. De Veaux is not taking the time to prepare one’s data. There is a Chinese saying that goes, “Sharpening one’s axe has never interfered with the cutting of wood.” The same goes in statistics.
Prof. De Veaux’s data mining students are often tested using a massive dataset from a real-world problem. The non-profit organization Paralyzed Veterans of America (PVA) would mail out millions of address labels as free gifts, urging the public to donate to veterans suffering from spinal cord injuries. With mail being less and less commonly used nowadays, the PVA sought to maximize the number of donations by predicting what kinds of people were most likely to donate.
To do so, they collected data on 200,000 prospective donors. The data, however, was full of inconsistent values. Real data, he said, is often fraught with errors like these. Students who jumped into analyzing the data without first cleaning it up usually ended up with flawed results. To make things worse, many of his students blindly presented distributions of a variable named “T-Code” without first finding out what it represented. It turned out, eventually, that “T-Code” values were simply identifiers for each person’s title, such as “Mr”, “Mdm”, or “Your Imperial Majesty”.
Lesson learned – always take the time to prepare your data before analyzing it. As we have learnt from the QR textbook, make a picture of your data so that you know what you are working with.
Lastly, Prof. De Veaux urged researchers to avoid blindly trusting their data. Just like information on the news, data is often prone to mistakes, manipulation, and misrepresentation. As a consultant, Prof. De Veaux once worked on a problem for American Express, which had released a new credit card called the “Delta Double Miles Card”. They then tasked him with finding out if the new card encouraged customers to spend more money.
His research showed that customers who received the card spent an average of $100 USD ($139 SGD) more than those who did not. This seemed like a statistically and financially significant result, but it later turned out that the difference was entirely caused by a single customer who spent $3 million USD ($4.17 million SGD) on the new card. The difference was spread over 30,000 customers, giving the impression that every customer spent $100 USD more on average.
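The arithmetic behind this story can be reproduced in a few lines. The sketch below is only an illustration, not Prof. De Veaux’s actual analysis: it assumes 30,000 customers in each group and a made-up typical spend of $500 USD per customer (the `BASELINE` value is hypothetical), with one cardholder spending an extra $3 million USD as in the talk.

```python
from statistics import mean, median

N_CUSTOMERS = 30_000        # group size, taken from the article
OUTLIER_SPEND = 3_000_000   # the single big spender's extra spend, in USD
BASELINE = 500              # hypothetical typical spend per customer, in USD

# Customers without the new card: everyone spends the baseline amount.
control = [BASELINE] * N_CUSTOMERS

# Customers with the new card: identical, except for one extreme spender.
card = [BASELINE] * (N_CUSTOMERS - 1) + [BASELINE + OUTLIER_SPEND]

# The mean is dragged up by 3,000,000 / 30,000 = 100 USD per customer.
mean_gap = mean(card) - mean(control)

# The median ignores the single outlier, so it shows no difference at all.
median_gap = median(card) - median(control)

print(f"Difference in means:   {mean_gap}")
print(f"Difference in medians: {median_gap}")
```

Under these assumptions, the means differ by exactly $100 USD while the medians are identical, which is why comparing medians or plotting the data first would have exposed the lone big spender.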
As students, we may be surprised that seasoned statisticians would make such rookie-level mistakes. After all, the vulnerability of mean values to extreme outliers would have been covered in any introductory statistics class. This story shows that data is hardly as objective as many imagine it to be.
One does not need to be a statistician to appreciate the importance of statistics. Whether you have set your mind on a Mathematical, Computational and Statistical Sciences degree or dread your twice-weekly QR lessons, it is clear that data analysis is essential to research in almost every field. Prof. De Veaux’s talk therefore encourages us to rethink the role of data in the liberal arts, and challenges us to question our biases in both research and our daily lives.