There’s a heavy dependence on statistical data analysis in business and in our lives. People are biased; they lie, they make assumptions, and they have been proven time and again to make irrational decisions. But data doesn’t lie…or does it? We’re in a digital renaissance, and times are changing faster than we can possibly keep up with, so it’s important to get this right. I believe in the awesome power of statistics, but it’s easy to see how the human touch, or the lack thereof, can skew results.
An algorithm is basically a big, complicated IF/THEN logic statement (retrieve junior high math class from the memory banks). If A and B, then C. If you’re young (A) and broke (B), then here is an advertisement for a Happy Hour (C). Everyone has had to fill out a resume profile that will never be seen by a human being; we log on to Facebook and see ads geared specifically toward our demographic information, and we hear statistics in every aspect of our lives. The irony is that statistics can only aggregate the information we give them, and they get it wrong all the time, sometimes tragically so. You may have extensive experience in an industry, but because your resume didn’t have a buzzword, it will never make it past the online portal. Thanks, Oracle. You may be a childless 30-year-old woman, but Facebook keeps showing you ads for baby strollers simply because you liked one of your friend’s pictures with an identifying hashtag. Thanks, Facebook. You might be a young white man without the use of your legs, but because you watched one video on YouTube about Mixed Martial Arts, you are now bombarded with ads for gym Groupons. Thanks, YouTube. I think you get my point. Why then, if data gets us wrong so easily, do we trust it so much?
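The "If A and B, then C" rule could be sketched in a few lines of code. This is purely illustrative; the attribute names and thresholds are invented, not anything a real ad platform uses.

```python
# A hypothetical sketch of the "If A and B, then C" targeting rule.
# The thresholds ("young" means under 30, "broke" means under $100)
# are invented for illustration.

def pick_ad(age: int, account_balance: float) -> str:
    """Return an ad based on crude demographic rules."""
    young = age < 30                   # A: "young"
    broke = account_balance < 100.00   # B: "broke"
    if young and broke:                # If A and B...
        return "Happy Hour: half-price drinks tonight!"  # ...then C
    return "Generic ad"

print(pick_ad(24, 37.50))   # the young-and-broke branch fires
print(pick_ad(45, 5000.0))  # everyone else gets the fallback
```

Notice the rule knows nothing about the person beyond two crude signals, which is exactly how a childless 30-year-old ends up with stroller ads.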
Inferring causation from statistics is a huge part of analysis, but it sometimes goes epically wrong. A few examples of statistical causation gone bananas:
The age of Miss America is 87% correlated with murders by steam, hot vapors, and hot objects. Statistically, shouldn’t you be able to predict a change in one from the other? Not at all.
Crude oil imports from Norway and drivers killed in collisions with railway trains show a 95% correlation…
The consumption of cheese and deaths from being tangled in bedsheets also show a 95% correlation…
Economists have a name for these results, calling them “spurious correlations,” but it goes to show how complicated and out of control some statistical models can become without some finesse. Any human can tell that these correlations are nonsense, but read naively through the fundamentals of Data Science, they appear to show overwhelming causation. The only thing keeping these graphs out of a business presentation is common sense. That’s where the human touch comes in. When building a statistical model or machine-learning algorithm, we must control for certain variables that are either completely unknown or assumed to be identical throughout the population. Like in Jurassic Park, when they use frog DNA to fill in the gene-sequence gaps: in the absence of complete information, we do the best we can. Referring to my previous examples: Facebook is assuming that a 30-year-old woman is a mother and that a young man watching MMA would want to join a gym. The underlying assumptions need to be as unbiased as possible, or not exist at all, for the statistical model to work properly. Additionally, common sense needs to be applied when interpreting the results and determining market causation before the information is delivered.
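To see how easily a spurious correlation arises, here is a minimal sketch: two series that merely both trend upward over time will show a sky-high Pearson correlation. The numbers below are invented to mimic the cheese-and-bedsheets example, not the actual published figures.

```python
# A minimal demonstration that two unrelated but trending series
# produce a high Pearson correlation. The data points are invented
# for illustration only.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Both series simply rise year over year.
cheese_lbs_per_capita = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1]
bedsheet_deaths       = [327, 456, 509, 497, 596, 573, 661, 741]

r = pearson(cheese_lbs_per_capita, bedsheet_deaths)
print(f"correlation: {r:.2f}")  # high, yet obviously not causal
```

The shared trend (time itself) is the hidden variable; control for it and the relationship evaporates.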
This TED Radio Hour podcast gives an excellent example of building algorithms that are ultimately biased against women and African Americans. Algorithms only know what we teach them through internal historical information. So, if only men have historically been successful within a certain company (because of sexism), then programming a resume-bot to surface only the people most likely to be successful will unintentionally filter out more women than men. If, historically, the successful people had no employment gaps and showed X, Y, and Z skills on their resumes (simply by happenstance, because the economy was good before 2007), then people who experienced prolonged unemployment during the recession and lack X, Y, and Z will also be unintentionally filtered out. Common sense tells us that these people should not be immediately disqualified. Cyclical causality is an inevitable byproduct of human programming, and when using historical information to build automation, we must be careful not to simply automate the status quo.
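The resume-bot trap can be sketched in a few lines. Everything here is hypothetical: a toy “filter” that naively requires every trait its past hires happened to share will carry any accident of history forward as a hard rule.

```python
# A hypothetical sketch of automating the status quo: a screening
# rule derived from biased history reproduces that history.
# All names, traits, and rules are invented for illustration.

def learn_filter(historical_hires):
    """Require every trait that all past successes happened to share."""
    required = set.intersection(*(set(h["traits"]) for h in historical_hires))
    return lambda resume: required <= set(resume["traits"])

# History where, by happenstance, every "success" had no employment gap.
history = [
    {"traits": {"python", "no_employment_gap"}},
    {"traits": {"sql", "python", "no_employment_gap"}},
]
screen = learn_filter(history)

# A strong candidate laid off during the recession is rejected anyway.
candidate = {"traits": {"python", "sql", "ten_years_experience"}}
print(screen(candidate))  # False: the gap, not the skills, decides
```

Nothing in the code is malicious; the bias lives entirely in the training history, which is exactly the podcast’s point.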
Dr. Cathy O’Neil, in her book “Weapons of Math Destruction,” refers to Data Science as data without the science. Science is repeatable; it uses the scientific method, it is peer-reviewed, and it offers evidence supporting its conclusions. Data science offers no such infrastructure. It is all about who owns the data, who is interpreting it, what assumptions are made when it is incomplete, how it is collected, what the owners’ motivations are, and how it is applied. Making the process repeatable with identical results is virtually impossible unless the same human biases and assumptions are applied to the same sample of data.
Math and science have set us free from the dark ages of superstition and religious dogma. Getting statistics right and holding data accountable is necessary if we are to grow as a digital civilization. Cloud computing, social media, and business-intelligence techniques have exposed us to so many statistics that it’s hard to tell what’s what. Moving beyond unexamined assumptions and building proper controls into the complexities of our data is key to advancing into Artificial Intelligence (AI) and quantum computing.
Food for thought. Thank you for your time.
“73.6% of all statistics are made up on the spot” – a smart-ass
“There are three kinds of lies: lies, damn lies and statistics.” – Prime Minister of Great Britain, Benjamin Disraeli