
Data Science: data without the science

There’s a heavy dependence on data and statistical analysis in business and in our lives. People are biased; they lie, they make assumptions, and they have been proven time and again to make irrational decisions. But data doesn’t lie…or does it? We’re in a digital renaissance, and times are changing faster than we can possibly keep up with, so it’s important to get this right. I believe in the awesome power of statistics, but it’s easy to see how the human touch, or lack thereof, can skew results.

An algorithm is basically a big, complicated IF/THEN logic statement (retrieve junior high math class from the memory banks). If A and B, then C. If you’re young (A) and broke (B), then here is an advertisement for a Happy Hour (C). Everyone has had to fill out a resume profile that will never be seen by a human being; we log on to Facebook and see ads geared specifically toward our demographic information, and we hear statistics in every aspect of our lives. The irony is that an algorithm can only aggregate the information you give it, and it gets things wrong all the time, sometimes tragically so. You may have extensive experience in an industry, but because your resume didn’t have a buzzword, it will never make it past the online portal. Thanks, Oracle. You may be a childless 30-year-old woman, but Facebook keeps showing you ads for baby strollers simply because you liked one of your friend’s pictures with an identifying hashtag. Thanks, Facebook. You might be a young white man without the use of your legs, but because you watched one video on YouTube about mixed martial arts you are now bombarded with ads for gym Groupons. Thanks, YouTube. I think you get my point. Why, then, if data gets us wrong so easily, do we trust it so much?
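
To make the IF/THEN idea concrete, here is a toy sketch of that kind of targeting logic. The profile fields, thresholds, and ad copy are all invented for illustration; this is not any platform’s actual targeting code.

```python
# A toy version of the IF/THEN logic described above. The profile fields,
# thresholds, and ad copy are invented for illustration only.

def pick_ad(profile):
    """Return an ad based on crude demographic rules."""
    # If you're young (A) and broke (B), then here is a Happy Hour ad (C).
    if profile.get("age", 99) < 30 and profile.get("income", 0) < 30_000:
        return "Happy Hour tonight!"
    # Liked one friend's stroller photo? You must be a parent.
    if "#stroller" in profile.get("liked_hashtags", []):
        return "Baby strollers, 20% off"
    # Watched one MMA video? You must want a gym membership.
    if "mma_highlights" in profile.get("watched_videos", []):
        return "Gym Groupon: half off"
    return "Generic ad"

print(pick_ad({"age": 27, "income": 24_000}))                 # Happy Hour tonight!
print(pick_ad({"age": 30, "liked_hashtags": ["#stroller"]}))  # Baby strollers, 20% off
```

Each rule looks perfectly reasonable in isolation; the trouble starts when the inputs are thin and the assumptions behind them go unexamined.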

Inferring causation from statistics is a huge part of analysis, but it, too, goes epically wrong sometimes. A few examples of statistical causation gone bananas:

The age of Miss America is 87% correlated with murders by steam, hot vapors, and hot objects. Statistically, shouldn’t a change in one predict a change in the other? Not at all.

[chart: age of Miss America vs. murders by steam, hot vapors, and hot objects]

Crude oil imports from Norway and drivers killed in collisions with railway trains show a 95% correlation…

[chart: crude oil imports from Norway vs. drivers killed in collisions with railway trains]

The consumption of cheese and the number of people who died by becoming tangled in their bedsheets also show a 95% correlation…

[chart: cheese consumption vs. people who died by becoming tangled in their bedsheets]

Economists have a name for results like these: “spurious correlations.” They go to show how complicated and out of control a statistical model can become without some finesse. Any human can tell that these correlations are nonsense, but a naive reading of the numbers would declare overwhelming evidence of causation. The only thing keeping these graphs out of a business presentation is common sense. That’s where the human touch comes in. When building a statistical model or machine-learning algorithm, we must control for certain variables that are either completely unknown or assumed to be identical throughout the population.  Like in Jurassic Park, when they use frog DNA to fill in the gene-sequence gaps; in the absence of complete information, we do the best we can.  Referring to my previous examples: Facebook is assuming that a 30-year-old woman is a mother, and YouTube is assuming that a young man watching MMA would want to join a gym.  The underlying assumptions need to be as unbiased as possible, or not exist at all, for the statistical model to work properly. Additionally, common sense needs to be applied when interpreting the results and determining market causation before the information is delivered.
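
To see how easily this happens, here is a minimal sketch in plain Python, using made-up data rather than the actual Miss America or cheese figures: two completely unrelated trends, simulated as independent random walks, correlate strongly a surprising amount of the time.

```python
# A minimal sketch of how "spurious correlations" arise: two completely
# unrelated trends, simulated as independent random walks, often look
# strongly correlated over a short window of "years".
import random

def random_walk(n, seed):
    rng = random.Random(seed)
    value, series = 0.0, []
    for _ in range(n):
        value += rng.gauss(0, 1)   # each step is pure, independent noise
        series.append(value)
    return series

def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    sd_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (sd_a * sd_b)

# Count how often two unrelated 10-"year" trends correlate above 80%.
strong, trials = 0, 1_000
for t in range(trials):
    a = random_walk(10, seed=2 * t)
    b = random_walk(10, seed=2 * t + 1)
    if abs(correlation(a, b)) > 0.8:
        strong += 1
print(f"{100 * strong / trials:.0f}% of unrelated trend pairs correlate above 0.8")
```

Trending series like these are the classic source of spurious correlation: two lines only have to drift in roughly the same (or opposite) direction for the math to look meaningful.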

This TED Radio Hour podcast gives an excellent example of building algorithms that end up biased against women and African Americans.  Algorithms only know what we teach them through internal historical information.  So, if only men have historically been successful within a certain company (because of sexism), programming a resume-bot to surface only the people most likely to be successful will unintentionally filter out more women than men.  If, historically, the successful people had no employment gaps on their resumes and had X, Y, and Z skills (simply by happenstance, because the economy was good prior to 2007), then people who experienced prolonged unemployment during the recession and do not have X, Y, and Z will also be unintentionally filtered out.  Common sense tells us that these people should not be immediately disqualified.  Cyclical causality is an inevitable byproduct of human programming and of using historical information to build automation; we must be careful not to simply automate the status quo.
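
Here is a minimal sketch of that resume-bot problem. The candidate records and the “learned” rule are invented for illustration, and no real screening product works exactly this way, but the mechanism is the same: train only on who happened to succeed in the past, and you automate the past.

```python
# A toy resume-bot that learns its filter purely from who succeeded
# historically. The data and the rule are invented for illustration.

# Historical hires: everyone who "succeeded" happened to have no employment
# gap, simply because the economy was good before 2007.
historical_hires = [
    {"has_gap": False, "skills": {"X", "Y", "Z"}, "succeeded": True},
    {"has_gap": False, "skills": {"X", "Y"},      "succeeded": True},
    {"has_gap": False, "skills": {"X", "Z"},      "succeeded": False},
]

# "Training": keep only the traits shared by every past success.
successes = [h for h in historical_hires if h["succeeded"]]
learned_rule = {
    "require_no_gap": all(not h["has_gap"] for h in successes),
    "required_skills": set.intersection(*(h["skills"] for h in successes)),
}

def passes_filter(candidate):
    if learned_rule["require_no_gap"] and candidate["has_gap"]:
        return False
    return learned_rule["required_skills"] <= candidate["skills"]

# A fully qualified candidate who was laid off during the recession:
recession_candidate = {"has_gap": True, "skills": {"X", "Y", "Z"}}
print(passes_filter(recession_candidate))  # False: rejected only for the gap
```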

Dr. Cathy O’Neil, in her book “Weapons of Math Destruction”, refers to data science as data without the science.  Science is repeatable; it uses the scientific method, it is peer-reviewed, and it offers evidence supporting its conclusions. Data science offers no such infrastructure. It is all about who owns the data, who is interpreting it, what assumptions are made when the data is incomplete, how the data is collected, what the owner’s motivations are, and how the results are applied. Making the process repeatable with identical results is virtually impossible, unless the same human biases and assumptions are applied to the same sample of data.

Math and science have set us free from the dark ages of superstition and religious dogma. Getting statistics right and holding data accountable is necessary if we are to grow as a digital civilization. Cloud computing, social media, and business intelligence techniques have exposed us to so many statistics that it’s hard to tell what’s what. Moving beyond unexamined assumptions, and building proper controls into the complexities of our data, is key to advancing into artificial intelligence (AI) and quantum computing.

Food for thought. Thank you for your time.

“73.6% of all statistics are made up on the spot” – a smart-ass

“There are three kinds of lies: lies, damn lies and statistics.” – Prime Minister of Great Britain, Benjamin Disraeli


Giving the Market a Push

The Great Recession has shown us that markets do find equilibrium, but that we sometimes need to give them a little push in the right direction or else we could spiral out of control again.  The Hoover Dam got money circulating and caused a multiplier effect that helped rejuvenate the economy; World War II (unfortunately) got men and women back to work and provided a much-needed sense of purpose; the fiscal stimulus of 2009 (arguably) saved the United States from calamity; and automatic stabilizers such as unemployment insurance help minimize the effect of lost wages on the overall economy. We have come a long way since 1929 and still have a long way to go. As an economist and a fiscal conservative, I disagree with an over-regulated market, price floors, and an overreaching government; but as recent economic events have shown, markets are too complicated and synthetic to adjust appropriately on their own.

One example of this comes from a recent article by the Brookings Institution that highlights a serious, yet ubiquitous, problem in predatory lending: payday loans. I have already blogged about other industries that thrive on predatory lending, and payday loans are subject to the same nefarious business practices.  The article summarizes a rule from the Consumer Financial Protection Bureau (CFPB) that changes the vetting process for payday lending from debt-to-income ratios to a more reasonable ability-to-pay standard for non-prime borrowers, and that limits the number of loans they are able to take out.  Will this industry change? Absolutely. Will market innovation create new opportunities to lend to non-prime borrowers? Absolutely.  This market is littered with moral hazards, so the only option is to keep a close eye on predatory financing.  George Akerlof and Robert Shiller did a great job bringing this kind of manipulation and deception to light (Phishing for Phools), showing that with every market comes an opportunity to take advantage.
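
For a rough sense of the difference between the two vetting approaches, here is a toy comparison. The 43% cutoff and the residual-income arithmetic are assumptions made for the example, not the CFPB’s actual underwriting formula.

```python
# A toy contrast between a debt-to-income screen and an ability-to-pay screen.
# The 43% cutoff and the residual-income math are illustrative assumptions.

def passes_debt_to_income(monthly_debt, monthly_income, max_ratio=0.43):
    """Old-style screen: total debt payments as a share of gross income."""
    return monthly_debt / monthly_income <= max_ratio

def passes_ability_to_pay(monthly_income, living_expenses, existing_debt, new_payment):
    """Ability-to-pay screen: can the borrower cover the new payment and still
    meet basic living expenses and existing obligations?"""
    residual = monthly_income - living_expenses - existing_debt
    return residual >= new_payment

# A non-prime borrower with thin margins:
income, expenses, debt, new_loan_payment = 2_200, 1_700, 300, 400
print(passes_debt_to_income(debt + new_loan_payment, income))           # True  (ratio ~0.32)
print(passes_ability_to_pay(income, expenses, debt, new_loan_payment))  # False (only $200 left)
```

A thin-margin borrower can look fine on a debt-to-income screen and still have no realistic way to cover the new payment, which is exactly the gap the ability-to-pay standard is meant to close.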

Another such push is the Department of Labor Wage and Hour Division’s expansion of the Fair Labor Standards Act overtime rule, which raised the salary threshold for overtime exemption from $23,660/year ($455/week) to $47,476/year ($913/week).  This should affect over 4 million people in the country and give a “meaningful boost to many workers’ wallets”.  I am of two minds about this legislation, and my fellow blogger Adam posted about it recently in THIS BLOG POST.  The economic forces behind price floors are tricky and sometimes self-defeating.  A higher nominal wage could overheat the market and cause a lower real wage; meaning that if employers are forced to pay people more, they will simply hire fewer people in an attempt to return to a balanced aggregate wage.  This is not a one-for-one exchange, and it often leads to lower aggregate wages and higher unemployment in the big picture.  Raising the overtime exemption threshold is a synthetic intervention in the labor market, but a necessary one nonetheless.  The unemployment rate has been falling and overall consumption is up; this should (if Keynes was right) cause an increase in wages and inflation, but it hasn’t.  A higher wage will give existing employees a much-needed break, and it’s time, but it could also create a disincentive to hire future employees.  I guess we’ll wait and see.
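
As a back-of-the-envelope illustration of what the new threshold means, here is a simplified sketch. The real rule also involves a duties test and more nuanced regular-rate calculations; the numbers below only capture the salary threshold and the standard 1.5x overtime multiplier.

```python
# A simplified sketch of the overtime threshold change. Real exemption status
# also depends on a duties test; this only checks the salary threshold.

OLD_WEEKLY_THRESHOLD = 455    # $23,660 / year
NEW_WEEKLY_THRESHOLD = 913    # $47,476 / year

def overtime_eligible(weekly_salary, threshold=NEW_WEEKLY_THRESHOLD):
    """Salaried workers below the threshold can no longer be treated as exempt."""
    return weekly_salary < threshold

def weekly_pay(weekly_salary, hours_worked):
    """Pay for an overtime-eligible worker: hours over 40 earn 1.5x the
    implied hourly rate (salary spread over a 40-hour week)."""
    hourly = weekly_salary / 40
    overtime_hours = max(0, hours_worked - 40)
    return weekly_salary + 1.5 * hourly * overtime_hours

# A salaried worker at $700/week ($36,400/year) putting in 50-hour weeks:
print(overtime_eligible(700, OLD_WEEKLY_THRESHOLD))  # False: exempt under the old rule
print(overtime_eligible(700))                        # True: covered under the new rule
print(weekly_pay(700, 50))                           # 962.5, versus 700 with no overtime
```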

Fiscal policy is a relatively new addition to economics, and we’re all trying to make sense of a post-recession world.  Obviously, letting markets adjust naturally doesn’t work, but how far do we push regulation to make course corrections? It’s easy to see the effects of fiscal policy with the luxury of hindsight, and, as an armchair quarterback, I could write a dissertation on changing policies after the Great Recession; but we are in uncharted waters and need to simply do the best we can with the information we have.  Hopefully we can get it right once in a while.