How to lie with statistics

By | November 30, 2014

Just before the Thanksgiving break, I attended a book club event (internal to Bing) where Scott Berkun recommended the book “How to Lie with Statistics”. I decided to use the break to read it.

Written by Darrell Huff in 1954, the book was extremely refreshing (its tone is illustrative of a mindset that we seem to have lost). Understanding the science behind statistics helps in multiple ways. First, it makes statistical bias (conscious or unconscious) easier to identify. Second, it helps you ask the right questions when presented with graphs or numbers. Lastly, it helps you present and interpret correctly the hypotheses made day to day. It is a super easy read and I strongly recommend it to anyone interested in refreshing their knowledge of statistical science (not math).

The sample with the built-in bias

Every sample has a built-in bias: which people receive the survey, which people answer it, the emotional impact of the interviewer on the interviewee, the way the questions are asked. So when looking at sampled data, ask yourself: what does the sample actually represent?

The well-chosen average

The word “average” can mean different things, and on skewed data they can differ widely, so check which one is being quoted:

  1. Mean: the sum of all values divided by the number of entries
  2. Median: the middle value once all entries are sorted
  3. Mode: the most frequent value among all entries
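
A minimal sketch of how far the three averages can drift apart, using Python's standard `statistics` module. The salary figures are invented for illustration and do not come from the book.

```python
import statistics

# Hypothetical yearly salaries in a small company: several modest pay checks
# and one very large one (figures invented for illustration).
salaries = [20_000, 20_000, 20_000, 25_000, 30_000, 35_000, 200_000]

mean = statistics.mean(salaries)      # sum of values / number of values
median = statistics.median(salaries)  # middle value once sorted
mode = statistics.mode(salaries)      # most frequent value

print(f"mean={mean:.0f} median={median} mode={mode}")
# mean=50000 median=25000 mode=20000
```

All three numbers are a legitimate “average salary”, yet the mean is two and a half times the mode.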

The little figures that are not there

Small sample groups make it easy to ‘inject’ luck into an experiment: repeat the experiment as many times as necessary and keep only the results you like. Tossing a penny ten times can land far from 50% heads, whereas tossing it a thousand times will land really close to 50%.
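
The penny example can be sketched as a small simulation. This is only an illustration: the seed, the number of repeats (50) and the helper name `share_of_heads` are my own assumptions, not something from the book.

```python
import random

def share_of_heads(tosses, rng):
    """Toss a fair coin `tosses` times and return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

rng = random.Random(1954)  # fixed seed so the run is repeatable

# Fifty small experiments of ten tosses each: the results scatter widely,
# so a motivated reporter can simply keep the most flattering one.
small_runs = [share_of_heads(10, rng) for _ in range(50)]
best_small = max(small_runs)

# One large experiment of a thousand tosses leaves far less room for luck.
large_run = share_of_heads(1000, rng)

print(f"ten-toss runs range from {min(small_runs):.0%} to {best_small:.0%}")
print(f"thousand-toss run: {large_run:.1%}")
```

Keeping only `best_small` and discarding the other 49 runs is exactly the trick described above.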

Test of significance: reporting how likely it is that a test figure represents a real result rather than something produced by chance. It is generally expressed as a probability. For most purposes, nothing poorer than the 5% level of significance is good enough, meaning that chance alone would produce such a result less than one time in twenty.
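
As a rough sketch of what such a test computes, here is an exact two-sided test for a fair coin. The function name `two_sided_p_value` and the 80% heads scenario are assumptions chosen for illustration; real studies would pick a test suited to their data.

```python
from math import comb

def two_sided_p_value(heads, tosses):
    """Exact probability, for a fair coin, of a result at least as far
    from tosses/2 as the one observed (in either direction)."""
    distance = abs(heads - tosses / 2)
    extreme = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - tosses / 2) >= distance
    )
    return extreme / 2 ** tosses

p_small = two_sided_p_value(8, 10)      # 80% heads in 10 tosses
p_large = two_sided_p_value(800, 1000)  # 80% heads in 1000 tosses

print(f"10 tosses:   p = {p_small:.3f}")   # above 0.05, could easily be luck
print(f"1000 tosses: p = {p_large:.3g}")   # far below 0.05
```

The same “80% heads” headline fails the 5% level with ten tosses and passes it overwhelmingly with a thousand.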

It is also necessary to know the deviation from the average (the range or spread of the values) to understand how much individual results might vary.
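
A small sketch of why the spread matters, using two invented data sets that share the same mean. `statistics.pstdev` is the population standard deviation from Python's standard library.

```python
import statistics

# Two hypothetical sets of measurements with the same average.
steady = [48, 49, 50, 51, 52]
erratic = [10, 30, 50, 70, 90]

steady_mean = statistics.mean(steady)
erratic_mean = statistics.mean(erratic)
steady_spread = statistics.pstdev(steady)    # population standard deviation
erratic_spread = statistics.pstdev(erratic)

print(f"steady:  mean={steady_mean} spread={steady_spread:.1f}")
print(f"erratic: mean={erratic_mean} spread={erratic_spread:.1f}")
```

Reporting only “the average is 50” hides the fact that the second set swings about twenty times more widely than the first.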

The gee-whiz graph

Representation can be misleading. Changing the proportion between the ordinate and the abscissa can have a huge effect on interpretation (the same 10% growth can look small or huge). Changing the amount of data displayed also affects interpretation (cut the graph down to the region where the line is, and suddenly the same growth looks like a 50% increase).
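
The effect of cutting the axis can be put into numbers. In this sketch, `apparent_growth` is a hypothetical helper that measures growth the way the eye does, from the bottom of the drawn axis. The values 20 and 22 and the axis start of 16 are invented.

```python
def apparent_growth(before, after, axis_start=0.0):
    """Growth as the eye reads it: ratio of the drawn heights above the
    axis baseline, expressed as a fractional increase."""
    if axis_start >= min(before, after):
        raise ValueError("axis must start below both values")
    return (after - axis_start) / (before - axis_start) - 1

honest = apparent_growth(20.0, 22.0)                      # axis starts at 0
truncated = apparent_growth(20.0, 22.0, axis_start=16.0)  # axis cut at 16

print(f"axis from 0:  looks like {honest:.0%} growth")
print(f"axis from 16: looks like {truncated:.0%} growth")
```

The data is identical in both cases. Only the baseline moved, and a 10% rise now reads as 50%.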

The one-dimensional picture

When numbers are represented by pictures (for example, a money bag representing an amount of money), the picture is not perceived in one dimension. People perceive it in three dimensions instead. Doubling the height of a money bag also doubles its width, so it covers four times the area on the page and suggests eight times the volume.
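
The arithmetic behind the money bag, as a tiny sketch (the helper name `perceived_scale` is my own):

```python
def perceived_scale(height_ratio):
    """Return how much larger a picture looks in area and in implied volume
    when it is scaled uniformly by `height_ratio`."""
    area = height_ratio ** 2    # width and height both grow
    volume = height_ratio ** 3  # the implied depth grows as well
    return area, volume

area, volume = perceived_scale(2)
print(f"twice as tall: {area}x the area, {volume}x the volume")
# twice as tall: 4x the area, 8x the volume
```

A figure that merely doubled is therefore drawn as something that looks four to eight times bigger.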

The semiattached figure

“If you can’t prove what you want to prove, demonstrate something else and pretend that they are the same thing”. There are often many ways of expressing the same figure: a 1% return on sales, a 15% return on investment, a $10M profit, a 40% increase in profit (compared to ten years ago), or a 60% decrease in profit (compared to last year).
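
To see how one profit figure supports several different stories, here is a sketch with an entirely made-up company. All amounts are in millions of dollars and are assumptions for illustration; they do not reproduce every percentage quoted above.

```python
def pct_change(new, old):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100

# Hypothetical figures, in millions of dollars.
profit = 10.0
sales = 1000.0
investment = 80.0
profit_last_year = 25.0
profit_ten_years_ago = 8.0

views = {
    "return on sales (%)": profit / sales * 100,
    "return on investment (%)": profit / investment * 100,
    "change vs last year (%)": pct_change(profit, profit_last_year),
    "change vs ten years ago (%)": pct_change(profit, profit_ten_years_ago),
}

for label, value in views.items():
    print(f"{label:30s} {value:8.1f}")
```

Every line describes the same $10M, so the choice of which one to publish tells you more about the publisher than about the company.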

Post hoc rides again

The post hoc fallacy: if B follows A, then A must have caused B. However, B might be causing A, or the two might not be related at all. There are different types of correlation: correlation produced by chance (which cannot easily be reproduced); covariation (a relationship exists, but it is unclear which variable is the cause and which the effect); and real correlation where neither variable has any effect on the other (poor grades among cigarette smokers). In all these cases, the cause-and-effect nature of the relationship is only a matter of speculation.
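
A sketch of the last case, where two variables correlate strongly only because a third factor drives both. The data is invented (children's shoe size and vocabulary, both driven by age), and `pearson` is a plain implementation of the Pearson correlation coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical children aged 6 to 15. Age is the hidden factor: it drives
# both shoe size and vocabulary, which have no effect on each other.
age = list(range(6, 16))
shoe_size = [a * 1.5 + (0.5 if a % 2 else -0.5) for a in age]
vocabulary = [a * 800 + (300 if a % 3 == 0 else -300) for a in age]

r = pearson(shoe_size, vocabulary)
print(f"correlation between shoe size and vocabulary: {r:.2f}")
```

The coefficient comes out close to 1, yet buying bigger shoes will not teach anyone new words.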

How to statisticulate

Statisticulate: misinforming people by the use of statistical material.

  • Picking the convenient average: mean, median, or mode.
  • Using decimals to paint a false precision (on what is really a poor approximation).
  • Shifting the base of a percentage to distort reality (to offset a pay cut of 50%, you must get a raise of 100%).
  • Adding percentages together when they are not related.
  • Confusing percentages and percentage points (going from 3% to 6% is an increase of 3 percentage points, or a 100% increase).
  • Percentiles: a rank on a scale of one hundred, which says nothing about the size of the gaps between ranks.
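
The two percentage traps in the list above can be checked with a few lines of arithmetic. The salary of 100 is an arbitrary assumption, and `pct_change` is a small helper of my own.

```python
def pct_change(new, old):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100

# Shifting base: a 50% cut is measured against the old salary,
# the raise that undoes it is measured against the reduced one.
salary = 100.0
after_cut = salary * 0.5
raise_needed = pct_change(salary, after_cut)

# Percentage points vs percent: a rate going from 3% to 6%.
old_rate, new_rate = 3.0, 6.0
points = new_rate - old_rate
relative = pct_change(new_rate, old_rate)

print(f"raise needed after a 50% cut: {raise_needed:.0f}%")
print(f"3% -> 6%: +{points:.0f} points, +{relative:.0f}% relative")
```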

How to talk back to a statistic

Five questions to test a statistic:

  1. Who says so?
    1. Look for conscious bias (misstatement or ambiguous statement; selection of favorable data; a shift in units of measurement; using one year for a comparison and sliding to another, more favorable one; an improper measure, such as mean vs median vs mode).
    2. Look for unconscious bias.
    3. When an “OK name” is cited (a trustworthy one, like a doctor or a reputed university), make sure the conclusion came from that source, not just the data.
  2. How do you know?
    1. Look at the sample size: how many people from the pool answered?
    2. Watch out for reported correlations: are there enough cases for significance? Are the standard error and probable error reported?
  3. What’s missing?
  4. Did somebody change the subject?
  5. Does it make sense?

2 thoughts on “How to lie with statistics”

  1. Scott Berkun

    Glad you enjoyed the book – I wish more people who claimed to be data experts would read it 🙂
