Simpson’s Paradox

It’s often said that you can prove anything with statistics. Simpson’s Paradox is one of the ideas that bolsters this view: it describes how the relationship between two things can be the opposite of what it first appears, because of how the cases are distributed across some sub-category. It is one of my favourite statistical artifacts and I wanted to share it with you.

Example

Admissions to UC Berkeley are one of the best-known examples.

The 1973 admissions figures for UC Berkeley seem to show a clear bias against women, who were admitted at a significantly lower rate than men.

[Table 1: 1973 Berkeley admission rates by gender, overall]

When the admissions figures are broken down by department, however, we see that women were admitted at a higher rate than men in four of the six departments, in one case substantially higher.

[Table 2: 1973 Berkeley admission rates by gender and department]

The reason this does not flow through to the overall figures is that men were substantially more likely to apply to departments with high admission rates, while women were more likely to apply to tougher or oversubscribed departments. Because the relationship between gender and choice of department is much stronger than the relationship between gender and admission rate within a department, we should conclude that the overall difference in admission rates by gender is due to the difference in where people applied, not to any bias against women in admissions.
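The arithmetic behind this can be checked with a short sketch. The counts below are an assumption on my part: they are the six-department figures as commonly circulated (for example in R's built-in UCBAdmissions dataset) and may not match the tables in this post exactly. The function names are made up for illustration.

```python
# (admitted, rejected) per department and gender.
# ASSUMPTION: figures as distributed in R's UCBAdmissions dataset,
# not copied from the tables in this post.
COUNTS = {
    "A": {"men": (512, 313), "women": (89, 19)},
    "B": {"men": (353, 207), "women": (17, 8)},
    "C": {"men": (120, 205), "women": (202, 391)},
    "D": {"men": (138, 279), "women": (131, 244)},
    "E": {"men": (53, 138), "women": (94, 299)},
    "F": {"men": (22, 351), "women": (24, 317)},
}


def rate(admitted, rejected):
    """Admission rate from admitted and rejected counts."""
    return admitted / (admitted + rejected)


def overall_rate(counts, group):
    """Admission rate for a group, pooling all departments together."""
    admitted = sum(dept[group][0] for dept in counts.values())
    rejected = sum(dept[group][1] for dept in counts.values())
    return rate(admitted, rejected)


def departments_favouring(counts, group, other):
    """Departments where `group` has a higher admission rate than `other`."""
    return [
        name
        for name, dept in counts.items()
        if rate(*dept[group]) > rate(*dept[other])
    ]


def share_applying(counts, group, departments):
    """Fraction of a group's applications that went to the given departments."""
    total = sum(sum(dept[group]) for dept in counts.values())
    chosen = sum(sum(counts[name][group]) for name in departments)
    return chosen / total


if __name__ == "__main__":
    for group in ("men", "women"):
        print(group, "overall rate:", round(overall_rate(COUNTS, group), 3))
        print(group, "share applying to A or B:",
              round(share_applying(COUNTS, group, ["A", "B"]), 3))
    print("women ahead in:", departments_favouring(COUNTS, "women", "men"))
```

With these counts the pooled rate comes out at roughly 45% for men against roughly 30% for women, even though women are ahead in departments A, B, D and F. The share figures show why: about half of the men's applications went to the two easiest departments, against well under a tenth of the women's.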

The Theory

The theory is pretty simple. If there is a strong relationship between variables A and B, and also between B and C, it can cloud or overwhelm a weaker relationship between A and C, making that relationship appear to run in the opposite direction.

Using the Berkeley example, there is a strong relationship between gender and department (A and B). There is also a strong relationship between department and admission rate (B and C). Because men applied to departments with higher admission rates, their overall admission rate is artificially inflated above that of women, even though women outperform men in most departments (reversing the weaker relationship between A and C).
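One way to write this down is as a weighted average. The notation here is my own rather than anything standard to this post: for a group g (men or women) and department d, let w be the share of that group's applications going to the department and r the admission rate the group gets there. The overall rate R for the group is then:

```latex
\[
R_g \;=\; \sum_{d} w_{g,d}\, r_{g,d},
\qquad \sum_{d} w_{g,d} = 1 .
\]
\[
r_{\text{women},d} > r_{\text{men},d} \ \text{for most } d
\quad\text{does not imply}\quad
R_{\text{women}} > R_{\text{men}} .
\]
```

The overall rate depends on the weights as much as on the department-level rates, so a group that puts most of its weight on departments with low r can end up with a lower R while still doing better department by department.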

Why it matters and how to guard against it

It matters because it can affect any decision made using statistics. To use a football analogy: if you are trying to decide between two strikers, you will often look at their goals scored. If striker B typically plays lower-quality opposition, and strikers score more against lower-quality opposition, this will artificially inflate B’s figures compared with A’s. By breaking performance down by another variable with a stronger relationship to both (quality of opposition), we can see past the paradox to the true relationship underneath and, hopefully, make better decisions.

This is the trick to guard against Simpson’s Paradox. If you think you see a relationship between two variables, A and C, check whether there is a third variable B that has a stronger relationship with each of them than they have with each other. This third variable might be distorting the relationship.

It seems simple but, in a world where statistically illiterate people are being asked to spend more and more time looking at statistics, it is crucial that this simple issue is not just known by a few but is common knowledge, and that the solution is also commonly known and routinely applied.
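The striker comparison can be sketched the same way. The numbers below are hypothetical, invented purely to illustrate the check, and the helper names are my own. The idea is to compare the two players within each level of opposition as well as overall, and flag the case where the two comparisons disagree.

```python
# HYPOTHETICAL figures, invented for illustration only: (games, goals)
# for each striker, split by quality of opposition.
STRIKERS = {
    "A": {"strong": (20, 8), "weak": (5, 4)},
    "B": {"strong": (5, 1), "weak": (20, 14)},
}


def per_game(games, goals):
    """Goals per game."""
    return goals / games


def overall(record):
    """Goals per game pooled across all levels of opposition."""
    games = sum(g for g, _ in record.values())
    goals = sum(n for _, n in record.values())
    return per_game(games, goals)


def simpson_reversal(records, first, second):
    """True if `second` leads overall while `first` leads in every stratum."""
    strata = records[first].keys()
    first_leads_each = all(
        per_game(*records[first][s]) > per_game(*records[second][s])
        for s in strata
    )
    return first_leads_each and overall(records[second]) > overall(records[first])


if __name__ == "__main__":
    for name, record in STRIKERS.items():
        print(name, "overall goals per game:", round(overall(record), 2))
    print("reversal:", simpson_reversal(STRIKERS, "A", "B"))
```

In this made-up case striker B has the better overall record only because most of B's games were against weak opposition; against either kind of opposition taken separately, A scores more often.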
