Correlation is not Causation

“Correlation is not causation” and “Correlation does not imply Causation” are phrases heard very often today in the field of AI, which makes it all the more important to understand what they mean.

Let's start with an example: comparing “Cholesterol” with “Exercise” to understand the relationship between them.

Observations:

  • In the graph on the left, it seems that more exercise leads to higher cholesterol.
  • In the graph on the right, we see that there is a confounder, “Age”, that influences both “Exercise” and “Cholesterol”.
  • The second graph makes more sense, right? Once age is taken into account, it agrees with the known fact that more exercise leads to lower cholesterol.
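The pattern in these observations can be imitated with simulated data. The sketch below is a toy model, not the data behind the original graphs: the age groups, coefficients, and noise levels are all assumptions, chosen so that age raises both exercise and cholesterol while exercise itself lowers cholesterol.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


random.seed(0)
ages = [20, 30, 40, 50, 60]  # assumed age groups
data = []  # rows of (age, exercise, cholesterol)
for age in ages:
    for _ in range(200):
        # Assumption: older people exercise more in this toy data.
        exercise = age / 10 + random.gauss(0, 1)
        # Assumption: cholesterol rises with age and falls with exercise.
        cholesterol = 100 + 3.0 * age - 5.0 * exercise + random.gauss(0, 3)
        data.append((age, exercise, cholesterol))

# Ignoring age (the "left graph"): one correlation over all rows.
overall_r = pearson([d[1] for d in data], [d[2] for d in data])

# Accounting for age (the "right graph"): one correlation per age group.
within_r = {}
for age in ages:
    group = [d for d in data if d[0] == age]
    within_r[age] = pearson([d[1] for d in group], [d[2] for d in group])

print("overall:", round(overall_r, 2))
for age, r in within_r.items():
    print("age", age, ":", round(r, 2))
```

In this toy data the overall correlation comes out positive, while the correlation inside every age group is negative, which is the same reversal the two graphs show.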

Now let's define the terms we are trying to understand here.

  • Correlation: It is a statistical measure that describes the size and direction of a relationship between 2 variables. It does not tell us whether a change in one would cause a change in the other. It also doesn’t tell us the “why and how” behind the relationship; it just says that a relationship exists.

“A bunch of variables that vary together for a very long time, or vary cohesively” - Judea Pearl

  • Causation: It is the cause-and-effect relationship between 2 variables. A change in one variable causes a change in the other; one variable can make the other happen.

Examples:

  1. Smoking increases the risk of developing lung cancer; the two are causally related. Smoking may also be correlated with alcoholism, but smoking does not cause alcoholism. Establishing “cause and effect” is difficult, whereas finding a correlation is not hard.
  2. In 2012, Messerli showed a strong positive relationship between chocolate consumption and the number of Nobel laureates per country. These 2 are highly correlated, but we should not mistake correlation for causation: “Eating chocolates does not produce Nobel Prize winners”.

How to find Correlation: If two variables are statistically correlated, a correlation coefficient describes the degree of the relationship between them. It ranges from -1 to +1 and shows both the strength and the direction of the relationship; values near 0 indicate little or no linear relationship.
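As a small illustration, here is a sketch of the most common such coefficient, Pearson's r, written with the standard library only. The function name `pearson` and the sample lists are made up for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


xs = [1, 2, 3, 4, 5]
r_up = pearson(xs, [2, 4, 6, 8, 10])    # y rises exactly in step with x
r_down = pearson(xs, [10, 8, 6, 4, 2])  # y falls exactly in step with x
r_weak = pearson(xs, [3, 1, 4, 1, 5])   # no clear pattern

print(r_up)    # 1.0
print(r_down)  # -1.0
print(round(r_weak, 2))
```

The sign gives the direction and the magnitude gives the strength. Nothing in the calculation says which variable, if either, drives the other.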

How to find the Causal relationship: In a controlled study, the sample or the population is usually split into 2 groups. The 2 groups receive different treatments, and the outcomes are assessed. If the groups are otherwise comparable, differences in outcomes can be attributed to the different treatments. This provides statistical evidence of whether or not a causal relationship exists between the two variables. However, it's not always possible to run a controlled study.
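To see why the way the groups are formed matters, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the outcome model, the hidden factor (age), and the true treatment effect of -10 units. It compares a study where older people are more likely to end up in the treated group with one where a coin flip decides.

```python
import random

random.seed(1)
TRUE_EFFECT = -10.0  # assumed: the treatment lowers the outcome by 10 units


def outcome(age, treated):
    # Assumed outcome model: grows with age, shifted by the treatment.
    effect = TRUE_EFFECT if treated else 0.0
    return 150 + 1.0 * age + effect + random.gauss(0, 5)


def mean(values):
    return sum(values) / len(values)


def estimate(assign, n=20000):
    """Difference in mean outcome, treated minus control."""
    treated, control = [], []
    for _ in range(n):
        age = random.uniform(20, 70)
        is_treated = assign(age)
        (treated if is_treated else control).append(outcome(age, is_treated))
    return mean(treated) - mean(control)


# Observational data: older people are more likely to be treated.
observational = estimate(lambda age: random.random() < (age - 20) / 50)

# Controlled study: a coin flip decides, so age is balanced across groups.
randomized = estimate(lambda age: random.random() < 0.5)

print("observational estimate:", round(observational, 1))
print("randomized estimate:", round(randomized, 1))
```

In this setup the observational comparison even gets the sign wrong, because the treated group is older on average, while the randomized comparison lands close to the assumed true effect.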

Note: A correlation may be due to coincidence; two variables moving together doesn’t mean that one is causing the other. Example: Ice cream sales are correlated with homicides in New York (Study). Although there might be a correlation, it doesn’t mean one causes the other.

  • Sometimes, variables just happen to be highly correlated with each other over a period of time, purely by chance.

A very good website for this is Spurious Correlations by Tyler Vigen, where public data is used to point out some unexpected and funny correlations, along with their plots. I highly suggest checking it out.

These spurious correlations have also made it into the news. One article reports “Alcohol causes 20,000 cancer deaths in U.S annually”, whereas another says “One Drink of Red wine or Alcohol is relaxing to circulation”. Such headlines can be very contradictory. There have been many errors in scientific reporting where spurious correlations were mentioned and then taken to be causal relations.

Overall Problems:

  1. Mostly, when we see a correlation, we are actually thinking of causation.
  2. The data that we see is not all the data that there is.
  3. If the variables don’t “see” each other, i.e. there is no causal link between them, they won't vary together forever.
  4. Why do we see a correlation if there is no causation? In most cases we have observational data, which covers only those specific events that vary together, hence we see a correlation. Because we are ignoring certain incidents, we tend to find a correlation (which actually isn’t there).
  5. Even well-trained scientists have confused correlation and causation in the opposite direction, treating a real causal relationship as mere correlation. In the 1950s, some statisticians doubted that tobacco causes cancer. They argued that without a randomized experiment comparing smokers with nonsmokers, this could not be established. Eventually, the causal relationship between tobacco and cancer was established. Such problems exist.
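Problem 4 can be illustrated with a small simulation. This is one possible reading of "ignoring certain incidents", sketched under assumptions: two traits are generated independently, and then only the cases whose combined score passes a made-up cutoff are kept, as if the rest had never been recorded.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


random.seed(7)

# Two traits drawn independently: by construction, neither affects the other.
population = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(5000)]

# Assumed selection rule: only cases with a high combined score are observed.
observed = [(a, b) for a, b in population if a + b > 1.0]

full_r = pearson([p[0] for p in population], [p[1] for p in population])
observed_r = pearson([p[0] for p in observed], [p[1] for p in observed])

print("all cases:", round(full_r, 2))
print("observed cases only:", round(observed_r, 2))
```

Over all cases the correlation is close to zero, but among the observed cases a clearly negative correlation appears, even though nothing connects the two traits.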

Conclusion:

  1. In our Machine Learning models, variables varying together (correlation) may reveal a powerful and undiscovered connection with predictive and explanatory power. On the other hand, the correlation may represent just statistical noise or bias in the data. Thus there is a flawed tendency to read causality into what is only correlation.
  2. Moving from Correlation to Causation is essential when it comes to understanding the conditions under which our Machine Learning model might fail. It is important to know the causation behind your model for effective decision making.
  3. We have examples (as mentioned above) showing that correlations and predictive analysis can fail in specific cases, for instance when a trend reverses once the data is split by a confounder, which is known as Simpson's Paradox.
  4. Associations (or correlations) might happen due to chance. Increasing the sample size might help reduce these chance associations in some cases.
  5. Never come to a conclusion just by looking at correlations. Other underlying factors should be considered in the analysis before reaching a conclusion.
  6. There is a need to understand the “Data Generating Process”, or the Causal Model. Only then can we try to understand which factors influence the results. This requires looking beyond the data.
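Point 4 can be checked with a quick experiment. The sketch below draws pairs of completely unrelated variables and counts how often they still look correlated; the threshold of 0.5, the sample sizes, and the number of trials are arbitrary choices for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def chance_rate(n, trials=300, threshold=0.5):
    """Fraction of trials where two independent variables show |r| > threshold."""
    hits = 0
    for _ in range(trials):
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [random.gauss(0, 1) for _ in range(n)]
        if abs(pearson(xs, ys)) > threshold:
            hits += 1
    return hits / trials


random.seed(42)
rate_small = chance_rate(10)    # 10 data points per variable
rate_large = chance_rate(1000)  # 1000 data points per variable

print("n=10:", rate_small)
print("n=1000:", rate_large)
```

With only 10 points, a noticeable share of unrelated pairs passes the threshold by luck; with 1000 points this practically never happens. A larger sample helps against chance, but it does nothing against a confounder.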

Resources:

  1. Forbes: Machine learning is about correlation not causation
  2. https://towardsdatascience.com/why-correlation-does-not-imply-causation-5b99790df07e
  3. Towards Data Science blog: some more examples, explained well

Masters Student | Machine Learning | Artificial Intelligence | Causal Inference | Data Bias | Twitter: @adabhishekdabas

Abhishek Dabas
