We hear the terms “correlation” and “causation” a lot, but what do they actually mean?
Correlation: describes how strongly, and in which direction, two variables move together
Causation: implies that one variable causes another variable to change. For example, we can confidently conclude that more rain causes more people to acquire umbrellas.
In this post, I will explore what the two terms mean and explain a way of deciding how they relate, using a real-world example.
Survey completion rate correlations
Echo Mobile helps organizations in Africa engage, influence, and understand their target audience via mobile channels. Our core product is a web-based SaaS platform that, among many other things, enables users to design, send and analyze the results of mobile surveys. Our users can deploy their surveys via SMS (Short Messaging Service), USSD (Unstructured Supplementary Service Data), IVR (Interactive Voice Response), and Android apps, but SMS is the most heavily used channel.
Surveys are key to our overall mission, as they give our users a tool to better understand their target audiences, usually their customers or beneficiaries. To make this tool more effective, we wanted to identify the key factors that lead to more people completing surveys sent by our users from the Echo platform. This would enable us to advise our users on how to get more value from our platform through better engagement and understanding of their audiences.
The completion rate of a survey is the percentage of people who complete a survey after being invited to take part in it. We came up with several factors that we thought could affect the completion rate of surveys, listed below alongside the completion rate itself:
- post_incentive: The incentive (a small amount of money or airtime) offered after completing the survey
- invite_day_of_month: The date of the month a respondent was invited to the survey
- invite_day_of_the_week: The day of the week a respondent was asked to take part in the survey
- invite_hour: The hour of the day the respondent was invited to the survey
- num_questions: The number of questions in the survey
- reminded: Whether the respondent was reminded to complete the survey or not
- channel: The channel through which the survey was delivered: SMS, USSD, IVR, web, or Android app. SMS is the most popular channel and accounts for over 90% of surveys
- completion_rate: Of those invited to a survey, the percentage that completed it
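The completion rate defined above is a simple ratio. As a minimal sketch (the function name and the counts are hypothetical, not taken from the Echo platform), it could be computed like this:

```python
def completion_rate(invited: int, completed: int) -> float:
    """Percentage of invited respondents who completed the survey."""
    if invited <= 0:
        raise ValueError("invited must be a positive number")
    if not 0 <= completed <= invited:
        raise ValueError("completed must be between 0 and invited")
    return 100.0 * completed / invited


# Hypothetical survey: 200 people invited, 58 completed it.
rate = completion_rate(invited=200, completed=58)
print(rate)  # 29.0
```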
We used the performance of surveys deployed from the beginning of 2017 to August of 2017 to look for correlations between the factors above. The correlations between the factors are shown in the table below. Since we were mostly interested in how the completion rate relates to the other factors, I will focus on those relationships.
The bigger the magnitude of a correlation, the stronger the relationship. A positive correlation indicates that when one factor increases, the other tends to increase as well. A negative correlation indicates an inverse relationship: when one increases, the other tends to decrease.
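To make the sign and magnitude concrete, here is a minimal sketch of the Pearson correlation coefficient written from scratch. This assumes a Pearson-style coefficient, which the post does not explicitly name, and the two data lists are made up purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: longer surveys paired with lower completion rates.
num_questions = [3, 5, 8, 10, 15]
completion = [80, 72, 65, 50, 41]

r = pearson(num_questions, completion)
print(f"r = {r:+.2f}")  # negative: as one goes up, the other tends to go down
```

The result always lies between -1 and +1. Values near either end indicate a strong linear relationship, and values near 0 indicate a weak one.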
The rows of the table are arranged in descending order of the correlation between completion_rate and the other factors. Looking at the table, invite_hour, with a positive correlation of 0.25, is the factor most strongly correlated with completion_rate. It is followed by reminded, while invite_day_of_month is the most negatively correlated with completion_rate. The correlation between any other pair of factors can also be read from the table; for example, the correlation between num_questions and reminded is 0.05.
Survey completion causations?
The findings above can lead to incorrect conclusions if one is not careful. For example, one could conclude that the invite hour, with a correlation of 0.25, has the strongest causal influence on the completion_rate of a survey. As a result, you might start looking for the right time to send out surveys in the hope of getting more of them completed, and decide that some particular invite hour is the optimum time to send out a survey. But that would be to hold to the (incorrect) idea that correlation implies causation.
A high correlation may mean that one factor causes the other, that the factors influence each other, that both factors are caused by a separate third factor, or even that the correlation is the result of coincidence.
We can, therefore, see that correlation does not always imply causation. With careful investigation, however, it is possible to more confidently conclude whether correlation implies that one variable causes the other.
How can we verify if correlation might imply causation?
1. Use statistically sound techniques to determine the relationship.
Ensure that you use statistically legitimate methods to find the correlation. In particular:
- use variables that correctly quantify the relationship.
- check for outliers and deal with them, since they can distort the correlation.
- ensure the sample is an appropriate representation of the population.
- use a correlation coefficient that is appropriate for the scales of the variables being related.
2. Explain the relationships found
- check that exposure always precedes the outcome. If A is supposed to cause B, A must always occur before B.
- check if the relationship ties in with other existing theories.
- check if the proposed relationship is similar to other relationships in related fields.
- check that no other relationship explains the correlation better. For example, if people who sleep with their shoes on tend to wake up with headaches, a better explanation for the headaches could be drinking the night before rather than sleeping with shoes on.
3. Validate the relationships
- Conditions 1 and 2 above should be tested to determine if they are true or false. The common methods of testing are experiments and checking for consistency of the relationship. An experiment usually requires a model of the relationship, a testable hypothesis based on the model, incorporation of variance control measures, collection of suitable metrics for the relationship, and an appropriate analysis. Experiments done several times should lead to consistent conclusions.
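As one possible shape for such an experiment, respondents could be randomly assigned to two invite hours and the completion outcomes compared with a permutation test. The sketch below is an assumption about how this might be done, not something we have run: the function name and the outcome counts are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in mean outcome.

    Each group is a list of 0/1 outcomes (1 = completed the survey).
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical outcomes after random assignment to two invite hours:
# 60 of 200 completed in the morning group, 90 of 200 in the evening group.
morning = [1] * 60 + [0] * 140
evening = [1] * 90 + [0] * 110

p = permutation_p_value(morning, evening)
print(f"p-value: {p:.4f}")
```

Random assignment is what separates this from the correlation table: because the invite hour is chosen by chance, other factors are balanced across the groups on average, so a small p-value points to the invite hour itself. Repeating the experiment should give consistent results before drawing any conclusion.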
We have not yet carried out these tests on our completion rate correlations. So we don’t yet know, for example, whether particular invite hours cause higher completion rates. We only know that the two are correlated.
We need to be careful before concluding that a particular relationship implies causation. It is generally better not to have a conclusion than to land on an incorrect one which might lead to wrong actions being taken!