IB DP Maths AI SL Study Notes

4.2.3 Interpreting Correlation

Strength and Direction of Correlation

Understanding Strength

The strength of a correlation refers to the degree to which a pair of variables are linearly related.

  • Strong Correlation: A strong correlation is one where the data points follow a tight linear trend. If one variable changes, the other tends to change in a consistent direction. For instance, consider the relationship between the number of hours studied and exam scores. A strong correlation might indicate that more hours of study are associated with higher exam scores. For a deeper understanding of how these relationships are quantified, see our notes on calculating correlation.
  • Weak Correlation: A weak correlation is one where the linear trend is loose. Changes in one variable may correspond to changes in the other, but the association is not consistent. At the extreme, there may be no correlation at all. For example, shoe size and intelligence are unlikely to be correlated, because a change in one variable does not predict a change in the other. A practical application of correlation can be found in linear regression, which builds upon the concept of correlation to make predictions.

Direction of Correlation

The direction of correlation refers to the way the variables move in relation to each other.

  • Positive Correlation: This occurs when the variables move in the same direction. If one variable increases, the other variable also increases, and vice versa.

Example Question 1: If the price of a commodity (like gold) increases and we observe that the price of a related stock (like a gold mining company) also increases, what type of correlation is this?

Answer: This is a positive correlation because as the price of gold increases, the stock price of the gold mining company also tends to increase.

  • Negative Correlation: This occurs when the variables move in opposite directions. If one variable increases, the other variable decreases, and vice versa.

Example Question 2: If the price of a commodity (like oil) decreases and we observe that the price of a related stock (like an airline company) increases, what type of correlation is this?

Answer: This is a negative correlation because as the price of oil decreases, the stock price of the airline company tends to increase, under the assumption that lower oil prices reduce fuel costs for airlines.
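The direction of a correlation can be read from the sign of the summed products of deviations from the means, which is the numerator of Pearson's correlation coefficient. The following is a minimal Python sketch of that idea. The price figures and the helper name `direction` are made up for illustration and are not data from these notes.

```python
def direction(xs, ys):
    """Classify the direction from the sign of the sum of deviation products."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    if s_xy > 0:
        return "positive"
    if s_xy < 0:
        return "negative"
    return "no linear correlation"

# Hypothetical values, for illustration only.
gold_price = [1800, 1850, 1900, 1950]
miner_stock = [40, 42, 45, 47]
oil_price = [90, 85, 80, 75]
airline_stock = [20, 22, 23, 26]

print(direction(gold_price, miner_stock))   # positive
print(direction(oil_price, airline_stock))  # negative
```

The sign only gives the direction. It says nothing about how strong the relationship is.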

Causation vs Correlation

Defining Causation and Correlation

  • Correlation: This simply indicates that there is a relationship between two variables. For more insight into how data is represented and its implications, explore our section on data representation.
  • Causation: This goes a step further by asserting that changes in one variable are responsible for changes in another.

Misinterpreting Correlation

It's crucial to understand that correlation does not imply causation. Just because two variables are correlated does not mean that changes in one variable cause changes in the other. This principle is further elucidated in our notes on measures of spread, which discuss variability in data sets.

Example Question 3: If a study finds a positive correlation between ice cream sales and drowning incidents, does this imply that ice cream sales cause drownings?

Answer: No, correlation does not imply causation. There may be a lurking variable, such as hot weather, which is associated with both increased ice cream sales and an increased likelihood of swimming, thereby increasing the risk of drowning incidents.

Identifying Causation

Causation is typically established through controlled, randomised experiments. If it is ethically and practically possible to manipulate the independent variable and measure the dependent variable while holding all other variables constant, causation can be inferred.

Example Question 4: In a controlled experiment, if a group of plants is given a new fertiliser and they grow larger than a control group not given the fertiliser, can causation be inferred?

Answer: Yes, provided the plants were randomly assigned to the two groups and the only variable manipulated was the application of the fertiliser, it is reasonable to infer that the fertiliser caused the increased growth.

Practical Application: Pearson’s Correlation Coefficient

Calculating Pearson’s Correlation Coefficient

Pearson's correlation coefficient, denoted as r, quantifies the direction and strength of the linear relationship between two quantitative variables.

  • The value of r is always between -1 and 1 inclusive. The closer r is to -1 or 1, the stronger the linear relationship.
  • r > 0 indicates a positive correlation, r < 0 indicates a negative correlation, and r = 0 indicates no linear correlation. For a detailed guide on using Pearson's r to make informed predictions, refer to our notes on predictions.

Example Question 5: If r = 0.8 between two variables, what does this indicate?

Answer: r = 0.8 indicates a strong, positive linear correlation between the two variables.
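As a rough sketch, the interpretation of r can be written as a small Python function. The sign gives the direction, as described above. The strength cut-offs (0.75 and 0.5) and the function name `describe` are illustrative assumptions, not official boundaries, and textbooks differ on where to draw these lines.

```python
def describe(r):
    """Turn a value of r into a short verbal description."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    # Assumed cut-offs, for illustration only.
    if size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"

print(describe(0.8))   # strong positive
print(describe(-0.3))  # weak negative
```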

Using r to Make Predictions

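While r can indicate the strength and direction of a linear relationship, it should be used cautiously for predictions. Extrapolating beyond the range of the observed data is especially risky because the linear trend may not continue there, and a strong correlation on its own does not establish causation.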

Example Question 6: If the correlation coefficient between a person’s income and their spending on luxury goods is 0.92, can we predict a person’s spending on luxury goods based on their income?

Answer: While the strong positive correlation suggests that higher income is associated with higher spending on luxury goods, predicting an individual’s spending based solely on their income can be inaccurate due to potential lurking variables.

FAQ

Yes, two variables can be correlated even if there is no apparent logical relationship between them, a phenomenon sometimes referred to as a "spurious correlation". This can occur due to coincidence, where the variables just happen to move together, or due to the presence of lurking variables that influence both variables being studied. It is essential to approach correlations with a critical mindset and consider the practical and theoretical basis for any observed relationship. Statistical correlation should be one of the tools used in data analysis, but it should be complemented with theoretical understanding and further research to validate any findings and explore potential causation.

Spearman’s rank correlation coefficient and Pearson’s correlation coefficient are both measures of the strength and direction of the relationship between two variables. The key difference lies in the type of data and relationship they are designed to assess. Pearson’s correlation coefficient is used for quantitative variables and assesses the linear relationship between them. It calculates the statistical covariance of the two variables divided by the product of their standard deviations. On the other hand, Spearman’s rank correlation coefficient is used for ordinal, interval, or ratio data and assesses monotonic relationships (whether linear or not). It calculates the correlation between the rank orders of the two variables rather than the actual raw data. Spearman’s is particularly useful when data are not normally distributed or have outliers, as it mitigates the impact of extreme values.
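The difference can be seen with a relationship that is monotonic but not linear. The Python sketch below computes Pearson's coefficient on the raw values and then on their ranks, which is how Spearman's coefficient is described above. The data (y equal to x cubed) and the helper names `pearson_r` and `ranks` are chosen purely for illustration, and the ranking helper assumes there are no tied values.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations about the means."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # 1, 8, 27, 64, 125: always increasing, but curved

pearson = pearson_r(x, y)
spearman = pearson_r(ranks(x), ranks(y))

print(round(pearson, 4))   # below 1, because the trend is not a straight line
print(round(spearman, 4))  # 1.0, because the rank orders agree perfectly
```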

The coefficient of determination, denoted as R², is the square of the correlation coefficient (r). If r represents Pearson’s correlation coefficient, then R² signifies the proportion of the variance in the dependent variable that is predictable from the independent variable. For instance, if r = 0.8, then R² = 0.64, meaning that 64% of the variance in one variable is explained by the other variable. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. It is particularly useful in regression analysis as it gives an indication of the quality of the predictive model. A higher R² indicates that the model explains more of the variability, while a lower R² indicates less explanatory power. However, a higher R² does not necessarily mean the model is appropriate or that predictions will be accurate. It should be used alongside other metrics and domain knowledge to assess model validity and reliability.
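The arithmetic in this answer can be checked with a few lines of Python. This is only a sketch: the function name `r_squared` is an assumption, and the values of r other than 0.8 are added for comparison.

```python
def r_squared(r):
    """Coefficient of determination for a simple linear relationship."""
    return r ** 2

for r in (0.8, -0.8, 0.5):
    share = r_squared(r)
    print(f"r = {r:+.1f}  ->  r squared = {share:.2f}  ({share:.0%} of variance explained)")
```

Squaring removes the sign, so r = 0.8 and r = -0.8 explain the same share of the variance, while r = 0.5 explains only a quarter of it.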

In observational studies, researchers observe and collect data without intervening or manipulating variables. Being cautious about correlation in this context is crucial because, in observational studies, it is common to encounter lurking variables - unobserved variables that may drive the apparent correlation between the studied variables. Since the researchers do not control or randomise the variables, it is challenging to establish causation. A correlation might be coincidental or influenced by an external factor, and asserting causation without rigorous experimental design can lead to incorrect conclusions and misguided decisions in practical applications. Therefore, while correlations in observational studies can suggest potential relationships, they should be interpreted with caution and ideally followed up with experimental studies to explore causation.

A scatter plot is a graphical representation that uses dots to display values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are particularly useful in visualising correlation as they allow us to observe general patterns, trends, outliers, and even the strength of the relationship between the variables. If the points cluster along a straight line in an uphill direction, it indicates a positive correlation. If the points cluster along a straight line in a downhill direction, it indicates a negative correlation. A scatter plot where points do not follow any trend indicates no correlation. It provides a visual insight into the relationship between the variables, which can be further analysed using statistical techniques like calculating the Pearson’s correlation coefficient.

Practice Questions

Given the following data points representing the ages and monthly incomes (in hundreds) of a group of individuals: (22, 30), (25, 35), (30, 40), (35, 50), (40, 60), calculate the Pearson’s correlation coefficient (r) and interpret its meaning in the context of the data.

Step 1: Calculate the Means
Mean of X (x̄) = (22 + 25 + 30 + 35 + 40) / 5 = 30.4
Mean of Y (ȳ) = (30 + 35 + 40 + 50 + 60) / 5 = 43

Step 2: Apply the Pearson's Correlation Coefficient Formula
r = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² × Σ(yi - ȳ)²]

Let's calculate the numerator and the denominator separately:

Numerator:
= (22 - 30.4)(30 - 43) + (25 - 30.4)(35 - 43) + (30 - 30.4)(40 - 43) + (35 - 30.4)(50 - 43) + (40 - 30.4)(60 - 43)
= (-8.4)(-13) + (-5.4)(-8) + (-0.4)(-3) + (4.6)(7) + (9.6)(17)
= 109.2 + 43.2 + 1.2 + 32.2 + 163.2
= 349

Denominator:
= √{[(22 - 30.4)² + (25 - 30.4)² + (30 - 30.4)² + (35 - 30.4)² + (40 - 30.4)²] × [(30 - 43)² + (35 - 43)² + (40 - 43)² + (50 - 43)² + (60 - 43)²]}
= √{[(-8.4)² + (-5.4)² + (-0.4)² + (4.6)² + (9.6)²] × [(-13)² + (-8)² + (-3)² + (7)² + (17)²]}
= √{[70.56 + 29.16 + 0.16 + 21.16 + 92.16] × [169 + 64 + 9 + 49 + 289]}
= √(213.2 × 580)
= √123656
≈ 351.65

Final Calculation:
r = 349 / 351.65
r ≈ 0.9925 (rounded to four decimal places)

Interpretation: The calculated r value of 0.9925 indicates a very strong positive linear relationship between Age and Monthly Income. This means that as age increases, the monthly income also tends to increase. The data suggests that older individuals in this group tend to have higher monthly incomes.
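A hand calculation like this is easy to get wrong, so it is worth checking each intermediate sum. The Python sketch below follows the same steps on the five data points from the question. The variable names are illustrative choices.

```python
import math

ages = [22, 25, 30, 35, 40]
incomes = [30, 35, 40, 50, 60]  # monthly income, in hundreds

n = len(ages)
x_mean = sum(ages) / n      # mean age
y_mean = sum(incomes) / n   # mean income

# Numerator: sum of products of deviations from the means.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(ages, incomes))

# The two sums of squared deviations under the square root.
s_xx = sum((x - x_mean) ** 2 for x in ages)
s_yy = sum((y - y_mean) ** 2 for y in incomes)

r = s_xy / math.sqrt(s_xx * s_yy)

print(round(s_xy, 2), round(s_xx, 2), round(s_yy, 2))
print(round(r, 4))
```

It prints the numerator, the two sums of squares, and r rounded to four decimal places, so each can be compared against the corresponding line of the working.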

A study found a strong positive correlation between the hours spent exercising per week and the overall health score of individuals. However, the researchers are cautious about claiming that exercising more leads to a higher health score. Explain why the researchers might be cautious and provide two possible lurking variables that might explain the correlation.

The researchers might be cautious about claiming that exercising more leads to a higher health score, despite the strong positive correlation, because correlation does not imply causation. It is possible that there are lurking variables that explain the correlation between hours spent exercising and health score. One possible lurking variable could be diet. Individuals who exercise more might also tend to have healthier eating habits, which could contribute to a higher health score. Another possible lurking variable could be genetic factors. Some individuals might have genetic predispositions that result in a higher health score and also give them more energy or motivation to exercise. Both diet and genetic factors are variables that could affect both exercise and health score, thereby explaining the observed correlation without implying a causal relationship between the two.
