4.4 Linear Regression
1. Bivariate Data and Scatter Diagrams
We study pairs of continuous variables $(x, y)$ to see whether they are correlated. Here $x$ is the independent variable and $y$ is the dependent variable. Plotting the pairs as points on a Cartesian plane produces a scatter diagram.
- Linear Correlation: The points lie close to a straight line.
- Non-linear Correlation: The points follow a curve (e.g., quadratic or exponential).
- Zero Correlation: The points are scattered with no discernible pattern.
- Positive Correlation: As $x$ increases, $y$ tends to increase.
- Negative Correlation: As $x$ increases, $y$ tends to decrease.
EXAMPLE 1
A mathematics exam has two papers: Paper 1 and Paper 2. The scores for 8 students are:
| $x$ (Paper 1) | 55 | 50 | 30 | 70 | 40 | 75 | 60 | 90 |
|---|---|---|---|---|---|---|---|---|
| $y$ (Paper 2) | 50 | 60 | 45 | 65 | 35 | 80 | 75 | 85 |
The scatter diagram reveals a pattern: as $x$ (the Paper 1 score) increases, $y$ (the Paper 2 score) tends to increase, and the points lie roughly along a straight line.
Therefore, there is a positive linear correlation between the exam scores.
2. Pearson's Correlation Coefficient ($r$)
Pearson's correlation coefficient, denoted $r$, quantifies the strength and direction of the linear relationship between $x$ and $y$.
- The value of $r$ is bounded: $-1 \le r \le 1$.
- If $r > 0$, the correlation is positive.
- If $r < 0$, the correlation is negative.
- A value of $r = \pm 1$ indicates a perfect linear correlation.
- A value of $r = 0$ indicates no linear correlation.
The strength of the linear correlation depends on how close $|r|$ is to 1.
$|r| \ge 0.75 \implies \text{Strong correlation}$
$0.5 \le |r| < 0.75 \implies \text{Moderate correlation}$
$|r| < 0.5 \implies \text{Weak correlation}$
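The notes leave the computation of $r$ to a GDC. As a rough sketch of what the calculator does, the Python below uses the standard formula $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, where $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx}$, $S_{yy}$ are defined analogously. The $S$ notation and the function names `pearson_r` and `classify_strength` are illustrative assumptions, not part of the original notes. The data is taken from Example 1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the standard sums of squares (S notation is an assumption)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def classify_strength(r):
    """Apply the three thresholds listed above to |r|."""
    size = abs(r)
    if size >= 0.75:
        return "strong"
    if size >= 0.5:
        return "moderate"
    return "weak"

# Data from Example 1
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]

r = pearson_r(paper1, paper2)
direction = "positive" if r > 0 else "negative" if r < 0 else "no linear"
print(f"r = {r:.3f}: {classify_strength(r)}, {direction} correlation")
```

For the Example 1 data this reports a strong positive correlation, in line with the scatter diagram.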
3. The Regression Line
The regression line of $y$ on $x$ is the line that "best fits" the data points. Its equation is:
$y = ax + b$
- This line passes through the mean point $(\bar{x}, \bar{y})$.
- The regression line predicts the dependent variable $y$ for a given independent variable $x$.
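A GDC normally supplies the coefficients of $y = ax + b$. One common way to obtain them by hand is the least-squares formulas $a = S_{xy} / S_{xx}$ and $b = \bar{y} - a\bar{x}$. The sketch below assumes that method and uses the hypothetical helper name `fit_y_on_x`, neither of which comes from the notes. The final line checks the mean-point property stated above.

```python
def fit_y_on_x(xs, ys):
    """Least-squares coefficients (a, b) of the line y = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx          # gradient
    b = y_bar - a * x_bar    # intercept, chosen so the line hits the mean point
    return a, b

# Data from Example 1
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]

a, b = fit_y_on_x(paper1, paper2)
x_bar = sum(paper1) / len(paper1)
y_bar = sum(paper2) / len(paper2)

print(f"y = {a:.3f}x {b:+.1f}")
# The fitted line evaluated at x_bar should return y_bar (up to rounding error)
print(abs((a * x_bar + b) - y_bar) < 1e-9)
```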
EXAMPLE 2
Consider the data from Example 1. Using a GDC, we determine these parameters:
- Mean Point: $\bar{x} = 58.75$ and $\bar{y} = 61.875$, making the mean point $(\bar{x}, \bar{y}) = (58.75, 61.875)$.
- Pearson's Coefficient: $r = 0.880$. This confirms a strong positive linear correlation.
- Regression Line: The calculator gives $a = 0.803$ and $b = 14.7$. The regression line of $y$ on $x$ is:
$y = 0.803x + 14.7$
If a student scores 65 on Paper 1, we estimate their Paper 2 score by substituting $x = 65$ into the regression line:
$y = 0.803(65) + 14.7 = 66.895 \approx 67$
A prediction is reliable if two conditions are met:
- The correlation is strong ($|r| \ge 0.75$).
- The input $x$ is within the range of the original data (interpolation). In our example, $x$ spans from $30$ to $90$. Since $65$ is within this interval, the prediction of $67$ is reliable.
Predicting a score for $x = 95$ requires extrapolation, because $95$ lies outside the data range, so the estimate is unreliable.
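The two conditions can be expressed as a small check. This is only a sketch: the function name `check_prediction` and the value `r_strong` are hypothetical, and the $0.75$ threshold is the one used in these notes.

```python
def check_prediction(x_new, xs, r, threshold=0.75):
    """Return the reasons to distrust a prediction (an empty list means reliable)."""
    reasons = []
    if abs(r) < threshold:
        reasons.append("correlation is not strong")
    if not (min(xs) <= x_new <= max(xs)):
        reasons.append("extrapolation: x is outside the data range")
    return reasons

paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x values from Example 1
r_strong = 0.8  # illustrative strong value, not the coefficient from Example 2

print(check_prediction(65, paper1, r_strong))  # 65 lies between 30 and 90
print(check_prediction(95, paper1, r_strong))  # 95 lies above the largest x
```

Returning the reasons rather than a bare true/false keeps the explanation attached to the verdict, which is usually what an exam answer has to state.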
4. Regression Line of $x$ on $y$
When $x$ and $y$ have no clear "independent vs. dependent" distinction (e.g., heights and weights), we can construct two regression lines:
- The regression line of $y$ on $x$: $y = ax + b$. Used to predict $y$ given $x$.
- The regression line of $x$ on $y$: $x = cy + d$. Used to predict $x$ given $y$.
Both regression lines intersect at the mean point $(\bar{x}, \bar{y})$.
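The intersection property can be checked numerically. The sketch below assumes both lines are fitted by least squares and reuses one hypothetical helper, `fit_line`, with the roles of the two variables swapped for the second line.

```python
def fit_line(us, vs):
    """Least-squares (m, k) for the line v = m*u + k, used to predict v from u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    return m, v_bar - m * u_bar

# Data from Example 1
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]

a, b = fit_line(paper1, paper2)  # y on x: y = a*x + b
c, d = fit_line(paper2, paper1)  # x on y: x = c*y + d

x_bar = sum(paper1) / len(paper1)
y_bar = sum(paper2) / len(paper2)

# Each line, evaluated at the mean of its input, returns the mean of its output
on_first_line = abs((a * x_bar + b) - y_bar) < 1e-9
on_second_line = abs((c * y_bar + d) - x_bar) < 1e-9
print(on_first_line and on_second_line)
```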
EXAMPLE 3
Using the data from Example 1, to predict a Paper 1 score from a Paper 2 score, we calculate the regression line of $x$ on $y$.
- Using a GDC, the regression line of $x$ on $y$ is:
$x = 0.964y - 0.925$
- If a student scores $y = 70$ on Paper 2, we predict their Paper 1 score by substituting $y = 70$ into this equation:
$x = 0.964(70) - 0.925 = 66.555 \approx 67$
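A natural question is why a second line is needed instead of rearranging $y = ax + b$ for $x$. The sketch below, which assumes least-squares fits and a hypothetical helper `fit_line`, computes the prediction both ways so the two answers can be compared. The two lines coincide only when $r = \pm 1$, so in general the rearranged line gives a different estimate.

```python
def fit_line(us, vs):
    """Least-squares (m, k) for the line v = m*u + k, used to predict v from u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    return m, v_bar - m * u_bar

# Data from Example 1
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]

y_given = 70

# Intended approach: regression line of x on y
c, d = fit_line(paper2, paper1)
x_pred = c * y_given + d

# For comparison only: rearranging the y-on-x line, which is a different line
a, b = fit_line(paper1, paper2)
x_rearranged = (y_given - b) / a

print(f"x on y:            x = {x_pred:.2f}")
print(f"rearranged y on x: x = {x_rearranged:.2f}")
```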