4.4 Regression
1. Bivariate Data and Scatter Diagrams
In some situations we study pairs of continuous variables $(x, y)$ to see whether they are correlated. Here $x$ is called the independent variable and $y$ the dependent variable. Plotting the pairs as points on a Cartesian plane gives a scatter diagram.
- Linear Correlation: The points roughly follow a straight line.
- Non-linear Correlation: The points roughly follow a curve (e.g., quadratic or exponential).
- Zero Correlation: The points are randomly scattered with no discernible pattern.
- Positive Correlation: As $x$ increases, $y$ tends to increase.
- Negative Correlation: As $x$ increases, $y$ tends to decrease.
EXAMPLE 1
A mathematics exam comprises two papers: Paper 1 and Paper 2. For a group of 8 students, the scores achieved are as follows:
| $x$ (Paper 1) | 55 | 50 | 30 | 70 | 40 | 75 | 60 | 90 |
|---|---|---|---|---|---|---|---|---|
| $y$ (Paper 2) | 50 | 60 | 45 | 65 | 35 | 80 | 75 | 85 |
The scatter diagram shows a clear overall pattern: as $x$ (Paper 1 score) increases, $y$ (Paper 2 score) tends to increase, and the points lie roughly along a straight line.
Therefore, we can conclude there is a positive linear correlation between the two exam scores.
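The direction seen in the scatter diagram can also be checked numerically. The sketch below is an illustration, not part of the calculator method used in this section. It computes the mean point and the sum of products of deviations from the means, called `s_xy` here (a name introduced only for this example). A positive value means that above-average $x$ values tend to pair with above-average $y$ values.

```python
# Example 1 data: Paper 1 (x) and Paper 2 (y) scores for 8 students.
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sum_of_deviation_products(xs, ys):
    """Sum of (x - mean_x) * (y - mean_y) over all data pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))


x_bar = mean(paper1)
y_bar = mean(paper2)
s_xy = sum_of_deviation_products(paper1, paper2)

print(x_bar, y_bar)  # 58.75 61.875 (the mean point)
print(s_xy > 0)      # True, so the association is positive
```

The sign of `s_xy` only indicates direction. It says nothing on its own about how closely the points follow a line.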
2. Pearson's Correlation Coefficient ($r$)
A GDC (graphic display calculator) gives Pearson's correlation coefficient, denoted $r$, which measures the strength of a linear relationship.
- The value of $r$ always satisfies $-1 \le r \le 1$.
- If $r > 0$, the correlation is positive.
- If $r < 0$, the correlation is negative.
- A value of $r = \pm 1$ indicates a perfect straight-line relationship.
- A value of $r = 0$ indicates no linear correlation.
The strength of the linear correlation is assessed by how closely $|r|$ approaches 1.
$|r| \ge 0.75 \implies \text{Strong correlation}$
$0.5 \le |r| < 0.75 \implies \text{Moderate correlation}$
$|r| < 0.5 \implies \text{Weak correlation}$
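These rules translate directly into a small helper. The following is a minimal sketch that uses the thresholds listed above. The function name `classify_correlation` and the returned labels are choices made for this illustration.

```python
def classify_correlation(r):
    """Return (strength, direction) for a Pearson coefficient r.

    Thresholds follow the text: |r| >= 0.75 is strong,
    0.5 <= |r| < 0.75 is moderate, and |r| < 0.5 is weak.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must satisfy -1 <= r <= 1")

    size = abs(r)
    if size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    return strength, direction


print(classify_correlation(0.9))   # ('strong', 'positive')
print(classify_correlation(-0.6))  # ('moderate', 'negative')
print(classify_correlation(0.2))   # ('weak', 'positive')
```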
3. The Regression Line (Line of Best Fit)
The regression line of $y$ on $x$ is the straight line that best fits the scatter of data points, in the sense that it minimises the sum of the squared vertical distances from the points to the line. Its equation takes the form:
$y = ax + b$
- The line always passes through the mean point $(\bar{x}, \bar{y})$.
- The primary utility of the regression line is to predict the value of the dependent variable $y$ for a given value of the independent variable $x$.
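For reference, when the line is written as $y = ax + b$, the coefficients a GDC reports are the standard least-squares values. The symbols $S_{xy}$ and $S_{xx}$ are notation introduced here for convenience and are not used elsewhere in this section:

$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}), \qquad S_{xx} = \sum (x_i - \bar{x})^2$

$a = \dfrac{S_{xy}}{S_{xx}}, \qquad b = \bar{y} - a\bar{x}$

Substituting $x = \bar{x}$ gives $y = a\bar{x} + (\bar{y} - a\bar{x}) = \bar{y}$, which is why the line passes through the mean point.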
EXAMPLE 2
Consider the data from Example 1 (Paper 1 and Paper 2 scores). Using a GDC, we can determine the following statistical parameters:
- Mean Point: $\bar{x} = 58.75$ and $\bar{y} = 61.875$. The mean point is $(\bar{x}, \bar{y}) = (58.75, 61.875)$.
- Pearson's Coefficient: $r = 0.814$. This confirms a strong positive linear correlation between the two papers.
- Regression Line: The calculator outputs $a = 0.793$ and $b = 15.3$. Thus, the equation of the regression line of $y$ on $x$ is:
$y = 0.793x + 15.3$
Suppose a new student scores 65 on Paper 1. We can estimate their expected score on Paper 2 by substituting $x = 65$ into the regression line equation:
$y = 0.793(65) + 15.3 = 66.845 \approx 67$
A prediction is generally considered reliable if two conditions are met:
- The correlation is strong ($|r| \ge 0.75$).
- The input value $x$ lies within the established range of the original data. In our example, the $x$ values span from $30$ to $90$. Since $65$ falls safely within this interval (interpolation), the prediction of $67$ is considered reliable.
If we attempted to predict a score for $x = 95$, it would fall outside the data range (extrapolation). Such predictions are inherently unreliable.
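The prediction and the two reliability conditions can be sketched in code. This is an illustration only. It uses the coefficients, the value of $r$, and the data range quoted in Example 2, and the function names `predict_paper2` and `is_reliable` are made up for this example.

```python
# Values quoted in Example 2.
SLOPE = 0.793          # a in y = ax + b
INTERCEPT = 15.3       # b in y = ax + b
R = 0.814              # Pearson's correlation coefficient
X_MIN, X_MAX = 30, 90  # smallest and largest Paper 1 scores in the data


def predict_paper2(x):
    """Estimate the Paper 2 score from a Paper 1 score x."""
    return SLOPE * x + INTERCEPT


def is_reliable(x, r=R, x_min=X_MIN, x_max=X_MAX):
    """Apply both conditions: strong correlation and interpolation."""
    strong = abs(r) >= 0.75
    interpolating = x_min <= x <= x_max
    return strong and interpolating


for score in (65, 95):
    print(score, round(predict_paper2(score)), is_reliable(score))
# 65 67 True   (interpolation)
# 95 91 False  (extrapolation)
```

The line still returns a number for $x = 95$, but the reliability check flags it because $95$ lies outside the observed range.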
4. Regression Line of $x$ on $y$
In some contexts both $x$ and $y$ are random variables, with no clear "independent vs. dependent" distinction (e.g., measuring the heights and weights of students). In such cases we can construct two distinct regression lines:
- The regression line of $y$ on $x$: $y = ax + b$. Used to predict $y$ when a value of $x$ is given.
- The regression line of $x$ on $y$: $x = cy + d$. Used to predict $x$ when a value of $y$ is given.
Crucially, both regression lines intersect exactly at the mean point $(\bar{x}, \bar{y})$.
EXAMPLE 3
Using the data from Example 1, assume we want to predict a student's score on Paper 1 based on their known score on Paper 2. We must calculate the regression line of $x$ on $y$.
- Using a GDC, the regression line of $x$ on $y$ is found to be:
$x = 0.835y + 7.07$
- If a student scores $y = 70$ on Paper 2, we predict their Paper 1 score by substituting $y = 70$ into this specific equation:
$x = 0.835(70) + 7.07 = 65.52 \approx 66$
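As a check, the two quoted lines can be solved simultaneously to confirm that they meet near the mean point. The sketch below assumes only the rounded coefficients given in Examples 2 and 3. The names `predict_paper1` and `intersection` are hypothetical helpers for this illustration.

```python
# Rounded coefficients quoted in Examples 2 and 3.
A, B = 0.793, 15.3  # y on x:  y = A*x + B
C, D = 0.835, 7.07  # x on y:  x = C*y + D


def predict_paper1(y):
    """Estimate the Paper 1 score from a Paper 2 score y (x on y line)."""
    return C * y + D


def intersection():
    """Solve y = A*x + B and x = C*y + D simultaneously."""
    # Substituting the first equation into the second gives
    # x = C*(A*x + B) + D, so x*(1 - A*C) = C*B + D.
    x = (C * B + D) / (1 - A * C)
    y = A * x + B
    return x, y


print(round(predict_paper1(70)))  # 66
x_meet, y_meet = intersection()
print(round(x_meet, 2), round(y_meet, 2))
```

The intersection comes out at approximately $(58.74, 61.88)$, which agrees with the mean point $(58.75, 61.875)$ up to the rounding of the coefficients.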