4.4 Linear Regression

1. Bivariate Data and Scatter Diagrams

We study pairs of values $(x, y)$ of two continuous variables to determine whether they are correlated. Here $x$ is the independent variable and $y$ is the dependent variable. Plotting the pairs as points on a Cartesian plane gives a scatter diagram.

Correlation Assessment: We evaluate correlation visually based on the scatter diagram's pattern:
  • Linear Correlation: The points lie close to a straight line.
  • Non-linear Correlation: The points follow a curve (e.g., quadratic or exponential).
  • Zero Correlation: The points are scattered with no discernible pattern.

Direction of Correlation (for linear cases):
  • Positive Correlation: As $x$ increases, $y$ tends to increase.
  • Negative Correlation: As $x$ increases, $y$ tends to decrease.

EXAMPLE 1

A mathematics exam has two papers: Paper 1 and Paper 2. The scores for 8 students are:

$x$ (Paper 1)   55   50   30   70   40   75   60   90
$y$ (Paper 2)   50   60   45   65   35   80   75   85

[Figure: scatter diagram of the eight points, with $x$ (P1) on the horizontal axis and $y$ (P2) on the vertical axis.]

The scatter diagram reveals a pattern: as $x$ (Paper 1 scores) increases, $y$ (Paper 2 scores) tends to increase, and the points lie roughly along a straight line.

Therefore, there is a positive linear correlation between the two exam scores.
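
The visual impression can be cross-checked numerically. The sketch below is an illustration rather than part of the method described above. It computes the sum of the products of deviations from the two means, whose sign agrees with the direction of a linear trend. The names `co_deviation`, `paper1` and `paper2` are chosen for this example only.

```python
# Scores from Example 1.
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]  # y

def co_deviation(xs, ys):
    """Sum of (x - mean_x) * (y - mean_y) over all data pairs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

s_xy = co_deviation(paper1, paper2)

# A positive sum means high x values mostly pair with high y values.
if s_xy > 0:
    direction = "positive"
elif s_xy < 0:
    direction = "negative"
else:
    direction = "none"

print(direction)  # prints "positive"
```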

2. Pearson's Correlation Coefficient ($r$)

Pearson's correlation coefficient, denoted as $r$, quantifies the strength and direction of the linear relationship between $x$ and $y$.

  • The value of $r$ is bounded: $-1 \le r \le 1$.
  • If $r > 0$, the correlation is positive.
  • If $r < 0$, the correlation is negative.
  • A value of $r = \pm 1$ indicates a perfect linear correlation.
  • A value of $r = 0$ indicates no linear correlation.
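
To see what lies behind these properties, here is a minimal sketch of how $r$ can be computed by hand. It assumes the standard formula $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, where each $S$ is a sum of products of deviations from the means. The function name `pearson_r` and the small data sets are hypothetical, chosen so that the points lie exactly on a line.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data lying exactly on a straight line.
xs = [1, 2, 3, 4]
r_up = pearson_r(xs, [3, 5, 7, 9])    # points on y = 2x + 1
r_down = pearson_r(xs, [9, 7, 5, 3])  # points on y = -2x + 11

print(r_up, r_down)  # prints 1.0 -1.0
```

A rising line gives $r = 1$ and a falling line gives $r = -1$, matching the perfect-correlation case in the list above.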

The closer $|r|$ is to 1, the stronger the linear correlation. We use the following guideline:
  • $|r| \ge 0.75 \implies \text{Strong correlation}$
  • $0.5 \le |r| < 0.75 \implies \text{Moderate correlation}$
  • $|r| < 0.5 \implies \text{Weak correlation}$
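
The thresholds above translate directly into a small decision rule. The following sketch shows one way to write it. The function name `describe_correlation` is hypothetical, and the cut-offs 0.75 and 0.5 are taken from the guideline above.

```python
def describe_correlation(r):
    """Describe r in words using the 0.75 and 0.5 thresholds."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"

    size = abs(r)
    if size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

print(describe_correlation(-0.62))  # prints "moderate negative"
```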

3. The Regression Line

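The regression line of $y$ on $x$ is the line that "best fits" the data points, in the sense that it minimizes the sum of the squared vertical distances (residuals) between the points and the line. Its equation is:

$y = ax + b$

  • This line passes through the mean point $(\bar{x}, \bar{y})$.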
  • The regression line predicts the dependent variable $y$ for a given independent variable $x$.
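
As a rough illustration of how $a$ and $b$ can be obtained without a GDC, the sketch below assumes the standard least-squares formulas $a = S_{xy} / S_{xx}$ and $b = \bar{y} - a\bar{x}$. The function name `regression_y_on_x` and the five data points are hypothetical. The last line checks that the fitted line passes through the mean point.

```python
def regression_y_on_x(xs, ys):
    """Least-squares gradient a and intercept b of y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    b = mean_y - a * mean_x  # this choice puts the mean point on the line
    return a, b

# Hypothetical data, used only to illustrate the calculation.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

a, b = regression_y_on_x(xs, ys)
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

print(round(a, 3), round(b, 3))
print(abs((a * mean_x + b) - mean_y) < 1e-9)  # mean point lies on the line
```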

EXAMPLE 2

Consider the data from Example 1. Using a GDC, we determine these parameters:

  • Mean Point: $\bar{x} = 58.75$ and $\bar{y} = 61.875$, making the mean point $(\bar{x}, \bar{y}) = (58.75, 61.875)$.
  • Pearson's Coefficient: $r = 0.880$. This confirms a strong positive linear correlation.
  • Regression Line: The calculator gives $a = 0.803$ and $b = 14.7$. The regression line of $y$ on $x$ is:
    $y = 0.803x + 14.7$

Prediction Using the Regression Line:

If a student scores 65 on Paper 1, we estimate their Paper 2 score by substituting $x = 65$ into the regression line:

$y = 0.803(65) + 14.7 = 66.895 \approx 67$

Reliability of Predictions:

A prediction is reliable if two conditions are met:

  1. The correlation is strong ($|r| \ge 0.75$).
  2. The input $x$ is within the range of the original data (interpolation). In our example, $x$ spans from $30$ to $90$. Since $65$ is within this interval, the prediction of $67$ is reliable.

By contrast, $x = 95$ lies outside the data range, so a prediction there is an extrapolation and is unreliable, however strong the correlation.
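
The two reliability conditions can be checked mechanically. The sketch below recomputes the line and $r$ from the table in Example 1, then reports whether a prediction satisfies both conditions. The helper names `fit_y_on_x` and `predict_y` are hypothetical, and the default cut-off of 0.75 is the "strong" threshold used in this section.

```python
from math import sqrt

# Scores from Example 1.
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]  # y

def fit_y_on_x(xs, ys):
    """Return gradient a, intercept b and Pearson's r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    a = s_xy / s_xx
    return a, mean_y - a * mean_x, s_xy / sqrt(s_xx * s_yy)

def predict_y(x_new, xs, ys, strong=0.75):
    """Predict y at x_new and say whether both reliability conditions hold."""
    a, b, r = fit_y_on_x(xs, ys)
    is_strong = abs(r) >= strong                     # condition 1
    is_interpolation = min(xs) <= x_new <= max(xs)   # condition 2
    return a * x_new + b, is_strong and is_interpolation

for score in (65, 95):
    estimate, reliable = predict_y(score, paper1, paper2)
    label = "reliable" if reliable else "unreliable"
    print(f"x = {score}: y is about {estimate:.1f} ({label})")
```

Note that the function still returns a number for $x = 95$. The check flags the estimate as unreliable but does not prevent it from being computed.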

4. Regression Line of $x$ on $y$

When $x$ and $y$ have no clear "independent vs. dependent" distinction (e.g., heights and weights), we can construct two regression lines:

  • The Regression Line of $y$ on $x$: $y = ax + b$.
    Used to predict $y$ given $x$.
  • The Regression Line of $x$ on $y$: $x = cy + d$.
    Used to predict $x$ given $y$.

Both regression lines intersect at the mean point $(\bar{x}, \bar{y})$.
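
The intersection property can be verified numerically. In the sketch below, the line of $x$ on $y$ is obtained from the same least-squares routine by swapping the roles of the two variables. The two equations are then solved simultaneously. The function name `least_squares` is hypothetical, and the data are those of Example 1.

```python
# Scores from Example 1.
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]  # y

def least_squares(us, vs):
    """Gradient and intercept of the regression line of v on u."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    s_uu = sum((u - mean_u) ** 2 for u in us)
    gradient = s_uv / s_uu
    return gradient, mean_v - gradient * mean_u

a, b = least_squares(paper1, paper2)  # y on x:  y = ax + b
c, d = least_squares(paper2, paper1)  # x on y:  x = cy + d

# Substituting x = cy + d into y = ax + b gives y = (a*d + b) / (1 - a*c).
# The denominator is non-zero unless the correlation is perfect.
y_cross = (a * d + b) / (1 - a * c)
x_cross = c * y_cross + d

mean_x = sum(paper1) / len(paper1)
mean_y = sum(paper2) / len(paper2)
print(round(x_cross, 3), round(y_cross, 3))
print(round(mean_x, 3), round(mean_y, 3))
```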

EXAMPLE 3

Using the data from Example 1, to predict a Paper 1 score from a Paper 2 score, we calculate the regression line of $x$ on $y$.

  • Using a GDC, the regression line of $x$ on $y$ is:
    $x = 0.964y - 0.925$
  • If a student scores $y = 70$ on Paper 2, we predict their Paper 1 score by substituting $y = 70$ into this equation:
    $x = 0.964(70) - 0.925 = 66.555 \approx 67$

Warning: Do not predict $x$ by substituting $y = 70$ into the $y$ on $x$ regression line from Example 2 and solving for $x$. The two lines minimize different residuals (vertical distances for $y$ on $x$, horizontal distances for $x$ on $y$), so they are not algebraically interchangeable. Always use the line whose subject is the variable you want to predict.
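
To make the warning concrete, the sketch below fits both lines to the Example 1 data and compares the two ways of estimating $x$ when $y = 70$. The helper name `least_squares` is hypothetical. The two estimates differ, and the gap grows as $|r|$ moves away from 1.

```python
# Scores from Example 1.
paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]  # y

def least_squares(us, vs):
    """Gradient and intercept of the regression line of v on u."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    s_uu = sum((u - mean_u) ** 2 for u in us)
    gradient = s_uv / s_uu
    return gradient, mean_v - gradient * mean_u

a, b = least_squares(paper1, paper2)  # y on x:  y = ax + b
c, d = least_squares(paper2, paper1)  # x on y:  x = cy + d

y_given = 70
x_from_x_on_y = c * y_given + d       # correct line for predicting x
x_from_inverting = (y_given - b) / a  # y on x line solved for x (not advised)

print(round(x_from_x_on_y, 1), round(x_from_inverting, 1))
```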