4.4 Regression

1. Bivariate Data and Scatter Diagrams

In certain scenarios, we study pairs of continuous variables $(x, y)$ to observe if they are correlated. $x$ is termed the independent variable, while $y$ is the dependent variable. The resulting points are plotted on a Cartesian plane, creating what is known as a scatter diagram.

Correlation Assessment: We evaluate correlation visually based on the pattern of the scatter diagram:
  • Linear Correlation: The points roughly follow a straight line.
  • Non-linear Correlation: The points roughly follow a curve (e.g., quadratic or exponential).
  • Zero Correlation: The points are randomly scattered with no discernible pattern.
Direction of Correlation (for Linear cases):
  • Positive Correlation: As $x$ increases, $y$ tends to increase.
  • Negative Correlation: As $x$ increases, $y$ tends to decrease.

EXAMPLE 1

A mathematics exam comprises two papers: Paper 1 and Paper 2. For a group of 8 students, the scores achieved are as follows:

$x$ (Paper 1) 55 50 30 70 40 75 60 90
$y$ (Paper 2) 50 60 45 65 35 80 75 85
[Scatter diagram of the data above: $x$ (Paper 1) on the horizontal axis, $y$ (Paper 2) on the vertical axis.]

The scatter diagram shows a clear trend: as $x$ (Paper 1 score) increases, $y$ (Paper 2 score) tends to increase, and the points lie roughly along a straight line.

Therefore, we can conclude there is a positive linear correlation between the two exam scores.

2. Pearson's Correlation Coefficient ($r$)

A GDC (graphic display calculator) gives Pearson's correlation coefficient, denoted $r$, which measures the strength and direction of a linear relationship.

  • The value of $r$ always satisfies $-1 \le r \le 1$.
  • If $r > 0$, the correlation is positive.
  • If $r < 0$, the correlation is negative.
  • A value of $r = \pm 1$ indicates a perfect straight-line relationship.
  • A value of $r = 0$ indicates no linear correlation (a non-linear relationship may still exist).

The strength of the linear correlation is judged by how close $|r|$ is to 1. A common guideline is:
$|r| \ge 0.75 \implies \text{Strong correlation}$
$0.5 \le |r| < 0.75 \implies \text{Moderate correlation}$
$|r| < 0.5 \implies \text{Weak correlation}$
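
As a rough illustration of what a GDC does internally, the following Python sketch computes $r$ from the standard product-moment formula and then applies the sign rules and strength thresholds above. The function names `pearson_r` and `describe_correlation` are illustrative choices, not part of any calculator or library; the data are the Paper 1 and Paper 2 scores from Example 1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data.

    Assumes both lists have the same length and neither is constant
    (otherwise the denominator would be zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

def describe_correlation(r):
    """Return (strength, direction) using the thresholds in this section."""
    size = abs(r)
    if size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction

paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x values from Example 1
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]  # y values from Example 1

r = pearson_r(paper1, paper2)
print(round(r, 3), describe_correlation(r))
```

In an exam setting the GDC value of $r$ is what is expected; the sketch is only meant to make the definition concrete.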

3. The Regression Line (Line of Best Fit)

The regression line of $y$ on $x$ is the straight line that best fits the data points in the least-squares sense: it minimizes the sum of the squared vertical distances from the points to the line. Its equation takes the form:

$y = ax + b$
  • The line always passes through the mean point $(\bar{x}, \bar{y})$.
  • The primary utility of the regression line is to predict the value of the dependent variable $y$ for a given value of the independent variable $x$.
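
For reference, the coefficients a GDC reports can be obtained from the standard least-squares formulas below. The shorthand $S_{xy}$ and $S_{xx}$ is introduced here for convenience and is not used elsewhere in this section.

$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}), \quad S_{xx} = \sum (x_i - \bar{x})^2$
$a = \dfrac{S_{xy}}{S_{xx}}, \quad b = \bar{y} - a\bar{x}$

Rearranging the formula for $b$ shows why the line passes through the mean point:

$b = \bar{y} - a\bar{x} \implies \bar{y} = a\bar{x} + b$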

EXAMPLE 2

Consider the data from Example 1 (Paper 1 and Paper 2 scores). Using a GDC, we can determine the following statistical parameters:

  • Mean Point: $\bar{x} = 58.75$ and $\bar{y} = 61.875$. The mean point is $(\bar{x}, \bar{y}) = (58.75, 61.875)$.
  • Pearson's Coefficient: $r = 0.880$. This confirms a strong positive linear correlation between the two papers.
  • Regression Line: The calculator outputs $a = 0.803$ and $b = 14.7$ (to 3 significant figures). Thus, the equation of the regression line of $y$ on $x$ is:
    $y = 0.803x + 14.7$

Prediction Using the Regression Line:

Suppose a new student scores 65 on Paper 1. We can estimate their expected score on Paper 2 by substituting $x = 65$ into the regression line equation:

$y = 0.803(65) + 14.7 = 66.895 \approx 67$
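
The same prediction can be reproduced without a GDC. The sketch below fits the least-squares line of $y$ on $x$ to the Example 1 data (slope equal to the sum of cross-deviations divided by the sum of squared $x$-deviations) and substitutes $x = 65$. The helper name `fit_line` is hypothetical, chosen for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares line ys = a * xs + b; returns (a, b).

    Assumes xs is not constant, so the slope is defined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = s_xy / s_xx            # slope
    b = mean_y - a * mean_x    # intercept, forces the line through the mean point
    return a, b

paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x values from Example 1
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]  # y values from Example 1

a, b = fit_line(paper1, paper2)
predicted = a * 65 + b         # predicted Paper 2 score for x = 65
print(round(predicted))        # 67
```

Because the code keeps full precision for $a$ and $b$, the unrounded prediction may differ slightly in the later decimal places from a hand calculation with rounded coefficients, but it rounds to the same whole-number score.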
Reliability of Predictions:

A prediction is generally considered reliable if two conditions are met:

  1. The correlation is strong ($|r| \ge 0.75$).
  2. The input value $x$ lies within the established range of the original data. In our example, the $x$ values span from $30$ to $90$. Since $65$ falls safely within this interval (interpolation), the prediction of $67$ is considered reliable.

If we attempted to predict a score for $x = 95$, the input would fall outside the data range (extrapolation). Such predictions are unreliable because there is no evidence that the linear trend continues beyond the observed values.
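
The two reliability conditions can be written as a small check. This is a minimal sketch: the function name `is_reliable`, the default threshold of $0.75$ (taken from the guideline in this section), and the illustrative value of $r$ are assumptions made for the example.

```python
def is_reliable(r, x_new, xs, threshold=0.75):
    """Return True if a prediction at x_new meets both conditions:
    strong correlation and x_new inside the range of the observed data."""
    strong = abs(r) >= threshold
    interpolating = min(xs) <= x_new <= max(xs)
    return strong and interpolating

paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x values from Example 1
r_strong = 0.8  # illustrative strong correlation, |r| >= 0.75

print(is_reliable(r_strong, 65, paper1))  # True: 65 lies between 30 and 90
print(is_reliable(r_strong, 95, paper1))  # False: 95 is outside the range
```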

4. Regression Line of $x$ on $y$

In certain contexts, both $x$ and $y$ are measured quantities with no clear "independent vs. dependent" distinction (e.g., the heights and weights of students). In these scenarios, we can construct two distinct regression lines:

  • The Regression Line of $y$ on $x$: $y = ax + b$.
    Used specifically to predict $y$ when a value of $x$ is provided.
  • The Regression Line of $x$ on $y$: $x = cy + d$.
    Used specifically to predict $x$ when a value of $y$ is provided.

Crucially, both regression lines intersect exactly at the mean point $(\bar{x}, \bar{y})$.

EXAMPLE 3

Using the data from Example 1, assume we want to predict a student's score on Paper 1 based on their known score on Paper 2. We must calculate the regression line of $x$ on $y$.

  • Using a GDC, the regression line of $x$ on $y$ is found to be:
    $x = 0.964y - 0.925$
  • If a student scores $y = 70$ on Paper 2, we predict their Paper 1 score by substituting $y = 70$ into this equation:
    $x = 0.964(70) - 0.925 = 66.555 \approx 67$

Important Warning: Do not predict $x$ by substituting $y = 70$ into the regression line of $y$ on $x$ from Example 2 and solving for $x$; doing so gives about $69$, not $67$. The line of $y$ on $x$ minimizes the vertical ($y$) residuals, while the line of $x$ on $y$ minimizes the horizontal ($x$) residuals, so the two lines are not algebraically interchangeable. Always use the line whose subject is the variable you wish to predict.
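
To see the difference between the two lines numerically, the sketch below fits both to the Example 1 data, checks that each passes through the mean point, and compares the two ways of estimating $x$ when $y = 70$. As before, `fit_line` is a hypothetical helper written for this illustration, and swapping its arguments is simply one way to obtain the $x$ on $y$ line.

```python
def fit_line(us, vs):
    """Least-squares line vs = slope * us + intercept; returns (slope, intercept)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    slope = s_uv / s_uu
    return slope, mean_v - slope * mean_u

paper1 = [55, 50, 30, 70, 40, 75, 60, 90]  # x values from Example 1
paper2 = [50, 60, 45, 65, 35, 80, 75, 85]  # y values from Example 1

a, b = fit_line(paper1, paper2)  # y on x:  y = a*x + b
c, d = fit_line(paper2, paper1)  # x on y:  x = c*y + d

# Both lines pass through the mean point (up to floating-point error)
mean_x = sum(paper1) / len(paper1)
mean_y = sum(paper2) / len(paper2)
assert abs(a * mean_x + b - mean_y) < 1e-9
assert abs(c * mean_y + d - mean_x) < 1e-9

x_correct = c * 70 + d      # substitute into the x on y line
x_wrong = (70 - b) / a      # rearrange the y on x line (not recommended)
print(round(x_correct), round(x_wrong))
```

The two estimates differ by roughly two marks here; the gap shrinks as $|r|$ approaches 1 and the two lines coincide only when the correlation is perfect.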