4.4 Regression

1. Bivariate Data and Scatter Diagrams

We often study paired observations $(x, y)$ of two variables to see whether they are correlated. $x$ is called the independent variable and $y$ the dependent variable. Plotting the pairs as points on a Cartesian plane gives a scatter diagram.

Correlation Assessment: We evaluate correlation visually based on the pattern of the scatter diagram:

Linear Correlation: The points roughly follow a straight line.
Non-linear Correlation: The points roughly follow a curve (e.g., quadratic or exponential).
Zero Correlation: The points are randomly scattered with no discernible pattern.

Direction of Correlation (for Linear cases):

Positive Correlation: As $x$ increases, $y$ tends to increase.
Negative Correlation: As $x$ increases, $y$ tends to decrease.

EXAMPLE 1

A mathematics exam comprises two papers: Paper 1 and Paper 2. For a group of 8 students, the scores achieved are as follows:

$x$ (Paper 1)	55	50	30	70	40	75	60	90
$y$ (Paper 2)	50	60	45	65	35	80	75	85

The scatter diagram shows a clear pattern: the points lie roughly along a straight line, and as $x$ (Paper 1 score) increases, $y$ (Paper 2 score) tends to increase.

Therefore, we can conclude there is a positive linear correlation between the two exam scores.

2. Pearson's Correlation Coefficient ($r$)

A GDC (graphic display calculator) provides Pearson's correlation coefficient, denoted $r$, which quantifies the strength of a linear relationship.

The value of $r$ always satisfies $-1 \le r \le 1$.
If $r > 0$, the correlation is positive.
If $r < 0$, the correlation is negative.
A value of $r = \pm 1$ indicates a perfect straight-line relationship.
A value of $r = 0$ indicates no linear correlation (a non-linear relationship may still exist).

The closer $|r|$ is to 1, the stronger the linear correlation. The following thresholds are used as a rule of thumb:
$|r| \ge 0.75 \implies \text{Strong correlation}$
$0.5 \le |r| < 0.75 \implies \text{Moderate correlation}$
$|r| < 0.5 \implies \text{Weak correlation}$
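
As a rough illustration of what the GDC does internally, $r$ can be computed as $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, where $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$, $S_{xx} = \sum (x - \bar{x})^2$ and $S_{yy} = \sum (y - \bar{y})^2$. The Python sketch below applies this to the Example 1 scores and then applies the thresholds listed above. The function names `pearson_r` and `describe` are illustrative assumptions, not part of any calculator or library.

```python
import math

# Scores from Example 1 (x = Paper 1, y = Paper 2).
xs = [55, 50, 30, 70, 40, 75, 60, 90]
ys = [50, 60, 45, 65, 35, 80, 75, 85]


def pearson_r(xs, ys):
    """Pearson's r from sums of deviations about the means (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def describe(r):
    """Label r using the 0.75 / 0.5 thresholds and the sign of r."""
    if r == 0:
        return "no linear correlation"
    size = abs(r)
    strength = "strong" if size >= 0.75 else "moderate" if size >= 0.5 else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


r = pearson_r(xs, ys)
print(round(r, 3), describe(r))
```

In an exam setting the GDC value is what matters; the sketch only shows that the coefficient depends on nothing more than the deviations of each variable from its mean.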

3. The Regression Line (Line of Best Fit)

The regression line of $y$ on $x$ is the straight line that best fits the data points in the least-squares sense: it minimizes the sum of the squared vertical distances from the points to the line. Its equation takes the form:

$y = ax + b$

This line always passes through the mean point $(\bar{x}, \bar{y})$.
The primary utility of the regression line is to predict the value of the dependent variable $y$ for a given value of the independent variable $x$.
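
For readers who want to see where $a$ and $b$ come from, the least-squares line has gradient $a = S_{xy} / S_{xx}$ and intercept $b = \bar{y} - a\bar{x}$, which is why it passes through the mean point. The sketch below computes both for the Example 1 scores; `fit_y_on_x` is a hypothetical helper name, and a GDC may display its results with different rounding.

```python
# Scores from Example 1 (x = Paper 1, y = Paper 2).
xs = [55, 50, 30, 70, 40, 75, 60, 90]
ys = [50, 60, 45, 65, 35, 80, 75, 85]


def fit_y_on_x(xs, ys):
    """Least-squares line y = a*x + b (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx          # gradient
    b = y_bar - a * x_bar    # intercept chosen so the line hits the mean point
    return a, b, x_bar, y_bar


a, b, x_bar, y_bar = fit_y_on_x(xs, ys)

# Evaluating the line at x_bar should return y_bar (up to rounding error).
on_line = abs((a * x_bar + b) - y_bar) < 1e-9
print(f"y = {a:.3f}x + {b:.3f}", on_line)
```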

EXAMPLE 2

Consider the data from Example 1 (Paper 1 and Paper 2 scores). Using a GDC, we can determine the following statistical parameters:

Mean Point: $\bar{x} = 58.75$ and $\bar{y} = 61.875$. The mean point is $(\bar{x}, \bar{y}) = (58.75, 61.875)$.
Pearson's Coefficient: $r = 0.880$. This confirms a strong positive linear correlation between the two papers.
Regression Line: The calculator outputs $a = 0.803$ and $b = 14.7$. Thus, the equation of the regression line of $y$ on $x$ is:
$y = 0.803x + 14.7$

Prediction Using the Regression Line:

Suppose a new student scores 65 on Paper 1. We can estimate their expected score on Paper 2 by substituting $x = 65$ into the regression line equation:

$y = 0.803(65) + 14.7 = 66.895 \approx 67$

Reliability of Predictions:

A prediction is generally considered reliable if two conditions are met:

The correlation is strong ($|r| \ge 0.75$).
The input value $x$ lies within the established range of the original data. In our example, the $x$ values span from $30$ to $90$. Since $65$ falls safely within this interval (interpolation), the prediction of $67$ is considered reliable.

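If we attempted to predict a score for $x = 95$, the input would fall outside the data range (extrapolation). Such predictions are considered unreliable, because nothing in the data shows that the linear trend continues beyond the observed values.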
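
The two reliability conditions can be expressed as a simple check. The following Python sketch fits the line to the Example 1 scores, predicts $y$ for a new $x$, and flags the prediction as reliable only when $|r| \ge 0.75$ and $x$ lies inside the observed range. The helper names `fit_and_r` and `predict` are assumptions made for illustration.

```python
import math

# Scores from Example 1 (x = Paper 1, y = Paper 2).
xs = [55, 50, 30, 70, 40, 75, 60, 90]
ys = [50, 60, 45, 65, 35, 80, 75, 85]


def fit_and_r(xs, ys):
    """Return gradient a, intercept b and Pearson's r (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    r = s_xy / math.sqrt(s_xx * s_yy)
    return a, b, r


def predict(x_new, xs, ys, strong=0.75):
    """Predict y and report whether both reliability conditions hold."""
    a, b, r = fit_and_r(xs, ys)
    y_hat = a * x_new + b
    interpolating = min(xs) <= x_new <= max(xs)   # inside the data range?
    reliable = abs(r) >= strong and interpolating
    return y_hat, reliable


for x_new in (65, 95):
    y_hat, reliable = predict(x_new, xs, ys)
    print(x_new, round(y_hat), "reliable" if reliable else "unreliable")
```

The check is deliberately binary, mirroring the rule stated above; in practice reliability degrades gradually as $x$ moves away from the data.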

4. Regression Line of $x$ on $y$

In certain contexts, both variables $x$ and $y$ are random and neither is naturally the "independent" one (e.g., the heights and weights of students). In these scenarios, we can construct two distinct regression lines:

The Regression Line of $y$ on $x$: $y = ax + b$.
Used specifically to predict $y$ when a value of $x$ is provided.
The Regression Line of $x$ on $y$: $x = cy + d$.
Used specifically to predict $x$ when a value of $y$ is provided.

Crucially, both regression lines intersect exactly at the mean point $(\bar{x}, \bar{y})$.
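
The intersection property can be checked numerically. The sketch below fits both lines to the Example 1 scores by swapping the roles of the two lists, then solves $y = ax + b$ and $x = cy + d$ simultaneously. The helper name `fit` is an assumption; the algebra requires $ac \ne 1$, which holds whenever $|r| < 1$ because $ac = r^2$.

```python
# Scores from Example 1 (x = Paper 1, y = Paper 2).
xs = [55, 50, 30, 70, 40, 75, 60, 90]
ys = [50, 60, 45, 65, 35, 80, 75, 85]


def fit(us, vs):
    """Least-squares line v = m*u + k, i.e. v regressed on u (hypothetical helper)."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    k = v_bar - m * u_bar
    return m, k


a, b = fit(xs, ys)   # y on x:  y = a*x + b
c, d = fit(ys, xs)   # x on y:  x = c*y + d

# Substitute y = a*x + b into x = c*y + d and solve for x.
x_cross = (c * b + d) / (1 - a * c)
y_cross = a * x_cross + b

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
print(round(x_cross, 3), round(y_cross, 3), (x_bar, y_bar))
```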

EXAMPLE 3

Using the data from Example 1, assume we want to predict a student's score on Paper 1 based on their known score on Paper 2. We must calculate the regression line of $x$ on $y$.

Using a GDC, the regression line of $x$ on $y$ is found to be:
$x = 0.964y - 0.925$
If a student scores $y = 70$ on Paper 2, we predict their Paper 1 score by substituting $y = 70$ into this equation:
$x = 0.964(70) - 0.925 = 66.555 \approx 67$

Important Warning: Do not predict $x$ by substituting $y = 70$ into the regression line of $y$ on $x$ from Example 2 and solving for $x$. The two lines minimize different residuals (vertical distances for $y$ on $x$, horizontal distances for $x$ on $y$), so one is not an algebraic rearrangement of the other unless $r = \pm 1$. Always use the line whose subject is the variable you wish to predict.
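
To make the warning concrete, the sketch below compares the two approaches for $y = 70$ using the Example 1 scores: the regression line of $x$ on $y$, and the rearranged regression line of $y$ on $x$. The helper name `fit` and the variable names are illustrative assumptions. The two estimates differ, and the gap widens as $|r|$ moves away from 1.

```python
# Scores from Example 1 (x = Paper 1, y = Paper 2).
xs = [55, 50, 30, 70, 40, 75, 60, 90]
ys = [50, 60, 45, 65, 35, 80, 75, 85]


def fit(us, vs):
    """Least-squares line v = m*u + k, i.e. v regressed on u (hypothetical helper)."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    return m, v_bar - m * u_bar


a, b = fit(xs, ys)   # y on x
c, d = fit(ys, xs)   # x on y

y_known = 70
x_from_x_on_y = c * y_known + d        # appropriate line for predicting x
x_from_inverting = (y_known - b) / a   # rearranged y-on-x line (the misuse)

print(round(x_from_x_on_y, 2), round(x_from_inverting, 2))
```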