4.1 Basic Concepts of Statistics & 4.2 Measures of Central Tendency and Spread

Understanding Data and Samples

Statistics concerns the collection, presentation and analysis of data. A population is the entire set of items under investigation, whereas a sample is a subset drawn from that population. Because it is often impractical to measure every member of a population, we work with samples to draw conclusions about the whole.

Data values may be discrete (taken from a finite or countable set) or continuous (taking any real value in an interval). Organising data helps us to see patterns; common representations include frequency tables, pie charts, bar graphs, histograms and stem‑and‑leaf diagrams. The key distinctions are summarised below.

Population vs sample: the population is the whole group; a sample is a smaller subset drawn to estimate population characteristics.
Discrete vs continuous: discrete data arise from counting (e.g. number of children); continuous data arise from measuring and can take any value in an interval.
Data displays: frequency tables, pie charts, bar graphs (for discrete data), histograms (for continuous data) and stem‑and‑leaf diagrams help organise and visualise data.
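
As a small illustration of organising discrete data, the sketch below builds a frequency table with Python's standard library. The household data are hypothetical and chosen only for the example; they do not come from this section.

```python
from collections import Counter

# Hypothetical discrete data: number of children in each of 12 households.
children = [0, 2, 1, 2, 3, 1, 2, 0, 1, 2, 4, 2]

# Count how often each value occurs, then list the values in ascending order.
frequency = Counter(children)
table = sorted(frequency.items())

print("children  frequency")
for value, count in table:
    print(f"{value:>8}  {count:>9}")
```

The frequencies always add up to the number of observations, which is a quick check that no value has been missed.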

Sampling Techniques

When selecting a sample it is important to avoid bias. Several sampling methods exist:

Simple random sampling: each member has the same probability of selection; this method is fair but may be time‑consuming.
Systematic sampling: pick a random starting point and then select every k-th member; this is efficient but may be biased if a periodic pattern exists.
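Stratified sampling: divide the population into subgroups, or strata (e.g. by gender), and select a random sample within each subgroup, usually in proportion to the subgroup's size in the population.
Quota sampling: as in stratified sampling, the population is divided into subgroups and a target number (quota) is set for each; however, members are chosen non‑randomly (for example, whoever is most readily available) until each quota is filled, so the result may be biased.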

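Each method has advantages and disadvantages: simple random sampling is unbiased but needs a complete list of the population and can be slow for large populations, while systematic sampling is quick but may introduce bias if the population has a regular ordering that matches the sampling interval.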
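
The random methods above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the population of 100 numbered members, the two strata "A" and "B", the fixed seed and the function names are all assumptions made for the example, and the stratified version takes the same fraction from every stratum, which is only one possible allocation.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has the same chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Random start among the first k members, then every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, fraction, rng):
    """Simple random sample of the same fraction from every stratum."""
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical population: 100 numbered members split into two strata.
rng = random.Random(1)  # fixed seed so the run is repeatable
population = list(range(1, 101))
strata = {"A": population[:60], "B": population[60:]}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)

print(srs)
print(systematic)
print(stratified)
```

Quota sampling is left out on purpose: its selection step is not random, so it cannot be reproduced by a random number generator.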

Measures of Central Tendency

To describe a data set with a single representative value we use the mean, median and mode. These are known as measures of central tendency.

Measure	Definition	Formula / Computation
Mean	Arithmetic average	$\displaystyle \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
Median	Middle value when data are ordered	For $n$ ordered values: the middle value if $n$ is odd; the mean of the two middle values if $n$ is even
Mode	Most frequent value	The value that occurs most often

A data set may have more than one mode when several values share the highest frequency; the median divides the ordered data into two halves. For example, in the dataset $\{10,20,20,20,30,30,40,50,70,70,80\}$ the values sum to $440$, so the mean is $\tfrac{440}{11}=40$, the median is 30, and the mode is 20.
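
The three measures can be checked with Python's built-in `statistics` module. The sketch below uses the data set from the example; the second list, `tied`, is a hypothetical addition that shows a case where the mode is not unique.

```python
from statistics import mean, median, multimode

data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]

centre = {
    "mean": mean(data),        # sum of the values divided by their count
    "median": median(data),    # middle value of the ordered data
    "modes": multimode(data),  # list of all most frequent values
}
print(centre)

# Hypothetical data in which two values tie for the highest frequency.
tied = [1, 1, 2, 2, 3]
print(multimode(tied))
```

`multimode` returns a list, so a data set with several modes is reported in full rather than reduced to a single value.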

Measures of Spread

Measures of central tendency are complemented by measures of spread, which indicate how dispersed the data are around the centre. Important measures include the range, inter‑quartile range (IQR), variance and standard deviation.

Range: the difference between the largest and smallest data values, $\mathrm{range} = x_{\max} - x_{\min}$.
Quartiles: $Q_1$ is the median of the lower half of the ordered data, $Q_2$ is the median of the entire set and $Q_3$ is the median of the upper half. When the number of values is odd, the overall median is excluded from both halves. The inter‑quartile range is $\mathrm{IQR} = Q_3 - Q_1$.
Variance: measures average squared deviation from the mean; for sample data $s^2 = \dfrac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2$.
Standard deviation: $s = \sqrt{s^2}$, providing a measure of dispersion in the same units as the data. A larger standard deviation indicates more variability.
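
The measures listed above can be computed as follows. This is a sketch under one stated assumption: the helper `quartiles` (a name introduced here) uses the median-of-halves rule and leaves the overall median out of both halves when the number of values is odd. Other quartile conventions exist and may give slightly different results.

```python
from statistics import median, stdev, variance

def quartiles(values):
    """Return (Q1, Q2, Q3) by the median-of-halves rule.

    Assumes at least two values. For an odd count the overall median
    is excluded from both the lower and the upper half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(ordered), median(upper)

data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]

q1, q2, q3 = quartiles(data)
spread = {
    "range": max(data) - min(data),
    "IQR": q3 - q1,
    "variance": variance(data),  # sample variance, divisor n - 1
    "std_dev": stdev(data),      # square root of the sample variance
}
print((q1, q2, q3))
print(spread)
```

`statistics.variance` and `statistics.stdev` use the divisor $n-1$, matching the sample formula above; `pvariance` and `pstdev` would divide by $n$ instead.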

An outlier is a value that lies more than $1.5\times\mathrm{IQR}$ below $Q_1$ or above $Q_3$. Outliers merit further investigation as they may reflect measurement errors or interesting features of the data.
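
The outlier rule translates directly into two boundary values, often called fences. The sketch below is illustrative: the function names are introduced here, the quartiles use the median-of-halves rule, and the extra reading of 200 is hypothetical, added only so that one value is actually flagged.

```python
from statistics import median

def outlier_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values lying strictly outside the fences."""
    low, high = outlier_fences(values, k)
    return [x for x in values if x < low or x > high]

data = [10, 20, 20, 20, 30, 30, 40, 50, 70, 70, 80]
# Hypothetical extra reading, added only to show an outlier being flagged.
with_extreme = data + [200]

print(outlier_fences(data), find_outliers(data))
print(outlier_fences(with_extreme), find_outliers(with_extreme))
```

Flagging a value is only the first step; whether to keep, correct or discard it depends on what caused it.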

Visualising Spread: Box‑and‑Whisker Plot

A box‑and‑whisker plot displays $Q_1$, $Q_2$ (median) and $Q_3$ as the left edge, central line and right edge of a rectangular box. Whiskers extend to the smallest and largest values that are not outliers. The plot below shows the box‑and‑whisker diagram for the sample dataset used above: the box runs from $20$ to $70$ with the median line at $30$, and the whiskers reach $10$ and $80$.

Worked Example: Descriptive Statistics

Example 1 – Describing a Data Set

Consider the eleven data values
$10,20,20,20,30,30,40,50,70,70,80$.

Mean: Sum the values and divide by 11. The total is $440$, giving $\bar{x} = 440/11 = 40$.
Median: The middle value (the 6th of the 11 ordered values) is $30$.
Mode: The value $20$ occurs three times, more than any other value.
Quartiles: $Q_1 = 20$ (median of the lower five values), $Q_2 = 30$, $Q_3 = 70$ (median of the upper five values), giving $\mathrm{IQR} = 70 - 20 = 50$. The range is $80 - 10 = 70$.
Standard deviation: The squared deviations from $\bar{x} = 40$ sum to $5800$, so $s^2 = \dfrac{1}{10}\sum (x_i-\bar{x})^2 = 580$ and $s = \sqrt{580} \approx 24.1$ (in the units of the data).
Outliers: Values below $Q_1 - 1.5\times\mathrm{IQR} = 20 - 75 = -55$ or above $Q_3 + 1.5\times\mathrm{IQR} = 70 + 75 = 145$ would be considered outliers. Every value lies between $10$ and $80$, so this data set has no outliers.

These statistics provide a concise numerical summary and help us interpret the distribution’s shape and spread.

Continue to the next section to learn how to compute statistics for grouped data and display information in frequency tables.