4.1 Basic Concepts of Statistics

1. Introduction to Statistics and Data Types

In Statistics, we deal with data collection, presentation, analysis, and the interpretation of results. To gather information and draw conclusions, we work with populations and samples:

Population vs. Sample:
  • Population: The entire set of individuals or items in a specified group.
  • Sample: A subset of the population.
We usually investigate a small, representative sample and use it to draw valid conclusions about the whole population.

Types of Numerical Data:

Numerical data can be classified as either discrete or continuous:

  • Discrete Data: Formed by a finite or countable set of values.
    Examples: $\{10, 20, 30\}$ or $\{0, 1, 2, 3, \dots\}$
  • Continuous Data: Formed by an interval or a continuous range of real numbers.
    Examples: $[40, 100]$ or $\mathbb{R}$

2. Data Organization and Representation

Collected data can be organized and presented visually in several ways, such as frequency tables, pie charts, and bar graphs.

EXAMPLE 1: Data Presentation Methods

Suppose we have a collection of colored balls consisting of 13 Blue, 8 Green, 10 Red, and 3 Yellow balls. The total frequency can be computed as: $$N = 13 + 8 + 10 + 3 = 34$$

(a) Frequency Table:

| Colored Balls | Frequency ($f$) |
|---------------|-----------------|
| Blue          | 13              |
| Green         | 8               |
| Red           | 10              |
| Yellow        | 3               |
| Total         | 34              |

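(b) Pie Chart:
[Figure: pie chart with four sectors labeled Blue (13), Green (8), Red (10), and Yellow (3).]

(c) Bar Graph (for Discrete Data):
[Figure: bar graph of frequency (vertical axis, 0 to 14 in steps of 2) for each color (Blue, Green, Red, Yellow).]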
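
The arithmetic behind Example 1 can be sketched in a few lines of Python. This is only an illustration: the names `frequencies`, `relative_frequency`, and `sector_angle` are assumptions introduced here. The relative-frequency and sector-angle columns are additions that follow from the usual pie-chart rule (each sector's angle is the category's share of $360^\circ$).

```python
# Frequencies from Example 1 (color -> number of balls).
frequencies = {"Blue": 13, "Green": 8, "Red": 10, "Yellow": 3}

# Total frequency N = 13 + 8 + 10 + 3.
total = sum(frequencies.values())


def relative_frequency(count, n):
    """Share of the total that one category represents."""
    return count / n


def sector_angle(count, n):
    """Pie-chart sector angle in degrees for one category."""
    return 360 * count / n


# Print a small table: color, frequency, relative frequency, sector angle.
for color, f in frequencies.items():
    rel = relative_frequency(f, total)
    angle = sector_angle(f, total)
    print(f"{color:<8}{f:>4}{rel:>8.3f}{angle:>8.1f}")
print(f"{'Total':<8}{total:>4}")
```

The sector angles always add up to $360^\circ$, which is a quick check that no category was left out of the chart.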

3. Sampling Techniques and Avoiding Bias

In statistical analysis, it is vital to obtain an **unbiased sample** to draw accurate conclusions about the parent population. Selecting a skewed subset introduces systematic error.

EXAMPLE 2: Analysis of Sampling Methods

Suppose we have a population of $100{,}000$ individuals and want to select a sample of $1{,}000$ people. If we simply pick the first $1{,}000$ people on an alphabetical roster, or select the youngest $1{,}000$ individuals, our data will suffer from clear selection bias.

1. Simple Random Sampling:

Every member of the population has an equal probability of being selected. For example, we put all $100{,}000$ names into a virtual hat and draw $1{,}000$ of them at random.

Evaluation: Fair to every member of the population, but it needs a complete list of the population and can be time-consuming compared with the other methods.
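
As a minimal sketch of the "virtual hat", assuming the population members are simply numbered $1$ to $100{,}000$, Python's standard `random` module can draw the sample without replacement. The function name `simple_random_sample` and the `seed` parameter are assumptions made for this illustration.

```python
import random

POPULATION_SIZE = 100_000
SAMPLE_SIZE = 1_000


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct member numbers from 1..population_size."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # sample() selects without replacement, so no member is picked twice.
    return rng.sample(range(1, population_size + 1), sample_size)


sample = simple_random_sample(POPULATION_SIZE, SAMPLE_SIZE, seed=0)
print(len(sample), "members selected; first five:", sample[:5])
```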

2. Systematic Sampling:

We select members from an ordered list at uniform intervals. We first compute the sampling interval ($k$): $$k = \dfrac{\text{Population Size}}{\text{Sample Size}} = \dfrac{100{,}000}{1{,}000} = 100$$ We choose a random starting position among the first $100$ entries (e.g., the $20^{\text{th}}$ person) and then pick every $100^{\text{th}}$ person thereafter (i.e., the $20^{\text{th}}$, $120^{\text{th}}$, $220^{\text{th}}$, $\dots$).

Evaluation: Highly efficient, but it is vulnerable to any periodic pattern in the list. If the list is grouped in blocks of 100 and the first entry of each block is always the department manager, picking every $100^{\text{th}}$ person could yield a biased sample that consists entirely of managers or contains none at all.
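
The procedure and its weakness can be sketched as follows, assuming list positions numbered from $1$. The names `systematic_sample` and `is_manager` are hypothetical. The manager layout (positions $1$, $101$, $201$, $\dots$) is the illustrative case described in the evaluation, not real data.

```python
import random


def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random start in the first k positions, then every k-th position."""
    k = population_size // sample_size   # interval between selected positions
    rng = random.Random(seed)
    start = rng.randint(1, k)            # randint includes both endpoints
    positions = range(start, population_size + 1, k)
    return list(positions)[:sample_size]


def is_manager(position, block_size=100):
    """Assumed layout: the first entry of every block of 100 is a manager."""
    return position % block_size == 1


sample = systematic_sample(100_000, 1_000, seed=0)
managers = sum(is_manager(p) for p in sample)
print("first positions:", sample[:3])
print("managers in sample:", managers, "of", len(sample))
```

Because the interval equals the block length, the count of managers is either $0$ or $1{,}000$, depending only on the random starting position.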

3. Stratified Sampling:

The population is divided into distinct subgroups (strata) based on a criterion (e.g., separating by gender into men and women, or separating by age into under and over 40 years old). We then draw a random sample from each stratum.
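
A minimal sketch of the idea, under stated assumptions: the roster below is randomly generated and purely hypothetical, people aged exactly 40 are placed in the older group, and the number drawn from each stratum (50) is an arbitrary choice for illustration. The names `stratified_sample` and `age_group` are also introduced here.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, per_stratum, seed=None):
    """Group records into strata, then draw a random sample from each one."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    # Random selection without replacement inside every stratum.
    return {name: rng.sample(members, per_stratum[name])
            for name, members in strata.items()}


def age_group(record):
    """Assumed boundary: age 40 itself counts as '40 and over'."""
    _, age = record
    return "under 40" if age < 40 else "40 and over"


# Hypothetical roster of (person_id, age) pairs, generated for illustration.
roster_rng = random.Random(1)
roster = [(person_id, roster_rng.randint(18, 80))
          for person_id in range(1, 10_001)]

sample = stratified_sample(
    roster, age_group, {"under 40": 50, "40 and over": 50}, seed=2
)
for name, members in sample.items():
    print(name, len(members))
```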


4. Quota Sampling:

This method mirrors stratified sampling, but the number of individuals taken from each subgroup (its quota) is set in proportion to that subgroup's share of the wider population. Selection within each subgroup is non-random and typically relies on convenience.
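
The two steps (proportional quotas, then convenience selection) can be sketched as below. The 60/40 age split, the roster, and the names `proportional_quotas` and `quota_sample` are all assumptions made for this illustration. "Convenience" is modeled simply as taking people in the order they are encountered.

```python
def proportional_quotas(stratum_sizes, sample_size):
    """Quota per subgroup, proportional to its share of the population."""
    total = sum(stratum_sizes.values())
    # Rounding may leave the quotas slightly off sample_size in general.
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}


def quota_sample(records, stratum_of, quotas):
    """Non-random selection: accept people as encountered until quotas fill."""
    chosen = {name: [] for name in quotas}
    for record in records:
        group = stratum_of(record)
        if len(chosen[group]) < quotas[group]:
            chosen[group].append(record)
        if all(len(chosen[g]) == quotas[g] for g in quotas):
            break  # every quota is full
    return chosen


# Hypothetical population: 60,000 people under 40 and 40,000 aged 40 and over.
quotas = proportional_quotas({"under 40": 60_000, "40 and over": 40_000}, 1_000)


def age_group(person_id):
    """Assumed labeling that reproduces the 60/40 split above."""
    return "under 40" if person_id % 5 < 3 else "40 and over"


chosen = quota_sample(range(1, 100_001), age_group, quotas)
print(quotas)
print({name: len(members) for name, members in chosen.items()})
```

Everyone selected here comes from the start of the roster, which shows why quota sampling can match the population's proportions and still be biased within each subgroup.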