4.1 Basic Concepts of Statistics
1. Introduction to Statistics and Data Types
In Statistics, we deal with data collection, presentation, analysis, and the interpretation of results. To gather information and draw conclusions, we work with populations and samples:
- Population: The entire set of members of a specified group.
- Sample: A subset of the population.
Numerical data can be classified as either discrete or continuous:
- Discrete Data: Formed by a finite or countable set.
Examples: $\{10, 20, 30\}$ or $\{0, 1, 2, 3, \dots\}$
- Continuous Data: Formed by an interval or a continuous range of real numbers.
Examples: $[40, 100]$ or $\mathbb{R}$
2. Data Organization and Representation
Collected data can be organized and presented visually in several ways, such as frequency tables, pie charts, and bar graphs.
EXAMPLE 1: Data Presentation Methods
Suppose we have a collection of colored balls consisting of 13 Blue, 8 Green, 10 Red, and 3 Yellow balls. The total frequency can be computed as: $$N = 13 + 8 + 10 + 3 = 34$$
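As a minimal sketch, the counts from Example 1 can be turned into a frequency table with relative frequencies, the central angles a pie chart would use, and a crude text bar graph. The variable names and the text layout are illustrative choices, not part of the example.

```python
# Frequencies taken from Example 1
counts = {"Blue": 13, "Green": 8, "Red": 10, "Yellow": 3}

total = sum(counts.values())  # N = 34

# For each color: frequency, relative frequency, and pie-chart angle in degrees
table = {
    color: {
        "frequency": f,
        "relative": f / total,
        "angle_deg": 360 * f / total,
    }
    for color, f in counts.items()
}

print(f"{'Color':<8}{'Freq':>5}{'Rel.':>8}{'Angle':>8}")
for color, row in table.items():
    bar = "#" * row["frequency"]  # one '#' per ball, a rough bar graph
    print(
        f"{color:<8}{row['frequency']:>5}"
        f"{row['relative']:>8.1%}{row['angle_deg']:>8.1f}  {bar}"
    )
```

The relative frequencies sum to 1 and the angles to 360 degrees, which is a quick check that no category was dropped.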
3. Sampling Techniques and Avoiding Bias
In statistical analysis, it is vital to obtain an **unbiased sample** to draw accurate conclusions about the parent population. A sample that over-represents some part of the population introduces systematic error.
EXAMPLE 2: Analysis of Sampling Methods
Suppose we have a population of $100,000$ individuals and want to select a sample of $1,000$ people. If we simply pick the first $1,000$ people on an alphabetical roster, or select the youngest $1,000$ individuals, our data will suffer from clear selection bias.
Simple Random Sampling: Every member of the population has an equal probability of being selected. For example, we put all $100,000$ names into a virtual hat and draw $1,000$ of them at random.
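A minimal sketch of this draw, assuming the names are represented by the hypothetical ID numbers 1 to 100,000. The fixed seed is there only so the sketch is repeatable; a real study would not hard-code it.

```python
import random

population_size = 100_000
sample_size = 1_000

rng = random.Random(42)  # assumption: fixed seed, only for repeatability

# IDs 1..100000 stand in for the names in the virtual hat.
# sample() draws without replacement, so no ID can be picked twice.
sample_ids = rng.sample(range(1, population_size + 1), sample_size)

print(len(sample_ids), "IDs drawn; first five:", sample_ids[:5])
```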
Evaluation: Perfectly fair, but it can be highly time-consuming compared with more structured alternatives.
Systematic Sampling: We select members from an ordered list at uniform intervals. We first compute the sampling interval ($k$): $$k = \dfrac{\text{Population Size}}{\text{Sample Size}} = \dfrac{100,000}{1,000} = 100$$ We choose a random starting position among the first $100$ entries (e.g., the $20^{\text{th}}$ person) and then pick every $100^{\text{th}}$ person thereafter (i.e., the $20^{\text{th}}$, $120^{\text{th}}$, $220^{\text{th}}$, $\dots$).
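The computation of $k$ and the selection rule above can be sketched as follows. The function name `systematic_positions` is a hypothetical helper, and positions are counted from 1 to match the text.

```python
import random

def systematic_positions(population_size, sample_size, start):
    """Return the 1-based list positions chosen by systematic sampling."""
    k = population_size // sample_size  # k = 100 in the example
    if not 1 <= start <= k:
        raise ValueError("start must lie among the first k entries")
    # start, start + k, start + 2k, ... until sample_size positions are taken
    return [start + i * k for i in range(sample_size)]

# The case from the text: start at the 20th person
print(systematic_positions(100_000, 1_000, 20)[:3])  # [20, 120, 220]

# In practice the start is drawn at random from 1..k (inclusive)
rng = random.Random(7)  # assumption: fixed seed, only for repeatability
random_start = rng.randint(1, 100)
positions = systematic_positions(100_000, 1_000, random_start)
```

Only one random number is needed for the whole sample, which is where the method's efficiency comes from.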
Evaluation: Highly efficient, but it is vulnerable when the list contains a periodic pattern. If the list is grouped in blocks of 100 and the first entry of each block is always the department manager, picking every $100^{\text{th}}$ person could produce a biased sample consisting entirely of managers or missing them completely.
Stratified Sampling: The population is divided into distinct subgroups (strata) based on a criterion (e.g., separating by gender into men and women, or by age into under and over 40 years old). We then draw a random sample from each stratum.
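A minimal sketch of splitting into strata and sampling each one, using the age split at 40 as the criterion. The population here is synthetic (made-up ages), the helper `stratified_sample` is hypothetical, and assigning people aged exactly 40 to the older stratum is an assumption, since the text does not say where they belong.

```python
import random

def stratified_sample(members, stratum_of, per_stratum, rng):
    """Group members by stratum label, then sample randomly inside each group."""
    strata = {}
    for m in members:
        strata.setdefault(stratum_of(m), []).append(m)
    return {
        label: rng.sample(group, min(per_stratum, len(group)))
        for label, group in strata.items()
    }

rng = random.Random(0)  # assumption: fixed seed, only for repeatability

# Synthetic population of (id, age) pairs; the ages are invented for illustration
people = [(i, rng.randint(18, 65)) for i in range(1, 1001)]

def age_stratum(person):
    return "under 40" if person[1] < 40 else "40 and over"

sample = stratified_sample(people, age_stratum, per_stratum=50, rng=rng)
for label, group in sample.items():
    print(label, len(group))
```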
Quota Sampling: This method mirrors stratified sampling, but the sample size for each subgroup is chosen to be proportional to that subgroup's share of the wider population. Selection within each category is non-random, typically relying on convenience.
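A rough sketch of this proportional, convenience-based selection. The 60/40 population shares, the arrival order, and the helper `quota_sample` are all hypothetical; rounding the targets can also make them sum to slightly more or less than the requested sample size.

```python
def quota_sample(arrivals, group_of, population_shares, sample_size):
    """Fill each subgroup's proportional target with whoever turns up first."""
    quotas = {g: round(share * sample_size) for g, share in population_shares.items()}
    chosen = {g: [] for g in quotas}
    for person in arrivals:  # convenience: taken in the order people are available
        g = group_of(person)
        if g in chosen and len(chosen[g]) < quotas[g]:
            chosen[g].append(person)
        if all(len(chosen[q]) == quotas[q] for q in quotas):
            break  # every target is filled
    return quotas, chosen

# Hypothetical shares of the wider population
shares = {"men": 0.6, "women": 0.4}

# Hypothetical arrival order: (id, group) pairs, alternating men and women
arrivals = [(i, "men" if i % 2 else "women") for i in range(1, 41)]

quotas, chosen = quota_sample(arrivals, lambda p: p[1], shares, sample_size=10)
print(quotas)
print([pid for pid, _ in chosen["men"]], [pid for pid, _ in chosen["women"]])
```

Because the loop simply accepts the earliest arrivals, nobody who turns up after the targets are filled can be chosen, which is exactly the non-random element that separates this method from stratified sampling.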