IB DP Maths AI HL Study Notes

4.1.1 Data Collection

Sampling Methods

Sampling is the process of choosing a subset of elements from a larger set, known as the population, in order to estimate properties of that population. The main sampling methods are described below.

Simple Random Sampling

  • Definition: Every member of the population has an equal chance of being selected.
  • Example: Selecting 50 students from a school of 1000 using a random number generator.
  • Advantages: Minimises bias and is easy to analyse.
  • Disadvantages: Requires a complete list of the population (a sampling frame), which can be impractical for large populations, and small subgroups may be under-represented by chance.

In-depth: Simple random sampling is akin to a lottery. If we were to select a sample of 50 students from a school of 1000, each student should have an equal chance, i.e., a 1 in 20 chance of being selected. This method is most effective when the population is homogeneous, where each member is similar to the next.
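
The lottery idea can be sketched in a few lines of Python. This is a minimal illustration only: the list student_ids and the fixed seed are assumptions made for the example, not part of the syllabus.

```python
import random

# Hypothetical population: student ID numbers 1 to 1000 (assumed for illustration).
student_ids = list(range(1, 1001))

rng = random.Random(42)                 # fixed seed so the draw can be repeated
sample = rng.sample(student_ids, k=50)  # 50 students, drawn without replacement

# Each student's chance of selection is sample size / population size.
selection_probability = len(sample) / len(student_ids)
print(sorted(sample)[:5])      # first few selected IDs
print(selection_probability)   # 0.05, i.e. 1 in 20
```

Because the draw is made without replacement, no student can appear in the sample twice.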

Stratified Sampling

  • Definition: The population is divided into subgroups (strata) and random samples are taken from each stratum.
  • Example: In a survey of 1000 people, sampling from each age group in proportion to its share of the population.
  • Advantages: Ensures representation from all subgroups.
  • Disadvantages: Can be complex to administer and requires detailed population knowledge.

In-depth: Stratified sampling is particularly useful when we anticipate variations between different strata or subgroups within a population. For instance, if we were to survey consumer habits across different age groups, stratified sampling would ensure that each age group is adequately represented, thereby providing more reliable and nuanced data.
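
One way to picture this is the sketch below, which takes the same fraction from every stratum (proportional allocation). The function stratified_sample, the age-group labels, and the group sizes are all hypothetical choices made for illustration.

```python
import random

def stratified_sample(strata, fraction, seed=0):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    chosen = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)      # sample size for this stratum
        chosen[name] = rng.sample(members, k)   # random draw within the stratum
    return chosen

# Hypothetical population of 1000 people split into three age groups.
strata = {
    "18-29": [f"A{i}" for i in range(300)],
    "30-49": [f"B{i}" for i in range(450)],
    "50+": [f"C{i}" for i in range(250)],
}

chosen = stratified_sample(strata, fraction=0.1)
sizes = {name: len(members) for name, members in chosen.items()}
print(sizes)  # {'18-29': 30, '30-49': 45, '50+': 25}
```

Each age group contributes 10% of its members, so the sample keeps the same age profile as the population.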

Cluster Sampling

  • Definition: The population is divided into clusters, and a random sample of clusters is chosen. All members from selected clusters are included in the sample.
  • Example: Selecting 2 classes at random in a school to participate in a survey.
  • Advantages: Cost-effective and practical for large populations.
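  • Disadvantages: Can introduce bias if the selected clusters differ from the rest of the population, for example when members of a cluster resemble each other more than they resemble members of other clusters.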

In-depth: Imagine a large population spread across different geographical locations. Conducting a survey in each location might be logistically challenging and expensive. Cluster sampling allows us to select a few locations (clusters) randomly and survey all members within those selected clusters, making it a cost-effective method.
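
The two steps (pick clusters at random, then include everyone in them) might look like this in Python. The school layout of 10 classes with 25 students each is an assumption made purely for the example.

```python
import random

# Hypothetical school: 10 classes (clusters) with 25 students in each.
classes = {
    f"Class {c}": [f"{c}-{s}" for s in range(1, 26)]
    for c in "ABCDEFGHIJ"
}

rng = random.Random(1)
selected = rng.sample(sorted(classes), k=2)  # step 1: choose 2 clusters at random

# Step 2: include every member of each selected cluster.
sample = [student for name in selected for student in classes[name]]
print(selected, len(sample))  # two class names and 50 students
```

Only the choice of classes is random here; inside a chosen class nobody is left out, which is what distinguishes this from stratified sampling.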

Systematic Sampling

  • Definition: Every nth item is selected from a list after a random start.
  • Example: Selecting every 5th person entering a shop.
  • Advantages: Simple and ensures evenly spread samples.
  • Disadvantages: Can introduce bias if the list has a pattern.

In-depth: Systematic sampling selects items at a fixed interval from an ordered list: choose a random starting point among the first n items, then take every nth item after it. For instance, if we were to survey shoppers, we might select every 10th shopper entering a store. However, caution must be exercised to ensure that the sampling interval does not coincide with a pattern in the population, since the sample would then repeatedly pick the same kind of item and be biased.
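
A minimal sketch of the procedure, assuming the shoppers are simply numbered in order of arrival; the function name systematic_sample and the seed are illustrative choices.

```python
import random

def systematic_sample(items, interval, seed=None):
    """Take every `interval`-th item after a random start in the first interval."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random 0-based start: 0 .. interval - 1
    return items[start::interval]

# Hypothetical list: shoppers numbered 1 to 100 in order of arrival.
shoppers = list(range(1, 101))
picked = systematic_sample(shoppers, interval=10, seed=3)
print(picked)  # 10 shoppers, each 10 places after the previous one
```

Only the starting point is random; once it is fixed, the rest of the sample is fully determined.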

Convenience Sampling

  • Definition: Choosing data that are easiest to obtain.
  • Example: Surveying people in a nearby location.
  • Advantages: Easy and inexpensive.
  • Disadvantages: Highly prone to bias and not reliable.

In-depth: Convenience sampling, as the name suggests, is convenient and easy but is the least reliable method. It involves choosing samples that are easiest to obtain, which often leads to highly biased data. For instance, if we were to survey individuals in a single neighbourhood about city-wide issues, the results would likely not be representative of the entire city’s population.

Bias in Data Collection

Bias refers to systematic errors that skew results away from the truth. In the context of data collection, various forms of bias can creep into the results, thereby reducing the reliability and validity of the data.

Selection Bias

  • Definition: Occurs when the sample obtained is not representative of the population intended to be analysed.
  • Example: Surveying only daytime shoppers about a retail store.
  • Impact: The results will not accurately reflect the entire customer base.

In-depth: Selection bias can significantly skew results and is often a result of a flawed sampling method. For instance, if a survey about shopping habits is conducted only on weekdays and during working hours, the sample might disproportionately represent non-working individuals or those who shop during those specific times, thereby not accurately reflecting the entire customer base.

Measurement Bias

  • Definition: Arises when data is collected inaccurately, so that the recorded values deviate systematically from the true values.
  • Example: Using a faulty scale to measure weight.
  • Impact: The data collected will not be reliable.

In-depth: Measurement bias can be introduced through faulty equipment or flawed data collection methods. For instance, if a scale used to measure weight in a scientific study is not calibrated correctly, every measurement will be shifted in the same direction, so taking more readings does not remove the error. This leads to inaccurate results and potentially incorrect conclusions.
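
The effect of a miscalibrated scale can be illustrated with made-up numbers. In the sketch below, the true weights and the 1.5 kg offset are assumptions chosen only to show the idea.

```python
from statistics import mean

# Hypothetical true weights in kg (invented for illustration).
true_weights = [61.2, 70.5, 58.9, 82.4, 66.0]

offset = 1.5  # assumed calibration error: the scale reads 1.5 kg too high
measured = [w + offset for w in true_weights]

# The bias is the gap between the measured mean and the true mean.
bias = mean(measured) - mean(true_weights)
print(round(bias, 2))  # 1.5
```

The mean is shifted by the full offset however many readings are taken, which is why this kind of error is described as systematic rather than random.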

Confirmation Bias

  • Definition: The tendency to search for, interpret, and remember information that confirms one’s preconceptions.
  • Example: Ignoring data that contradicts pre-established hypotheses.
  • Impact: Leads to one-sided conclusions and does not provide a holistic view.

In-depth: Confirmation bias is a psychological bias where researchers might subconsciously select or interpret data in a way that confirms their hypotheses or beliefs. It is crucial to approach data collection and analysis with an open mind and be willing to accept data even if it contradicts pre-established beliefs or hypotheses.

Non-response Bias

  • Definition: Occurs when participants chosen for a survey do not respond, causing the sample to be non-representative.
  • Example: If only satisfied customers respond to a feedback survey.
  • Impact: The results will not accurately depict the overall customer sentiment.

In-depth: Non-response bias occurs when individuals selected for a sample do not respond, and their non-response is related to the phenomenon being studied. For instance, if a customer satisfaction survey is sent out and mainly satisfied customers respond because they are more engaged or have had a positive experience, the results will be skewed positively and will not accurately reflect the overall customer sentiment.
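
A small worked example shows how large the distortion can be. All figures below (600 satisfied and 400 dissatisfied customers, with response rates of 50% and 10%) are hypothetical and chosen only for illustration.

```python
# Hypothetical customer base (invented figures).
satisfied, dissatisfied = 600, 400

# Assumed response rates: satisfied customers reply far more often.
responding_satisfied = satisfied * 50 // 100        # 300 replies
responding_dissatisfied = dissatisfied * 10 // 100  # 40 replies

true_rate = satisfied / (satisfied + dissatisfied)
observed_rate = responding_satisfied / (responding_satisfied + responding_dissatisfied)

print(f"True satisfaction:     {true_rate:.1%}")      # 60.0%
print(f"Observed satisfaction: {observed_rate:.1%}")  # 88.2%
```

Under these assumptions the survey reports about 88% satisfaction when the true figure is 60%, even though every reply is honest.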


Pilot testing, or conducting a small-scale study before the main research, is vital to identify potential issues in the data collection process. It allows researchers to test their methods, instruments, and protocols on a smaller scale, ensuring that they are effective and reliable. Through pilot testing, researchers can identify ambiguities or problems with survey questions, check the feasibility of their sampling plan, and uncover logistical challenges in data collection. This preliminary step helps to refine the data collection process, making necessary adjustments to mitigate issues in the main study, thereby enhancing the reliability and validity of the research findings.

Outliers, or data points that deviate significantly from the rest, can substantially affect both a data set and its analysis. They can skew measures of central tendency, particularly the mean, and inflate measures of dispersion such as the range and standard deviation, so the bulk of the data appears more spread out than it actually is. The median and interquartile range are much less affected. In statistical analyses, outliers can influence the results, for instance by changing the slope in regression analysis or leading to erroneous conclusions about the relationships between variables. Identifying outliers and deciding how to handle them is crucial to ensure that the analysis is accurate and that the results are representative of the population being studied.
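
The sensitivity of the mean and range, compared with the median, can be seen in a short example. The data values are invented for illustration.

```python
from statistics import mean, median

# Hypothetical data set, plus the same data with one extreme value added.
data = [12, 14, 15, 15, 16, 18]
with_outlier = data + [60]

for label, values in [("without outlier", data), ("with outlier", with_outlier)]:
    spread = max(values) - min(values)  # range
    print(label, round(mean(values), 1), median(values), spread)

# The mean rises from 15 to about 21.4 and the range from 6 to 48,
# while the median stays at 15.
```

A single extreme value is enough to pull the mean well away from where most of the data lies.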

Ethics plays a crucial role in data collection by ensuring that the methods and practices used are morally sound and adhere to regulatory and institutional guidelines. Ethical considerations include protecting the privacy and confidentiality of participants, obtaining informed consent, being transparent about how data will be used, and avoiding harm to participants. These practices safeguard participants’ rights and well-being and help ensure that the research is conducted honestly. Adhering to ethical guidelines also strengthens the credibility of the research, making it more likely that the findings will be accepted by the scientific community.

Defining the population accurately is pivotal in data collection because it establishes the boundaries of the study and ensures that the data collected is relevant to the research question. A well-defined population ensures that the sampling methods employed, and the inferences subsequently drawn, are valid for the group being studied. If the population is not accurately defined, the sample may not be representative, leading to biased results and reducing the external validity of the study. A precise definition therefore ensures that the researcher is studying the correct set of individuals or items and can draw meaningful conclusions.

Voluntary response sampling is a type of non-probability sampling where participants self-select to be part of a study or survey. This method can introduce bias because individuals who choose to participate often have strong feelings or vested interests in the subject matter, leading to over-representation of certain views or characteristics. For instance, in a survey about public transport, individuals who have had notably positive or negative experiences may be more inclined to respond, while those with neutral or mild opinions may not participate. Consequently, the results may not accurately reflect the broader population’s views or experiences, skewing data and potentially leading to misleading conclusions.

Practice Questions

A researcher is conducting a survey on the monthly expenditure of families in a city. He decides to use stratified sampling to ensure representation from different income groups. The city has 100,000 families, divided into 4 income strata: Low (20,000 families), Medium (50,000 families), High (25,000 families), and Very High (5,000 families). The researcher decides to sample 1% of the families from each stratum. Calculate the number of families to be sampled from each stratum and the total sample size.

The researcher plans to sample 1% of families from each income stratum. To find the number of families to be sampled from each stratum, we multiply the number of families in each stratum by 1% (or 0.01).

  • Low-income stratum: 20,000 × 0.01 = 200 families
  • Medium-income stratum: 50,000 × 0.01 = 500 families
  • High-income stratum: 25,000 × 0.01 = 250 families
  • Very High-income stratum: 5,000 × 0.01 = 50 families

To find the total sample size, we add the samples from each stratum: 200 + 500 + 250 + 50 = 1000 families. Thus, the researcher will sample a total of 1000 families, with samples from each stratum being 200, 500, 250, and 50 respectively.
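
The same calculation can be checked in Python. The stratum sizes and the 1% rate come from the question; the variable names are illustrative.

```python
# Stratum sizes from the question.
strata_sizes = {"Low": 20_000, "Medium": 50_000, "High": 25_000, "Very High": 5_000}
sampling_percent = 1  # 1% of each stratum

# Integer arithmetic avoids floating-point rounding.
sample_sizes = {
    name: size * sampling_percent // 100
    for name, size in strata_sizes.items()
}
total = sum(sample_sizes.values())

print(sample_sizes)  # {'Low': 200, 'Medium': 500, 'High': 250, 'Very High': 50}
print(total)         # 1000
```

Because the same percentage is taken from every stratum, the total sample is also 1% of the 100,000 families.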

A study is being conducted to understand the average time spent on homework by students in a school. The researcher decides to use systematic sampling to select the participants. The school has 800 students and the researcher wants to select a sample of 40 students. Determine the sampling interval and describe how the researcher would select the participants using systematic sampling.

To determine the sampling interval for systematic sampling, we divide the total population size by the desired sample size. In this case, the total number of students is 800 and the researcher wants a sample of 40 students.

Sampling interval = population size / sample size = 800 / 40 = 20

Thus, the researcher should select every 20th student to be part of the sample. To implement systematic sampling, the researcher would list the students in order and randomly select a starting point from the first 20 students (say, the 3rd student). From this starting point, every 20th student is then selected until the desired sample size of 40 is reached. So, the students selected would be the 3rd, 23rd, 43rd, 63rd, and so on, up to the 783rd. This method spreads the sample evenly across the student list, and the sample will be representative provided the list order has no pattern that repeats every 20 students.
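
The selected positions can be listed with a short script. The starting point of 3 is the example value used above; in practice it would be chosen at random from 1 to 20.

```python
population_size = 800
sample_size = 40

interval = population_size // sample_size  # 800 / 40 = 20
start = 3  # example starting point; normally chosen at random from 1..interval

# Positions in the student list: start, start + 20, start + 40, ...
positions = list(range(start, population_size + 1, interval))

print(interval)        # 20
print(positions[:4])   # [3, 23, 43, 63]
print(positions[-1])   # 783
print(len(positions))  # 40
```

Whatever starting point between 1 and 20 is chosen, the list contains exactly 40 positions.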

