2: SAMPLING

In most statistical studies, we wish to quantify something about a population. For example, we may wish to know the prevalence of diabetes in a population, the typical age that teenagers begin to smoke, or the average birthweight of babies born in a particular community. When the population is small, it is sometimes possible to obtain information from the entire population. A study of the entire population is called a census. However, performing a census is usually impractical, expensive and time-consuming, if not downright impossible. Therefore, nearly all statistical studies are based on a subset of the population, which we will call the sample.

When selecting a sample, we need to decide how many people to study and which people from the population to select. A study's sample size depends on many factors and will be the topic of future study. For now, let us consider how to select a valid sample. A valid sample is one that represents the population to which inferences will be made. Although there is no fail-safe way to ensure that a sample is representative, much has been learned over the past half century about how to sample so as to maximize a sample's usefulness. One lesson is that, whenever possible, a probability sample should be used. A probability sample is a sample in which every population member has a known probability of being included in the sample (Cochran, 1977, p. 9).

This forms the basis by which generalizations about the population can be made.

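The simplest form of a probability sample is the simple random sample. A simple random sample is a sample in which each member of the population has an equal probability of entering the sample. This ensures that the sample will be:

(1) unbiased: each unit in the population has the same probability of entering the sample;
(2) independent: the selection of one unit has no influence over the selection of any other unit.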

These are two extremely important features of a simple random sample.

In order to select a simple random sample, it is best to start with a sampling frame of all sampling units, in which each population member is assigned an identification number between 1 and N. A random number generator is then used to determine which n individuals will be sampled. (Random number generators can be found at www.random.org/nform.html or www.randomizer.org/form.htm.) Here, for example, is a list of 10 random numbers between 1 and 600: 35, 37, 43, 143, 321, 329, 337, 492, 494, 546. Let us use these random numbers to select 10 individuals from the population located at www.sjsu.edu/faculty/gerstman/StatPrimer/populati.htm. Notice that this population contains N = 600 members, with variables AGE, SEX, HIV status, KAPOSISARComa status, REPORTDATE and OPPORTUNIStic infection. Our sample is:

ID   AGE  SEX  HIV  KAPOSISARC  REPORTDATE  OPPORTUNIS
35   21   F    Y    N           01/09/89    Y
37   42   M    Y    Y           10/21/89    Y
43   5    M    N    Y           01/12/90    Y
143  11   F    Y    N           02/17/89    Y
321  30   M    Y    Y           12/28/89    Y
329  50   M    Y    Y           12/29/89    N
337  28   M    N    N           08/19/89    Y
492  27   .    N    N           08/31/89    N
494  24   M    Y    Y           08/19/89    Y
546  52   .    Y    Y           10/13/89    Y

(Dots represent missing values.)

Let us review our procedure for selecting a simple random sample:

(1) A sampling frame of all population members is compiled.
(2) Population members are identified with unique identification numbers between 1 and N.
(3) The researcher decides on an appropriate sample size for their study.
(4) The researcher selects n random numbers between 1 and N.
(5) Persons with identification numbers determined by the random number generator are included in the sample.

Of course, in practice, selection of a simple random sample is not as "clean" as this. Still, this procedure serves as the ideal against which to compare actual survey samples.
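
The five-step procedure reviewed above can be sketched in a few lines of Python. This is only an illustration, not the method used to draw the sample shown earlier: the function name simple_random_sample and the seed value are assumptions made for the example, and Python's standard random module stands in for the web-based random number generators mentioned in the text.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Return n distinct ID numbers drawn at random from 1..N.

    Hypothetical helper for illustration: every ID in the sampling
    frame has the same chance of being selected.
    """
    if not 0 < n <= N:
        raise ValueError("sample size n must satisfy 0 < n <= N")
    rng = random.Random(seed)          # seed only makes the draw repeatable
    frame = range(1, N + 1)            # steps 1-2: sampling frame with IDs 1..N
    ids = rng.sample(frame, n)         # step 4: n random IDs, no repeats
    return sorted(ids)                 # step 5: these IDs enter the sample

# Step 3: the researcher has chosen n = 10 from a population of N = 600.
ids = simple_random_sample(N=600, n=10, seed=1)
print(ids)
```

Because random.sample draws without repeats, the 10 ID numbers are always distinct. The particular IDs depend on the seed and will not match the list given in the text.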

Random sampling can be done either with replacement or without replacement. Sampling with replacement is done by "tossing" population members back into the pool after they have been selected. This way, all N members of the population are given an equal chance of being selected at each draw, even if they have already been drawn. Sampling without replacement is done so that once a population member has been drawn, this population member is removed from the pool for all subsequent draws.
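
The difference between the two schemes can be seen in a short sketch. It assumes Python's standard random module and a population identified only by the ID numbers 1 to 600. The variable names and the seed are illustrative choices, not part of the original procedure.

```python
import random

rng = random.Random(42)               # fixed seed only for repeatability
population = list(range(1, 601))      # ID numbers 1..N, with N = 600

# With replacement: each draw is made from the full pool of N members,
# so the same ID may appear more than once.
with_replacement = rng.choices(population, k=10)

# Without replacement: a drawn ID is removed from the pool,
# so all 10 IDs are guaranteed to be distinct.
without_replacement = rng.sample(population, k=10)

print(with_replacement)
print(without_replacement)
```

With n = 10 draws from N = 600, repeats under sampling with replacement are possible but uncommon, which is one way to see why the two schemes behave similarly when the sample is a small part of the population.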

The ratio of the sample size (n) to population size (N) is called the sampling fraction. Let f represent the sampling fraction, so f = n / N. Notice that, in our illustrative sample, f = 10 / 600 = 0.0167 (about 1.7%).

Comment: Many statistical procedures assume that sampling is done with replacement. For practical reasons, however, most survey sampling is done without replacement. This makes little difference when the sampling fraction is small (say, less than 5%). However, when the sampling fraction is large, some of our procedures will have to be modified with what is known as a finite population correction factor.
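
To make the comment concrete, the sketch below computes the sampling fraction and a finite population correction factor. The text does not give a formula for the correction, so the form used here, sqrt((N - n) / (N - 1)), is an assumption: it is one commonly used version, applied as a multiplier to a standard error. The function names are hypothetical.

```python
import math

def sampling_fraction(n, N):
    """Sampling fraction f = n / N."""
    return n / N

def finite_population_correction(n, N):
    """One common form of the correction factor (an assumption here):
    sqrt((N - n) / (N - 1)), used to shrink a standard error when
    sampling without replacement from a finite population."""
    return math.sqrt((N - n) / (N - 1))

f = sampling_fraction(10, 600)
fpc = finite_population_correction(10, 600)

print(round(f, 4))    # 0.0167
print(f < 0.05)       # True: below the 5% rule of thumb
print(fpc)            # close to 1, so the correction changes little
```

For the illustrative sample the factor is very near 1, consistent with the remark that the correction matters little when the sampling fraction is small. If n were a large share of N, the factor would fall well below 1.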

Vocabulary

Census: a study in which the entire population is "sampled."
Experimental study: a study undertaken in which the researcher has control over some of the conditions in which the study takes place and can allocate an experimental factor ("treatment") being studied.
Independence: sampling such that the selection of one unit into the sample has no influence over the selection of any other unit.
Observational study: a study undertaken in which the researcher has no control over the factors being studied.
Population: the universe of potential values from which a sample is drawn.
Probability sample: a sample in which every population member has a known probability of being included in the sample.
Sample: a subset of the population.
Sampling frame: a list of the population from which a sample is drawn.
Sampling fraction: the ratio of the sample size (n) to population size (N).
Sampling with replacement: sampling in which each selected subject is returned to the sampling frame after the draw, so that it may be drawn again.
Sampling without replacement: sampling in which a selected subject is not returned to the sampling frame, so that it cannot be drawn again.
Simple random sample: a sample in which each member of the population has an equal, nonzero probability of entering the sample; simple random samples are characterized by independence and unbiasedness.
Unbiasedness: sampling so that each unit in the population has the same probability of entering the sample.

Notation

n - sample size
N - population size
f - sampling fraction (f = n / N)