POPULATIONS, SAMPLES, AND BUILDING BRIDGES BETWEEN THEM IN EPIDEMIOLOGICAL STUDIES
W. Kalsbeek and G. Heiss
Department of Biostatistics and Epidemiology, respectively, School of Public Health, University of North Carolina, Chapel Hill, NC 27599-2400.
KEY WORDS: sample design, statistical inference, sample weights, analysis of data from complex samples
Shortened Title: POPULATIONS, SAMPLES, AND BRIDGES
Publication:
Kalsbeek, W and Heiss, G. (2000). "Building Bridges Between Populations and Samples in Epidemiological Studies," Annual Review of Public Health, 21:1-23.
CONTENTS
POPULATION SAMPLING IN EPIDEMIOLOGY
ISSUE 1: WHICH APPROACH TO STATISTICAL INFERENCE?
Design-Based Approach
Model-Based Approach
ISSUE 2: HOW SHOULD THE SAMPLE BE CHOSEN?
Information Gathering
Cluster Sampling?
Stratified Sampling?
Which Randomization Device?
How Large a Sample?
ISSUE 3: WHAT ABOUT SAMPLE INTEGRITY?
Frame Problems
Nonresponse
Sample Weights
ISSUE 4: WHICH SAMPLING FEATURES ARE IMPORTANT IN ANALYSIS?
CONCLUDING REMARKS
ABSTRACT
The increased use of rigorous population sampling methods and the analysis of data from those samples in cross-sectional surveys, case-control studies, longitudinal cohort investigations, and other epidemiological research efforts has raised important statistical issues for health analysts to consider. Using intuitive reasoning and a variety of empirical results from well-known data sources, this paper describes the origin, implications, and some plausible resolutions for several of these issues. Some of the main issues we consider include: establishing whom the sample represents, using sample weights, understanding the role of other important features such as the use of sampling stratification and the selection of clustered groups of population members, as well as finding ways to analyze study data with key sampling features in mind. Ultimately, resolution to all of these issues requires that analysts clearly define a reference population and then understand the role of design features in relating sample results to that population.
POPULATION SAMPLING IN EPIDEMIOLOGY
Most empirical knowledge throughout history has been based on incomplete observation and therefore samplings of the human experience (43). In each such case there has been a population, some collection of persons or objects for which knowledge was sought, and a sample, a portion of the population to be observed, thus providing the informational basis for acquiring knowledge about the population.
The need for sampling in epidemiological research stems from the nature of several overlapping types of research designs commonly used in the field (17,37,44). For each of these designs one makes statements about a targeted group of individuals called the study population, based on observations obtained from a representative portion of the population when it is impractical to examine the entire population.
Who or what is sampled, and how randomization is applied in choosing the sample, may differ among these research designs, depending on how the study population is best sampled to produce the estimates required by the design. For instance, the design of a cross-sectional study to profile the extent of disease or exposure to disease within a relatively confined timeframe, or of a field trial intending to investigate the efficacy of health intervention strategies, may call for gathering data from a sample of persons identified by selecting the communities, neighborhoods, or structures in which they live. Some of the data in cross-sectional studies and field trials may be best collected from medical or insurance records. Since the health-related interventions being evaluated in community intervention field trials are applied at the community level, sampling of subjects may proceed as above but including randomization of clusters of subjects at the community level. Sample ascertainment in case-control studies, where the research goal typically is to assess the role of a suspected exposure, often involves a mixture of random and nonrandom sampling. Identification of "cases" with the relevant medical condition is generally done through a purposive sampling of health care providers. On the other hand, the group of non-diseased "controls" is often randomly chosen from the set of non-cases in a source population using similar general population sampling methods to those used in cross-sectional or field studies, with explicit efforts made to achieve comparability to the set of cases, individually or as a group (4, pp. 108-113). 
Samples to prospectively observe the same set of study subjects chosen from a heterogeneous population or a series of targeted population subgroups in cohort studies may utilize similar methods of sampling, though unlike studies done at a single point in time, the statistical integrity of the initial cohort sample can diminish over time with attrition in the sample. For all of these research designs the population about which one makes statements of finding may not be the same as the population sampled. To make the leap of inference from the sampled population to some other population requires a justifiable basis (15).
While the use of samples in epidemiologic research has evident advantages in reducing the cost of data gathering, it is important to understand both the statistical and practical implications before using it. Sampling is particularly advantageous in studies where the population is large and/or scattered, and resources are scarce. Studies based on well-constructed samples can indeed be done for a fraction of the cost of a complete enumeration of the population, although savings do not equate to the percent of the population that is not sampled, because of the cost of sampling. Related to its cost saving, sampling also enables the investigator to concentrate resources on a smaller group, and thus increase study validity since greater effort can be expended on achieving higher participation rates and increased quality.
Some drawbacks associated with sampling derive from the fact that one gathers data from a fraction of the study population about which we hope to learn, leading to error in the estimates due to the absence of knowledge about those not sampled. This error, arising out of the difference between the estimate from the sample and the population characteristic being estimated, creates a statistical uncertainty one strives to minimize. Reduction of this sampling error is accomplished through prudent decision making in developing the sample design, a bit of good fortune in choosing which population members are randomly selected, and the use of a plausible approach to learn about the population from the sample.
This paper considers several important statistical issues that arise in the course of designing and analyzing data from samples in epidemiological studies. In so doing, we briefly trace the origins of the two main philosophies of statistical inference in population-based studies. The reader is referred to several excellent historical reviews of sampling and inference from sample data (29, 36,42,43). We also examine the process of deciding which features to include in creating a statistically appropriate sample of the population. Ways in which the statistical integrity of the chosen sample can be compromised are then noted, and we point out those features of the sample that are important in analyzing the sample data. In general we see that resolution to these issues is rarely clear-cut, thus requiring a balancing of the relative merits of alternatives.
ISSUE 1: WHICH APPROACH TO STATISTICAL INFERENCE?
In his dictionary of epidemiology, Last (21) defines inference in statistics as "the development of generalization from sample data, usually with calculated degrees of uncertainty" (21, p. 65). This definition implies several things about the nature of statistical inference in epidemiological research. First, the concept of "generalization" suggests that there is an object to the inference: a population of some form to which the statistical statements from the sample apply. Another implication of this definition is that the sample data are the basis for any statements and that some philosophical framework is needed to accomplish the task. In a sense, the inference mechanism can be viewed as a "bridge" spanning the statistical chasm between information obtainable from the sample and information sought about the population. Finally, one notes that statistical "uncertainty" accompanies any statement made about the population from a sample, but that its level can often be quantified, thus providing a tangible indication of the sturdiness of the inferential bridge.
Design-Based Approach
The first recorded efforts to learn about populations from samples trace back over 200 years to attempts to estimate the population of France from a complete enumeration of a strategically chosen subset of geo-political districts (27). Stephan’s (43) informative account of the early history of sampling further notes that the use of randomization in choosing samples (i.e., probability sampling) first appeared in the latter part of the 19th century as an eventually controversial alternative to the purposive (nonrandom) methods of selection that had been the accepted norm. An increased need for sample surveys during the economic and political turmoil of the 1930s provided the impetus for the further development of a design-based approach to estimation, or making statistical inference from samples where strategically chosen forms of randomization are applied in choosing the sample (Figure 1a).
The design-based approach produces statements about the population that are largely based on specifically how randomization was applied to the sample (circled I) whose data are used to make the statements of statistical inference (circled II). Much of the early theory of sampling methods following this approach was developed by statisticians at the U.S. Bureau of the Census, motivated by earlier work by Cochran and Neyman, and then incorporated into the first major sampling texts (see, e.g., 5,11,18). It has been the prevailing approach to estimation from samples for over 50 years, as described in more recent texts (24,26,39).
When a sample is used to produce an estimate (c) of some characteristic (C) of the sampled population (e.g., a prevalence rate, the mean of a measurement, or a coefficient in some regression model), uncertainty in inference following a design-based approach is measured by the estimate’s mean squared error (MSE). For a basic understanding of the origin and meaning of an MSE, first note that the presumed reason for choosing a sample is to estimate C using an estimation strategy that produces an estimate (c). Also note that a design employing probability sampling can produce many different samples, and thus values of c, each with a corresponding statistical probability which in principle at least can be determined. Considering only the effect of randomized selection in choosing the sample, and noting that the difference between an estimate and the characteristic being estimated, (c - C), is the estimate’s sampling error, the expected value of the square of the sampling error among all possible samples (and estimates) is called the mean squared error of c; i.e.,
$$\mathrm{MSE}(c) = E\left[(c - C)^2\right] = \mathrm{Var}(c) + \left[\mathrm{Bias}(c)\right]^2, \qquad (1)$$
which as seen above is defined by the variance ($\mathrm{Var}(c)$) and bias ($\mathrm{Bias}(c)$) of c among all possible samples. When considering the combined effects of randomness from sampling and other sources (e.g., measurement error, nonresponse, etc.), the formula for MSE(c) becomes more complicated, although it is still dependent on any variances and biases linked to these random sources (23).
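The decomposition of the mean squared error into variance and squared bias can be checked numerically on a small example. The sketch below is illustrative only: it assumes a tiny hypothetical population, takes the population mean as the characteristic C, and treats every sample of size n drawn without replacement as equally likely. The function name `mse_decomposition` is introduced here for illustration and does not come from the paper.

```python
from itertools import combinations
from statistics import mean

def mse_decomposition(population, n, estimator):
    """Enumerate every equally likely sample of size n and return
    (mse, variance, bias) of the estimator, with the population mean as C."""
    C = mean(population)
    # One estimate c per possible sample under the assumed design.
    estimates = [estimator(sample) for sample in combinations(population, n)]
    expected = mean(estimates)                       # E(c)
    bias = expected - C                              # Bias(c) = E(c) - C
    variance = mean((c - expected) ** 2 for c in estimates)
    mse = mean((c - C) ** 2 for c in estimates)      # E[(c - C)^2]
    return mse, variance, bias

population = [2, 4, 6, 8, 10, 12]  # hypothetical measurements, not real data
mse, var, bias = mse_decomposition(population, 3, mean)
print(mse, var, bias)
```

Passing a deliberately poor estimator of the mean, such as `max`, shows a nonzero bias term while the identity MSE = variance + bias squared continues to hold.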
Implicit in the formulation of $\mathrm{MSE}(c)$ are two fundamental statistical qualities of the sample design one considers in evaluating the ability of the resulting sample to produce estimates of C. One is validity. A valid sample is achieved by using randomization in a way that each member of the study population has some calculable chance of being chosen, provided the known probabilities of sample selection are properly used in making estimates of C. The impact of a valid sample on the $\mathrm{MSE}(c)$ is the potential to avoid biases contributing to $\mathrm{Bias}(c)$, such as those arising out of mistakes in the estimation process (e.g., failure to use sample weights; see ISSUE 4), or due to incompleteness in the population lists used for sample selection (e.g., see coverage error under ISSUE 3). Another factor affecting the $\mathrm{MSE}(c)$ is the formulation chosen to estimate C, which through the use of ratio estimation, regression estimation, and other methods employing ancillary information can improve the quality of estimates (4; pp. 150-185).
Unfortunately, a valid sample design is not necessarily a "good" sample design in the broadest sense of the term. For instance, a sample of n=2 public school students in the U.S., chosen by randomly selecting two schools from a complete national list of public schools and then randomly choosing one student from each of the two school rosters, would be valid though clearly deficient in supplying information about the Nation’s students. This example suggests that validity is a necessary but not sufficient quality of a sample design. One also needs efficiency, measured by the stability of estimates among samples the design can produce; i.e., by the $\mathrm{Var}(c)$ component of the $\mathrm{MSE}(c)$, which is determined by sample size and by how effectively various selection strategies (e.g., stratification) are used. The use of probability sampling, combined with the inclusion of appropriate selection strategies to increase the statistical efficiency of estimates from the resulting sample, thus characterizes a "good" sample design.
Model-Based Approach
During the past 30 years various estimation strategies under a model-based approach to statistical inference have emerged as an alternative to the design-based perspective (3,38). As suggested by its name, this class of analysis methods depends on statistical models whose main purpose is to explain the origin of each set of key outcome measurements in the sampled population (Figure 1b). For each measurement set, the study population is seen as an outcome from an underlying random process (circled I), which is portrayed by an assumed statistical model. Explained another way, members of the study population, with their associated measurements, are viewed as a random sample from a more abstractly defined, infinite set of measurements often called a "superpopulation." Making statements about the study population involves first learning about the superpopulation using data from the observed sample to fit the assumed model (circled II), and then, if specific knowledge is sought for the study population, relying on the fitted model to predict data for sample nonmembers in making statements about the study population as a whole (circled III). Indicators of statistical quality like the mean squared error of estimates are also used in evaluating model-based estimates, although the "uncertainty" reflected in these MSEs is tied to the random process defined by the assumed model or other sources of randomness, rather than how randomization was used to choose the sample. Significantly, different underlying models may apply to each set of outcome measurements. Also, a characteristic of the superpopulation may be seen by some as the object of inference for some types of analysis (e.g., the coefficient for an independent variable in an underlying multivariate regression model).
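The fit-then-predict logic of the model-based approach can be sketched in a few lines. This is a minimal illustration under strong assumptions that are not part of the paper: the assumed model is a straight line relating the outcome to a single covariate, the covariate is known for every member of the study population, and the names `fit_line` and `model_based_mean` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def model_based_mean(x_all, sample_idx, y_sample):
    """Estimate the study population mean of y.

    x_all      -- covariate value for every population member (assumed known)
    sample_idx -- positions of the sampled members within x_all
    y_sample   -- observed outcomes, in the same order as sample_idx
    """
    # Fit the assumed model using sample data only.
    a, b = fit_line([x_all[i] for i in sample_idx], y_sample)
    # Predict the outcome for every member who was not sampled.
    sampled = set(sample_idx)
    predicted = [a + b * x_all[i] for i in range(len(x_all)) if i not in sampled]
    # Combine observed and predicted values for the whole study population.
    return (sum(y_sample) + sum(predicted)) / len(x_all)

# Hypothetical population of six members whose outcome is exactly 2x + 1.
x_all = [1, 2, 3, 4, 5, 6]
estimate = model_based_mean(x_all, sample_idx=[0, 2, 5], y_sample=[3, 7, 13])
print(estimate)
```

Because the hypothetical data follow the assumed line exactly, the estimate reproduces the population mean; with a misspecified model the predictions for nonsample members, and hence the estimate, would be off regardless of how the sample was drawn.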
It is important to note that methods of randomized sampling are used in connection with both design- and model-based approaches, but for somewhat different purposes. Features of the selection process in a well-conceived sample under a design-based perspective are chosen to generate the best possible representation of the study population given available resources, since these same features must be explicitly accounted for in learning about the population from the sample.
A well-represented sample is also valued under the model-based approach, although specific aspects of the design are not as directly relevant to the inference process, since a key use of sample data is to fit the underlying model to learn about the superpopulation, and possibly the study population. For this reason, one might view the emergence of model-based methods as a return to the statistical inference based on nonprobability sampling that was in common use prior to the advent of probability sampling in the early 1900s. Now, however, estimation methods based on highly sophisticated models in conjunction with the classical theory of statistics can be applied using high speed computers, thus making their use more plausible. The reality of statistical practice at present is that the widespread use of model-based methods remains somewhat elusive to mainstream practitioners because most software packages do not implement them.
Also important in choosing between inference approaches is one’s level of comfort in using models in the design and analysis of a sample, since models are used under both approaches, although somewhat differently (15). Practitioners following a design-based approach rely on models primarily as a vehicle to guide attempts to improve the efficiency of the sample design, although this use of models rarely affects the validity of the sample design. For instance, cost and MSE models are used in "Neyman allocation" to decide how large the sample sizes should be for various population groups; yet allocation results from this model are robust to modest departures from the model (4; pp. 115-117). Moreover, modeling by design-based analysts is done to adjust sample weights to at least partially offset the biasing effects of sample imbalance due to nonresponse (see, for instance, the weighting class adjustment in ISSUE 3, where propensity of response for a member of the sample is estimated from the response experience of other "similar" members of the sample). Failure in these models can compromise the bias reduction goal of these adjustments (16). By contrast, models used in conjunction with model-based methods are central to the validity of estimation results, thus making estimation vulnerable to model misspecification (12). Thus, if there are questions about the basic assumptions, the model-based methods may be inappropriate.
ISSUE 2: HOW SHOULD THE SAMPLE BE CHOSEN?
As noted previously, an effective sample design is key to the success of a population-based epidemiological study, regardless of the inference approach one follows. However, development of the sample design is typically given somewhat greater priority when the design-based perspective is followed, since the statistical quality of findings depends more heavily on how randomization is applied in choosing the sample. For this reason we largely adopt this perspective in this section of the paper, while recognizing that some of what we present also applies to the development of the sampling plan from the model-based perspective (e.g., the use of stratification).
Developing the sample design (i.e., the specific plan of action followed in choosing the sample) for an epidemiological study is largely a sequence of decisions that involves the study’s information goals and a variety of statistical "tools" that may be used to address specific statistical needs linked to those goals. This decision process is typically subject to a variety of constraints that are almost always fiscal in nature, but may also be institutional, logistical, or temporal. The goal of the design development process therefore is to find that configuration of sampling tools that will meet the scientific needs of the study, subject to any constraints. To be done well, the sample architect must be able to uncover the study’s needs and thoroughly understand the sampling tools that might be used, while successfully engaging the study’s decision makers in the design development process.
The design decision process in sampling is far from self-evident. First of all, choices in the decision process are rarely obvious. Thus, the science and art of decision making must converge to produce statistical optima with a measure of common sense. Implications (both positive and negative) must often be balanced in weighing the relative merits of alternatives in design decision making. Therefore, this process should not be seen so much as the search for a single "best" design, but for one among possibly several equally reasonable approaches. Gaps between the theoretical and actual effects of many design features make complete agreement on the statistical implications of many design alternatives virtually impossible (e.g., the effect of variable sample weights on the precision of estimates). Finally, while our presentation of how a sample design is developed might suggest that this process is purely sequential, some parts of the decision process may be iterative or overlap with other parts. For instance, the initial sampling plan may need to be revisited if it becomes apparent that the sampling frame called for in this plan is inadequate.
Two basic questions guide the development of a sampling plan. One asks what is to be learned about the study population from the sample, and the other (arising out of the first) asks which sampling tools are best used to meet the study’s information goals.
Information Gathering
The process of developing a sampling plan thus first requires information gathering about the study. This phase usually begins by defining the study’s scientific objectives. These typically dictate the type of research design to be used, and understanding these objectives allows the designer to identify population measures appropriate to the study goals and associated measurements that will figure into later design decisions. For example, investigating the efficacy of a new community-level nutrition intervention program in a field trial may imply several key outcome measurements like body mass, nutrient content, and portion size. The designer may then create or identify sources of existing information on these measurements, such as pretests or pilot studies done prior to the main part of the planned study. Data from prior research studies involving these or similar measurements may also be sought.
Once the study’s goals have been established, the next required design element is the definition of the study population. In design-based studies, this population is the same as the population to be sampled and the population to which statements from the sample are to be inferred. In model-based design development the sampled population and inference population may differ. Defining the study population usually requires a set of eligibility criteria that must be met. These criteria typically refer to location and duration of residency, health status, as well as other socio-demographic characteristics of relevance to the study. As a byproduct of information on the objectives and population, one determines the units of observation (i.e., establishing whose data will be collected in the study).
Various types of population data may be useful in developing the sample design. In addition to population-wide descriptive data to help the designer understand the size and variation of these measurements among members of the population at large, it is often helpful to profile subgroup differences in these measurements to identify potential correlates of the measurements. Measurement-specific information on intraclass correlation, indicating how internally similar various cluster groupings of the population (e.g., adults living in the same county) are relative to the population as a whole, may also help with later decisions on if and how best to sample these clusters (4).
Another important item of background information on the study is the definition of subgroups, or domains, of the study population that may serve as a particular focus during analysis. For instance, changes in the demographic profile of the study population may dictate the need for assurances that estimates from a planned cross-sectional study will be of sufficient statistical quality for a subgroup. If in addition to learning about the population as a whole there is scientific utility in learning about these population subgroups as well, one aims to know what percent of the population is in these domains and how estimates from these domains contribute to meeting study objectives.
Analysis of population domains may be used to examine differences among policy relevant groups (e.g., by poverty level, health insurance status, or geographic region). It may also provide an initial search for predictors of key outcome measurements. This information on important population domains often impacts design decisions related to sample size, to ensure that adequate sample sizes will be available in the analysis phase to learn about both the overall population and key domains.
Several other nonstatistical pieces of information are relevant to the information gathering phase of design development. Information on resources available to the study may help the sample designer to decide on the level of complexity that can be tolerated in the eventual sample design. Information on resources and timeframe together guides the designer in assessing the feasibility of various design options. By way of illustration, one might decide against using strata for stratified sampling for a study if the required information must be obtained by in-person interviews, requiring extensive training of study personnel. Finally, one would also find out about possible resources that could be used to construct the sampling frame from which the study sample is drawn, such as administrative records or population lists. Planning sample selection for a cohort study of first grade students, for example, might explore the suitability of lists of elementary schools if these students were to be selected through the schools they attend. Information to be gathered and tested through pilot studies would include the completeness, currency, content, and accessibility of the list, parental consent, migration and occupational mobility, among others.
The next general step in the design development process is to determine which sampling tools best contribute to meeting the information needs of the study. Several sampling texts provide excellent accounts of the theory underlying these tools (11,18,24,26,35,38). Our purpose here is to briefly examine the statistical utility and implications of several tools connected to three main features of a sample design: cluster sampling, stratification, and the device applying some form of randomization to select the sample. In so doing, we will also note how these tools might be configured into the sampling plan for epidemiological studies.
The decision making phase in the development of the sample design for a study begins by determining what mode is to be used in gathering data from the sample. The most common modes of data collection in epidemiological studies are mail, telephone, and in-person (28). Deciding on mode is done early, since resource and frame information often dictate it. For example, limited resources and the availability of a reasonably complete list of study population members often imply the need to do a mail survey, whereas the absence of a population list, together with a more generous set of resources needed to obtain more complex population measurements, usually points to data gathering by telephone or in-person since higher response rates can be expected.
Cluster Sampling?
Once the mode of data collection is established, one considers if, and if so how, cluster sampling is to be used. A cluster sample is one in which a sample of groups of population members is chosen as a stage of the sampling process. Clusters are typically defined by levels of some type of socio-political hierarchy (e.g., levels for a statewide population of first graders consisting of: first grade students, within classrooms, within elementary schools, within counties, within regions, and within the state).
Cluster sampling involves randomized sampling within one or more levels of a hierarchy of the study population. Each level sampled corresponds to a stage of the sample. For example, a two-stage statewide sample of first graders might involve sampling schools in the first stage, thus designating the school to be the first stage or primary sampling unit (PSU). The second stage might then consist of separately choosing a sample of first graders within each sample school, thus making the student its secondary sampling unit (SSU).
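The two-stage school example can be sketched as follows. This is only an illustration under assumptions not stated in the paper: simple random sampling without replacement is used at both stages, the frame of schools and rosters is hypothetical, and the function name `two_stage_sample` is introduced here.

```python
import random

def two_stage_sample(schools, n_psu, n_ssu, seed=0):
    """Select n_psu schools (PSUs) by simple random sampling, then up to
    n_ssu students (SSUs) within each selected school.

    schools: dict mapping school name -> list of student identifiers.
    Returns a dict mapping each sampled school to (students, probability),
    where probability is each selected student's overall chance of selection.
    """
    rng = random.Random(seed)                 # fixed seed for a reproducible draw
    psu_prob = n_psu / len(schools)           # stage 1 selection probability
    sample = {}
    for school in rng.sample(sorted(schools), n_psu):
        roster = schools[school]
        take = min(n_ssu, len(roster))
        students = rng.sample(roster, take)   # stage 2, within the school
        # Overall probability = P(school chosen) * P(student chosen | school).
        sample[school] = (students, psu_prob * take / len(roster))
    return sample

# Hypothetical frame: 5 schools with first grade rosters of different sizes.
frame = {f"school_{k}": [f"s{k}_{i}" for i in range(10 + 5 * k)] for k in range(5)}
chosen = two_stage_sample(frame, n_psu=2, n_ssu=4, seed=1)
for school, (students, prob) in chosen.items():
    print(school, students, round(prob, 4))
```

The selection probabilities returned alongside each school are what a design-based analysis would later invert to form sample weights.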
Sampling clusters can substantially reduce study costs if data gathering requires face-to-face contact over a geographically expansive study area. It also eliminates the need for a sampling frame consisting of a complete population list. In many populations such a list is expensive or impossible to create. Accompanying these practical advantages is an important statistical disadvantage of cluster sampling. This limitation is manifest as an increase in $\mathrm{Var}(c)$ due mainly to the tendency for members of the same cluster to be relatively more alike than members of the population at large. The extent of this within-cluster homogeneity is commonly measured using the intraclass correlation coefficient ($\rho$), which for most measurements and hierarchies is between 0.00 and 0.15.
Relative to a comparable unclustered sample of the same sample size, $\rho$ contributes to a multiplicative increase in $\mathrm{Var}(c)$. In a design with n respondents chosen from m sample PSUs, this effect on the statistical quality of an estimate (c) is known as its design effect, which is commonly modeled as
$$\mathrm{Deff}(c) = 1 + \rho\,(\bar{n} - 1),$$
where $\bar{n} = n/m$ is called the average sample cluster size. Thus, as long as clusters are relatively homogeneous with respect to the measurement corresponding to C (i.e., $\rho$ is positive), cluster sampling will always be less statistically efficient than a simple unclustered sample, with the amount of relative inefficiency directly related to the level of intra-cluster homogeneity. More intuitively, consider the case where $\rho = 1$ for the population measurement (i.e., where variation in the study measurement exists between clusters, but all members of the same cluster have exactly the same value). In this instance, relatively little information about the measurement in the population is obtained by sampling a few clusters, and increasing the number of population members selected within sample clusters adds no new information about the measurement. While most naturally occurring clusters are less homogeneous than this extreme case, the same principle applies. To the extent that resources will allow, the best cluster sample is one with a larger number of clusters and small sample cluster sizes (11; Vol. 1, p. 286).
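A short numerical sketch makes the trade-off between the number of clusters and the cluster size concrete. It assumes the commonly used design effect model, 1 + rho * (average cluster size - 1), with the average cluster size taken as n/m. The function names and the example figures are hypothetical, and the "effective sample size" (n divided by the design effect) is a standard companion quantity rather than something defined in the text above.

```python
def design_effect(rho, n, m):
    """Design effect for n respondents spread over m sampled clusters,
    using the model 1 + rho * (nbar - 1) with nbar = n / m."""
    nbar = n / m  # average sample cluster size
    return 1 + rho * (nbar - 1)

def effective_sample_size(rho, n, m):
    """Size of an unclustered sample with the same variance under the model."""
    return n / design_effect(rho, n, m)

# Same total of 1000 respondents, allocated two ways (illustrative values).
few_large = design_effect(rho=0.05, n=1000, m=50)    # 20 respondents per cluster
many_small = design_effect(rho=0.05, n=1000, m=200)  # 5 respondents per cluster
print(few_large, many_small)

# Extreme case rho = 1: each cluster contributes one observation's worth of
# information, so the effective sample size equals the number of clusters.
print(effective_sample_size(rho=1.0, n=1000, m=50))
```

With the same 1000 respondents, spreading the sample over four times as many clusters lowers the modeled design effect from about 1.95 to about 1.2, which is the sense in which more clusters with smaller cluster sizes is preferred.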
Stratified Sampling?
The second general feature that may be used in a sample design is stratification, the process of dividing a group of population members into non-overlapping subgroups called strata for the purpose of improving the efficiency of the sample design. Stratified sampling then means that the sample design has incorporated stratification somewhere in the selection process. Stratification is conceptually similar to cluster sampling in that it is applied to one or more levels of a population hierarchy. It can therefore be applied to sampling at any stage of a multi-stage cluster sample, although in multi-stage cluster sampling it is almost always used in choosing PSUs for the first stage of selection, since when properly applied its use offsets some of the losses in statistical efficiency caused by sampling clusters. It can also be applied to the sampling of individual population members in an unclustered sample.
The main difference between cluster sampling and stratified sampling is that groups are not sampled at a stratification level of the population hierarchy. For example, if stratification were applied at the "region" level of the statewide hierarchy of first grade students previously presented, that would imply that the two-stage sample of first graders would be separately chosen in each of the state’s regions.
Sample stratification is used to improve the statistical efficiency
of certain study estimates (c). It may be used to reduce the variance, Var(c),
for total population estimates by assuring adequate representation of all strata
in the sample. To best achieve improvement in the efficiency of total population
estimates one hopes for key study measurements to be statistically correlated
with the population characteristics used to define the strata. Improved efficiency
is achieved in this case since the sample is more likely to reflect the full
spectrum of individual measurements tied to the population characteristic (C),
thus tending to minimize the sampling error of estimates (c). Stratification
may also be used to improve the efficiency of estimates for relatively small
but important population subgroups by "oversampling" (i.e., designating
disproportionately large sampling rates to) the strata that define or contain
large percentages of them. In some designs stratification is used for both purposes.
In addition to deciding how to stratify the design, one must determine how the overall sample is to be allocated among strata. Four types of stratum allocation are commonly seen. When benefiting total population estimates is the most important use of stratification, either proportionate or optimum allocation is commonly used. All stratum-specific sampling rates (i.e., the proportion of population members in the sample) in proportionate allocation are the same, thus producing a sample with proportionately the same representation among strata as the population. This is a safe but not necessarily the best allocation for total population estimates: the efficiency is no worse than that of a comparable unstratified sample, but it may not be the best attainable.
Sampling rates in a design with optimum allocation among
strata are the most cost-efficient ones based on models for Var(c) and
the cost of doing the study. For a simple descriptive population characteristic
(C) to be estimated from a stratified sample, this means that one must apply
the largest sampling rates in the strata with the greatest diversity in the
measurements of relevance to the definition of C and the lowest cost of adding
another member to the sample. Conversely, strata with the least diversity and
highest unit costs are assigned the lowest sampling rates. In theory, then,
optimum allocation leads to more efficient total population estimates than proportionate
allocation, although in practice the difference in quality may not be great
if stratum costs and diversity are relatively similar among strata.
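The contrast between proportionate and optimum allocation can be sketched numerically. The sketch below assumes the standard textbook cost-variance rule, in which the stratum sample size is proportional to N_h x S_h / sqrt(c_h), with N_h the stratum size, S_h the stratum standard deviation of the measurement, and c_h the cost of adding one sample member. That rule, the function names, and the figures in the example are assumptions for illustration rather than formulas stated in this paper.

```python
from math import sqrt


def proportionate_allocation(n, sizes):
    """Give every stratum the same sampling rate n / N."""
    total = sum(sizes)
    return [n * size / total for size in sizes]


def optimum_allocation(n, sizes, std_devs, unit_costs):
    """Allocate n in proportion to N_h * S_h / sqrt(c_h) (assumed model)."""
    scores = [
        size * sd / sqrt(cost)
        for size, sd, cost in zip(sizes, std_devs, unit_costs)
    ]
    total = sum(scores)
    return [n * score / total for score in scores]


if __name__ == "__main__":
    sizes = [1000, 1000]        # hypothetical stratum sizes
    std_devs = [2.0, 1.0]       # stratum 1 is more diverse
    unit_costs = [1.0, 4.0]     # stratum 2 is more costly per member
    print(proportionate_allocation(100, sizes))
    print(optimum_allocation(100, sizes, std_devs, unit_costs))
```

When diversity and unit costs are the same in every stratum, the two allocations coincide, which is consistent with the remark that the practical difference may be small when strata are similar.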
While one can achieve close to the best precision of total population estimates if stratum-specific cost and diversity data are reasonably good, there is an element of statistical risk in the use of optimum allocation if the stratum data are substantially incorrect. The use of stratification in this worst-case situation can produce worse efficiency in total population estimates than not using stratification.
Another use of disproportionate allocation among strata is to facilitate an "oversampling" of one or more relatively small but important population domains. For instance, if the two-stage sample of first graders above is used for a cohort study to evaluate the long-term health effects of immunization one might need to focus on students with disabilities. If the percentage of students with disabilities is small, the sample size in this important subgroup may be too small to adequately achieve study goals. Hence, students with disabilities might be oversampled by stratifying students by disability status in sampling within selected schools, and applying relatively higher sampling rates in the disability stratum of each school. Oversampling in a design setting like this is likely to achieve the sample size increases that are sought since the group to be oversampled can be fully isolated in the strata that are formed. It may be much less effective in achieving dramatic sample size increases when the trait cannot be as effectively isolated (14).
Balanced allocation in stratified sampling involves designating the same sample size for each stratum. This allocation is generally used in designs with strata of unequal size where the main use of the sample data is to prepare stratum-specific estimates or to compare estimates among strata. The allocation is disproportionate in populations with unequal-sized strata, and thus may somewhat limit the efficiency of estimates for the total population, to the extent that the size and composition of the strata in reference to the main study measurements are not correlated. This loss in precision is tied to the effect of variation in sample weights that this allocation yields.
Which Randomization Device?
Decisions concerning where in the population hierarchy cluster sampling and stratification are to be applied determine the structure of the sample to be generated. Yet to be determined for each stage of sampling is precisely how randomization will be used in sampling units at the corresponding level of the hierarchy. Several options are available to the designer, although all require a list (or frame) of units to be sampled to implement the selection process. One selection device is simple random sampling with replacement (SRSWR), in which each selection at random from the list is replaced in the list before the next selection is made. Repeat selections are therefore possible in SRSWR. Unclustered samples chosen in this way most closely resemble the iid (independent and identically distributed) random samples assumed in much of classical statistical theory.
Another commonly used selection device is simple random sampling without replacement (SRSWOR), which is similar to SRSWR except that selections are not replaced on the list and the resulting sample thus has no repeat selections. The advantage of SRSWOR over SRSWR is higher statistical efficiency for study estimates due primarily to the so-called finite population correction, which for simple unclustered SRSWOR is 1-f, where f=n/N is the sampling rate when a sample of size n is chosen from a population of size N.
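The two simple random sampling devices, and the effect of the finite population correction 1-f, can be sketched with the Python standard library. The function names are hypothetical, and the variance formula is the usual one for a sample mean (sample variance divided by n, multiplied by 1-f when selection is without replacement), taken here as an assumption.

```python
import random
from statistics import variance


def srswr(frame, n, seed=None):
    """Simple random sampling with replacement: repeat selections are possible."""
    rng = random.Random(seed)
    return [rng.choice(frame) for _ in range(n)]


def srswor(frame, n, seed=None):
    """Simple random sampling without replacement: no repeat selections."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def estimated_variance_of_mean(sample, population_size=None):
    """Estimated variance of the sample mean.

    If population_size is given, the finite population correction
    1 - f, with f = n / N, is applied (the SRSWOR case).
    """
    n = len(sample)
    result = variance(sample) / n
    if population_size is not None:
        result *= 1 - n / population_size
    return result


if __name__ == "__main__":
    frame = list(range(1, 101))            # a hypothetical frame of N = 100 units
    sample = srswor(frame, 20, seed=1)
    print(estimated_variance_of_mean(sample))        # ignoring the correction
    print(estimated_variance_of_mean(sample, 100))   # 1 - f = 0.8 applied
```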
Another class of selection devices is based on selection of clusters with probabilities proportional to size (PPS), where "size" refers to the number of population members in each cluster. Since actual size measures are usually unknown, hopefully accurate "measures of size" (Mos) are used instead. Several with- and without-replacement PPS sampling methods have been proposed (4). PPS sampling is generally used to select clusters in all but the last stage of multi-stage samples, particularly when the number of population members varies considerably among clusters at all levels. One uses PPS sampling mainly to offset reductions in estimate efficiency that can result from applying SRS devices to clusters of unequal size. Common byproducts of PPS sampling in a multi-stage design are equal selection probabilities for all chosen population members (a statistical advantage), and roughly equal sample sizes in each sample cluster (a practical advantage).
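The equal-probability byproduct of PPS sampling can be checked with simple arithmetic. The sketch below assumes a two-stage design in which a clusters are chosen with probability a x Mos_i / (sum of Mos) and a fixed number b of members is then chosen by SRSWOR within each selected cluster. The function names and cluster sizes are hypothetical, and no particular PPS selection algorithm is implemented; only the selection probabilities are computed.

```python
def pps_cluster_probabilities(measures_of_size, n_clusters):
    """First-stage selection probabilities proportional to the measure of size."""
    total = sum(measures_of_size)
    probs = [n_clusters * mos / total for mos in measures_of_size]
    if max(probs) > 1:
        raise ValueError("a cluster is too large to be sampled PPS at this rate")
    return probs


def overall_selection_probabilities(measures_of_size, actual_sizes,
                                    n_clusters, members_per_cluster):
    """Product of the first-stage and within-cluster selection probabilities."""
    first_stage = pps_cluster_probabilities(measures_of_size, n_clusters)
    return [
        p * min(1.0, members_per_cluster / size)
        for p, size in zip(first_stage, actual_sizes)
    ]


if __name__ == "__main__":
    sizes = [200, 400, 600, 800]   # hypothetical cluster sizes, Mos assumed exact
    print(overall_selection_probabilities(sizes, sizes, 2, 50))
```

When the measures of size are exact, every member has the same overall probability a x b / (sum of Mos), and every sample cluster contributes the same b members. Inaccurate measures of size would make the overall probabilities unequal.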
Finally, systematic sampling is a relatively simple
selection device that is often used when other devices are too complicated to
use. Selection involves choosing a random start, finding the corresponding sampling
unit on the frame, and then selecting other sampling units on the frame by sequentially
applying an interval of constant length (i.e., roughly the inverse of the intended
sampling rate) after the random start until the entire frame has been traversed.
Unlike most other randomization devices in sampling, where the order of the
sampling frame is irrelevant, frame order is important and can be beneficial
in samples using systematic selection. Indeed, estimates from a sample employing
systematic selection from a frame that is ordered by some set of criteria have
roughly the same statistical efficiency as a proportionately allocated stratified
sample using strata defined by the ordering criteria. A systematic sample chosen in this way
is said to be implicitly stratified by those criteria. The primary drawback
to systematic sampling is that unbiased estimates of Var(c) are
unavailable, thus necessitating the use of approximate methods which tend to
understate the actual efficiency of estimates (18). Since these variance estimation
methods follow the order of selection, this process information must be retained
with the sample data.
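The selection steps for systematic sampling can be sketched as follows. This is a minimal version using a fractional interval N/n, which is one common variant assumed here; the function name and the example frame are hypothetical.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Select n units: a random start, then a constant interval along the frame."""
    rng = random.Random(seed)
    interval = len(frame) / n              # inverse of the intended sampling rate
    start = rng.random() * interval        # random start within the first interval
    last = len(frame) - 1
    # min() guards against floating-point rounding at the end of the frame.
    return [frame[min(int(start + k * interval), last)] for k in range(n)]


if __name__ == "__main__":
    # A frame of 100 units already ordered by some criterion (e.g., by region).
    ordered_frame = list(range(100))
    print(systematic_sample(ordered_frame, 10, seed=3))
```

Because the frame is ordered, each consecutive block of ten units contributes exactly one selection, which is the implicit stratification described above.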
How Large a Sample?
Deciding on sample size is a design issue with either a statistical or practical solution, but with statistical implications in any event. This decision is usually made after the background information has been collected and a degree of closure has been reached on the issues of cluster sampling and stratification. The statistical solution requires first that standards of statistical efficiency be established in collaboration with study leaders for each important estimate to be generated and test to be run during analysis.
Various measures of statistical efficiency may
be considered, including the variance, standard error, margin of error, and
power to detect significance, although most are mathematically related to the
variance of study estimates, Var(c),
which in turn can be written as a function of the desired sample size (n) for
most designs. Solving for n in the assumed efficiency model leads to the result.
For example, one might require a margin of error of 0.05 on the overall estimated
rate (c) of immunization coverage in the baseline of the statewide cohort study
mentioned earlier. If a stratified two stage cluster sample is to be used in
that study, a reasonable model for the margin of error would be
$ME(c) = t \sqrt{Deff \cdot C(1-C)/n}$,
where t is the standard normal value for the chosen confidence level of the margin of error,
Deff is the design effect of the sample, and C is the conjectured
actual coverage rate in the population. Solving for n here we have
$n = t^2 \cdot Deff \cdot C(1-C) / ME(c)^2 = (1.96)^2 (1.50)(0.70)(0.30) / (0.05)^2 \approx 484$,
assuming t=1.96 for 95% confidence, C=0.70, and Deff=1.50.
If the quality standards are modified slightly to require the same margin of
error, but for a region that comprises 25% of the state’s population, the recommended
total sample size must be increased to
$n \approx 484 / 0.25 = 1{,}936$.
When resources determine how large a sample a study can afford,
the usual strategy is to create a simple formula for the total cost of the study
as a function of n, and then solve for n as above. Even when the proposed sample
size is based on the study’s budget, it is wise to project the level of statistical
quality on important study estimates, to be assured that the level of efficiency
one can afford is sufficient. For example, if we find that the budget will only
support n=300 baseline respondents, this implies that
$ME(c) = 1.96 \sqrt{(1.50)(0.70)(0.30)/300} \approx 0.064$,
which may still be acceptable statistical efficiency for the study.
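The sample size arithmetic in this example can be reproduced in a few lines. The sketch assumes a margin of error model of the form t x sqrt(Deff x C(1-C)/n), where the multiplier 1.50 is interpreted as a design effect; the symbol name Deff, the function names, and the assumption that a region receives its proportionate 25% share of the total sample are ours.

```python
from math import sqrt


def margin_of_error(n, coverage, deff, t=1.96):
    """Margin of error of an estimated rate under the assumed model."""
    return t * sqrt(deff * coverage * (1 - coverage) / n)


def required_sample_size(target_margin, coverage, deff, t=1.96):
    """Solve the margin of error model for n."""
    return t ** 2 * deff * coverage * (1 - coverage) / target_margin ** 2


if __name__ == "__main__":
    n_state = required_sample_size(0.05, coverage=0.70, deff=1.50)
    print(round(n_state))             # statewide estimate
    print(round(n_state / 0.25))      # same standard for a region holding 25%
    print(round(margin_of_error(300, 0.70, 1.50), 3))   # budget-driven n = 300
```

Rounding up rather than to the nearest integer would be the more conservative choice in practice.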
While the sample design in its final form consists of details on how to choose the study sample, the actual selection process may depart somewhat from this plan. Several facets of the study implementation may cause this departure. For example, attrition in the selected sample may exceed that which was expected, thus requiring midcourse adjustments in sampling rates or adoption of a random substitution plan to replace nonrespondents. The original plan for constructing a PSU sampling frame may be revised to deal with an unexpected number of virtually empty clusters, or the definitions of strata used in selecting the first stage sample may be changed to improve the variance reduction effect of the sampling strata. These modifications can challenge one’s ability to effectively use data from the sample in learning about the population, thus demanding ways to deal with their implications. We next describe some ways to deal with these practicalities.
ISSUE 3: WHAT ABOUT SAMPLE INTEGRITY?
Some parts of the intersection of sample design and the study implementation impact the efficiency of study estimates, while others strike at the essential validity of the resulting sample-generated estimates. We turn our attention now to the effects of the sample selection process on the statistical integrity of the sample. In so doing we identify the main sources of lost integrity and some common remedies.
Statistical error in study estimates not due to sampling can be grouped into three main categories: error due to coverage problems with the sampling frame, error arising from attrition in the sample, and error due to problems in the study measurement instruments (10,23). While measurement error is an important source of survey error, it does not involve problems that uniquely influence the ability of the sample to function as the population in miniature. On the other hand, frame problems can influence sample representation if the sample is selected from a universe other than the study population, while nonresponse can affect representation by creating selective imbalance in the sample on traits for which differential rates of response occur.
Frame Problems
Several troublesome problems can arise with the frame that is used to choose a sample. All reflect a lack of correspondence between entries on the frame and the individual members of the study population. Each addresses the ability of the frame and thus the sample to "cover" the population, which is why coverage error is another way of referring to problems with the frame in epidemiological studies.
Most frame problems fall into one of three categories. One
type of problem is undercoverage, where some members of the population
are not linked to any entry on the frame. Undercoverage is usually the most
serious problem and thus most widely recognized, since it can contribute to
significant increases in both Bias(c) and Var(c), particularly
the former. It is a well-known problem in cross-sectional and cohort studies
that gather data by telephone, where persons without a telephone, or without
a phone directory listing, will be excluded. The primary manifestation of undercoverage
is imbalance in the sample due to differential rates of coverage by the frame.
For instance, sample coverage in telephone samples aimed at the general population
tends to be correlated with household income, race/ethnicity, education, and
employment status (9). Bias due to undercoverage is inversely related to the
sample coverage rate (i.e., the proportion of the population that is
linked to the frame) and directly related to the aggregate difference (in the
study measurement corresponding to C) between those who are and are not covered
by the frame.
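The relationship between coverage rate, the covered versus not covered difference, and bias can be sketched for an estimated mean or rate. The sketch assumes the estimate reflects only the covered part of the population, so that its bias is (1 - coverage rate) x (mean among the covered - mean among the not covered). The function names and the figures in the example are hypothetical.

```python
def undercoverage_bias(coverage_rate, mean_covered, mean_not_covered):
    """Bias of an estimate that reflects only the covered part of the population."""
    return (1 - coverage_rate) * (mean_covered - mean_not_covered)


def population_mean(coverage_rate, mean_covered, mean_not_covered):
    """Mean over the whole study population, covered and not covered."""
    return (coverage_rate * mean_covered
            + (1 - coverage_rate) * mean_not_covered)


if __name__ == "__main__":
    # Hypothetical telephone frame: 94% coverage, immunization rate of 0.72
    # among those covered and 0.55 among those not covered.
    print(round(undercoverage_bias(0.94, 0.72, 0.55), 4))
    print(round(0.72 - population_mean(0.94, 0.72, 0.55), 4))   # same quantity
```

The same expression applies to nonresponse bias under a deterministic view of response, with the response rate in place of the coverage rate and respondents and nonrespondents in place of the covered and not covered.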
A second frame problem is overcoverage, where some entries
on the frame are linked to nonmembers of the population. These "ineligibles"
in the selected sample are usually recognized and become a source of sample
attrition, though the statistical effect of their presence is mainly to reduce
the sample size and thus increase Var(c).
They do not contribute to Bias(c),
and thus do not invalidate an otherwise valid sample. Movers and decedents are examples
of ineligibles in many cohort studies.
The third type of frame difficulty is multiplicity, in
which a member of the population is linked to more than one entry on the frame,
thus giving it multiple chances to be chosen. Patients sampled through health
care providers in case-control studies represent one type of sample with multiplicity
present. Like undercoverage, frames with multiplicity present can lead to increases
in both Bias(c) and Var(c). Bias is
increased in estimates if those with and without multiple links differ in the
aggregate with respect to the study measurement tied to C, and nothing is done
to compensate for the multiplicity. Variance may increase due to variable weights
if the analyst uses a so-called multiplicity estimator, in which the data are
specifically weighted to account for increases in the sample selection probabilities
for those with multiple links to the frame (41).
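One simple form of multiplicity adjustment divides each member's weight by the number of frame entries to which the member is linked. The sketch below assumes that all frame entries have the same selection probability and that sampling rates are small, so that the selection probability is roughly proportional to the number of links. It is an illustration with hypothetical names and data, not necessarily the estimator described in reference 41.

```python
def multiplicity_weights(base_weights, link_counts):
    """Divide each weight by the member's number of links to the frame."""
    return [w / k for w, k in zip(base_weights, link_counts)]


def weighted_mean(values, weights):
    """Weighted estimate of a population mean or rate."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical patients sampled through providers: outcome indicator,
    # base weight, and number of providers through which each could be chosen.
    outcome = [1, 0, 1, 1]
    base = [10.0, 10.0, 10.0, 10.0]
    links = [1, 1, 2, 5]
    adjusted = multiplicity_weights(base, links)
    print(weighted_mean(outcome, base))       # ignores multiplicity
    print(weighted_mean(outcome, adjusted))   # compensates for it
```

The adjusted weights vary even though the base weights do not, which is the source of the variance increase mentioned above.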
Nonresponse
Nonresponse is another practical aspect of a study that can
lead to loss in sample integrity by creating imbalance in the outcome of an
otherwise well-conceived sample design. This loss occurs mainly because response
rates tend to differ by certain types of study characteristics such as race/ethnicity,
age, income, education, population density, etc. (10). The primary statistical
manifestation of nonresponse is an increase in Bias(c),
although Var(c)
can also increase if steps are not taken to offset resulting reductions in the
sample size and if nonresponse is viewed as a somewhat random phenomenon (23,
pp. 134-137). If one views population members as certain to respond or not respond,
bias due to nonresponse is inversely related to the sample response rate
(i.e., the proportion of the study-eligible members of the sample who respond)
and directly related to the aggregate population difference (in the study measurement
corresponding to C) between respondents and nonrespondents.
Frame and nonresponse problems are handled somewhat similarly. Preventing each type is important and possible, but prevention is more commonly applied to nonresponse. Correcting mistakes on large frames can be relatively costly, while improving response rates is often feasible through the use of a variety of generally effective preventive strategies (e.g., use of incentives, endorsements, and additional attempts to gain participation). Coverage and nonresponse problems can also be remedied by special efforts to measure the bias effects (e.g., through more intensive solicitation to gather study data from a sample of initial nonrespondents to the baseline round of a cohort study). The last category of compensatory strategies involves making statistical adjustments to the sample data as part of the process of generating sample weights.
Sample Weights
A sample weight is a number tied to a member of a sample
that is intended to reflect the inverse of the member’s selection probability,
which is calculable in any probability sample. The weight ($W_i$)
for the i-th sample member can also be interpreted
as the number of population members represented by that member. A single set
of these weights is prepared for analyses involving data gathered for the sample
to whom the weights apply.
The use of weights in preparing estimates from samples traces
back nearly 50 years to the work of Horvitz and Thompson (13), who first noted
that unbiased estimates of population totals could be obtained by weighting
the data in performing the analysis (i.e., in effect multiplying each sample
measurement by its corresponding weight in aggregating the data to estimate
C). For example, the sum of the $W_i$
among all sample members is an unbiased estimate of the size of the study
population (N), and the weighted sum of a population measurement (i.e., the
product of measurement times weight summed over all sample members) is an unbiased
estimate of the population total for that measurement.
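The unbiasedness of the weighted sum can be verified directly on a very small population by averaging the estimate over every possible sample. The sketch below assumes SRSWOR, for which each member's selection probability is n/N; the function names and the four-member population are hypothetical.

```python
from itertools import combinations


def weighted_total(values, selection_probs):
    """Sum of each measurement times the inverse of its selection probability."""
    return sum(y / p for y, p in zip(values, selection_probs))


def average_estimate_over_all_samples(population, n):
    """Average the weighted total over every equally likely SRSWOR sample."""
    prob = n / len(population)           # selection probability of each member
    estimates = [
        weighted_total(sample, [prob] * n)
        for sample in combinations(population, n)
    ]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    population = [3, 5, 8, 12]           # true population total is 28
    print(average_estimate_over_all_samples(population, 2))
```

Individual samples give totals above or below the true value, but their average over all possible samples equals it, which is what unbiasedness means in the design-based sense.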
The process of calculating sample weights becomes a part of
the strategy of dealing with frame and nonresponse problems when, as is frequently
the case, the final set of weights is adjusted for the imbalance resulting
from these problems. A probability of selection ($\pi_i$)
is first calculated for each sample member, based specifically on how randomization
was used to choose the sample. For example, this probability would be the product
of the stage-specific selection probabilities in the two-stage cluster samples
used in the cohort study on childhood immunization. The reciprocal of $\pi_i$
becomes a provisional weight for the sample member; i.e.,
$W_{0i} = 1/\pi_i$.
The first of two multiplicative adjustments is then made using
estimates of the likelihood, or propensity, of response ($\hat{\rho}_i$)
for each sample member. Intended as at least partial compensation for nonresponse
bias exclusively, the first adjusted weight is calculated as
$W_{1i} = W_{0i} / \hat{\rho}_i$,
where $W_{0i}$ is the provisional weight.
Estimates of response propensity may be response rates in strategically formed
subgroups of which members of the original sample are a part (16), or
predicted response outcomes from a multivariate regression model (8).
Ideally, one hopes to form internally homogeneous adjustment subgroups so that
respondent and nonrespondent portions have similar values for the population
characteristic of interest, since reduction in estimation bias occurs to the
extent that this type of homogeneity is achieved (16).
To further compensate for any remaining imbalance due to nonresponse
and other imbalance arising out of random selection and any other sampling problems
attributable to the frame, the adjusted weights ($W_{1i}$)
are post-stratified to the best available distribution of population
counts by a joint classification of the study population according to one or
more characteristics that are known to be correlated with key study measurements
(e.g., age, race/ethnicity, and gender). This step amounts to calculating the
final adjusted weight as
$W_i = W_{1i} [N_g / \hat{N}_g]$,
where the numerator of the post-stratification adjustment (in brackets) is the
external population count for the group (g) of which the i-th sample member
is a part, and the denominator is the estimate of $N_g$
obtained by summing $W_{1i}$
over all sample members in that group. Calculated values of $W_i$
are added to the sample data file for analysis, thereby completing the process.
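The three steps of weight construction (provisional weight, nonresponse adjustment, post-stratification) can be sketched as one short routine. The sketch assumes response propensity is estimated by the weighted response rate within each adjustment subgroup, which is only one of the options mentioned above, and that every subgroup contains at least one respondent. The record layout, key names, and function name are hypothetical.

```python
from collections import defaultdict


def build_final_weights(sample, population_counts):
    """Return respondents with provisional, adjusted, and final weights.

    Each sample member is a dict with these (hypothetical) keys:
      'prob'       selection probability
      'cell'       nonresponse adjustment subgroup
      'group'      post-stratification group g
      'responded'  True if the member took part in the study
    """
    # Step 1: provisional weight = reciprocal of the selection probability.
    members = [dict(m, w0=1.0 / m["prob"]) for m in sample]

    # Step 2: divide by the estimated response propensity, taken here as the
    # weighted response rate in the member's adjustment subgroup.
    selected = defaultdict(float)
    responded = defaultdict(float)
    for m in members:
        selected[m["cell"]] += m["w0"]
        if m["responded"]:
            responded[m["cell"]] += m["w0"]
    respondents = [m for m in members if m["responded"]]
    for m in respondents:
        propensity = responded[m["cell"]] / selected[m["cell"]]
        m["w1"] = m["w0"] / propensity

    # Step 3: post-stratify so the weights in each group sum to the
    # external population count for that group.
    estimated = defaultdict(float)
    for m in respondents:
        estimated[m["group"]] += m["w1"]
    for m in respondents:
        m["w"] = m["w1"] * population_counts[m["group"]] / estimated[m["group"]]
    return respondents
```

After the last step the final weights in each post-stratification group add up to the external count for that group, whatever the earlier adjustments were.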
It is important to note that the two weight adjustments contribute to reduce estimation biases that may occur because of imbalance in the sample, but they rarely eliminate these biases altogether. Thus, the investigator may rely on other methods for dealing with imperfect frames and nonresponse, such as sampling a portion of the nonrespondents and applying extraordinary means to gather data from them (23, pp. 177-181).
One final sample integrity issue has to do with the retention of structural identifiers of the sample design on the data files that are used for analysis. Since cluster selection, the use of stratification, and variation in selection probabilities may all be important in making statements about the study population from the sample, retaining complete selection files and sample identifiers for clusters sampled and strata defined at each selection stage is essential to facilitate the accommodation of these features in subsequent analysis. Failure to do so can make learning about the population more difficult, especially when analysis methods following a design-based approach are used. Given the presence of this sampling information, which of it is essential in successfully building the bridge between sample and population?
ISSUE 4: WHICH SAMPLING FEATURES ARE IMPORTANT IN ANALYSIS?
As we have seen, sample designs for several types of epidemiological studies may include any of several features that impact the statistical quality of estimates from the sample, including cluster sampling, stratification, and varying selection probabilities among sample members (leading to the computation of sample weights). Employing these features, however, yields samples whose data are neither independent nor identically distributed. The analyst of such data must then establish to what extent these features should be accommodated in analysis.
Much of the design- and model-based theoretical work in survey statistics during the past two decades has focused on analysis of data from samples with complex designs. Several helpful reviews have been written on this topic (15, 31, 32, 36). While earlier work during this period displayed a somewhat dramatic divergence in opinion concerning the importance of design features (design-based advocates said all features were important; model-based proponents said all features can be ignored), more recent results seem to suggest a convergence of views. Model-based statisticians now link weighted estimates with model-based estimation, and recognizing that error residuals in regression models may differ among clusters and strata has also led modelers to seek ways to account for these design features in analysis. Most, in fact, advocate the use of stratification in sample selection. During this same period, design-based analysts have widely used models in detailing their use of stratification and other design features. They have also recognized that their estimates of coefficients in regression modeling may not address the underlying interrelationship among variables in their models, especially when the population size is not large. Another important part of this convergence is a growing consensus concerning the importance of incorporating features of the sample design in the analyses. Research and debate continue, however, as to precisely which features should be incorporated and how.
The design-based perspective considers the use of randomization
in choosing the sample as the primary basis for estimating the characteristics
of the study population. In this view, the way one formulates the estimate (c)
of the population characteristic (C) depends on how the sample is chosen. Thus,
the size of Var(c) and Bias(c) in MSE(c)
will similarly depend on the nature of the sample design, as will appropriate
estimates (var(c))
of the variance of c from the sample. Since learning about the population by
means of confidence intervals and tests of hypothesis requires both c and var(c),
it stands to reason that analysis from the design-based perspective must consider
cluster sampling, stratification, and sample weights. Several empirical comparisons
of the statistical effects of design specification have been reported (1,20,42).
The need to incorporate all design features has stimulated the development of a number
of widely available analysis software packages following several different approaches
(46) to obtaining var(c)
from a design-based perspective. A number of recent reviews of these packages
have been published (e.g., see 2, 6).
Incorporating design features is especially important in descriptive
profiles and simple comparative analyses, such as those found in cross-sectional
studies, field trials, cohort studies, and others where results of the study
are most applicable to the sampled population. In the design-based view, ignoring
cluster sampling tends to understate var(c)
and overstate significance levels in tests of hypothesis, whereas ignoring stratification
has the opposite effect. Weights are also needed in computing both c and var(c),
although they are generally less important for the former, except when distinctive
segments of the population are oversampled. In this case ignoring weights (i.e.,
weights are considered constant for all sample members) contributes to biased
estimates in the direction of study measurements in the oversampled subgroup.
Descriptive analysis from a model-based perspective may be relevant for epidemiological
studies with a less explicitly defined study population, such as those that
might occur in case-control studies and clinical trials, where the sample is
not seen as representing any particular group.
The size of Var(c)
may increase depending on the amount that weights vary. When weights and study
measurements are largely uncorrelated, one simple model for the multiplicative
effect of variable weights on Var(c)
is $1 + s_W^2 / \bar{W}^2$, where $s_W^2$
and $\bar{W}$
are, respectively, the variance and mean of the sample weights (18, pp. 427-429).
To reduce this adverse effect on study estimates, widely variable weights are
sometimes "trimmed" by censoring and redistributing the original set
of weights. Unfortunately, weight trimming can also increase Bias(c),
so the approach to trimming is often dictated by minimizing impact on MSE(c)
(34).
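The effect of weight variation, and the way trimming reduces it, can be sketched numerically. The sketch assumes the multiplicative model of one plus the squared coefficient of variation of the weights, computed here with the population form of the variance. The trimming rule shown (cap the weights, then rescale all of them so their sum is preserved) is one simple variant chosen for illustration and is not the procedure of reference 34; function names and weights are hypothetical.

```python
from statistics import mean, pvariance


def weight_variation_effect(weights):
    """One plus the variance of the weights over their squared mean."""
    return 1 + pvariance(weights) / mean(weights) ** 2


def trim_weights(weights, cap):
    """Censor weights at cap, then rescale so the total weight is unchanged."""
    total = sum(weights)
    censored = [min(w, cap) for w in weights]
    scale = total / sum(censored)
    return [w * scale for w in censored]


if __name__ == "__main__":
    weights = [1.0, 1.0, 1.0, 5.0]          # hypothetical, one extreme weight
    trimmed = trim_weights(weights, cap=3.0)
    print(weight_variation_effect(weights))
    print(round(weight_variation_effect(trimmed), 3))
    print(round(sum(trimmed), 6))           # total weight preserved
```

The reduction in the variance effect comes at the possible cost of bias, since the trimmed weights no longer reflect the original selection probabilities.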
For regression modeling one must consider whether coefficients
are attributes of the study population or the underlying population. In design-based
model fitting with inference to the study population all three key features
(i.e., weights, cluster sampling, and stratification) are needed, since failing
to incorporate them, particularly cluster sampling, tends to understate var(c)
and overstate significance from 0 in tests of coefficients (40). On the other
hand, in regression analysis taking a purely model-based path to the underlying
population, all three features can be ignored, provided one can justify the
assumption that error residuals associated with the assumed model do not depend
on the size of weights, or the cluster or stratum of membership. This assumption
may work in less tightly defined study populations (e.g., in case-control studies)
but is often unjustified in studies with more definitive inference destinations,
thus necessitating some form of design accommodation. Some results suggest that
at the very least weights should be used as a guard against model misspecification
(7, 30). Others suggest using cluster and stratum identifiers as control variables
in the model may be useful in fitting the underlying model (22, 32).
CONCLUDING REMARKS
Regardless of the philosophical approach one takes to building the bridge between sample and study population, how the sample is designed, selected, and accounted for in analysis are important elements of any population-based epidemiological study. While theoretically based principles of sample design are well-established, we continue to examine the foundations of inference from the resulting samples.
In this paper we have described the basic elements of a sample design and how one goes about combining these elements into an effective sampling plan. In the process, we have seen that many of the decisions one makes in producing and dealing with the sample depend on the path of statistical inference and are therefore less clear-cut. While consensus on precisely how one reflects the design in analysis has not been reached, it is generally agreed that certain features of the sample design are relevant to the task, the sample weights in particular. As more is sought from the results of population-based studies, new insights will be needed on how best to learn from the complex sample designs used in epidemiology.
Literature Cited