
Title: UCSF - Case Control Study Design


Slide 1: Case-Control Design

The phrase “study base” was first used by the epidemiologist Olli Miettinen, who is one of the main theorists of current epidemiological thinking. Others have proposed different language. The most common alternative is probably “reference population,” the phrase used in our text, or “referent population,” as used in some other texts.

Slide 2: Case-Control Key Concept #1

The main advantage of case-control designs is that they allow you to sample the experience of the study base most efficiently. Stated in other words, case-control designs allow you to make measurements on far fewer subjects than cohort studies but still get the same answer. The reason to do this is to conserve resources, something that is becoming more and more important these days as funding is drying up. A typical example is when expensive testing on stored biological samples is required for an analysis. It is often prohibitively costly to test everyone in the cohort.

Slide 3: Case-Control: Sampling Controls Within a Cohort

Understanding these three sampling methods is important because they will be linked to measures of association. Each sampling method will be linked to a particular measure of association.

In incidence density sampling the selection of controls is governed by the diagnoses of cases. Every time a case is diagnosed one or more controls are selected from other members of the cohort who, at that time, do not have the diagnosis. The term incidence density comes from the fact that the time of follow-up and the incidence of new disease are involved in determining eligible controls.

In our example of conserving resources by not testing all of the cohort members, the investigator would test stored biological samples only for the cases and those subjects chosen as controls. If the predictor variable were a questionnaire item everyone in the cohort had already answered, there wouldn’t be any point in selecting controls as the data is already available on the entire cohort.

The textbook (and a number of others) calls this design a “nested case-control study,” but “nested” is an imprecise term. It seems that it should more properly refer to any case-control study that selects controls from within a cohort study. In other words, all three of the sampling methods we are describing can be viewed as “nested” within a cohort.

Slide 4: Case-Control: Incidence Density Sampling Design

Because the controls are selected randomly from those still under follow-up at the time a case is diagnosed, the sample of controls provides the same estimate of an association between a predictor and the outcome that one would obtain if all of those still under follow-up were used. The only difference between measuring the predictor (say, a blood test) on a random sample and on everyone would be the random error introduced by sampling. So the association between the predictor and the outcome will be the same plus or minus a random error introduced by sampling.
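The selection rule for incidence density sampling can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of any particular study: the record layout (`id`, `entry`, `exit`, `is_case`), the function name, and the tiny four-person cohort are all assumptions made up for the example. Here `exit` is the time of diagnosis for a case and the time follow-up ended for everyone else.

```python
import random


def incidence_density_sample(cohort, controls_per_case=1, seed=0):
    """Pick controls for each case from those still under follow-up at diagnosis.

    cohort: list of dicts with hypothetical keys
        id      - subject identifier
        entry   - time the subject entered follow-up
        exit    - time of diagnosis (cases) or end of follow-up (non-cases)
        is_case - True if the subject was diagnosed during follow-up
    """
    rng = random.Random(seed)
    matched_sets = []
    cases = sorted((s for s in cohort if s["is_case"]), key=lambda s: s["exit"])
    for case in cases:
        t = case["exit"]
        # Risk set: everyone else who is under follow-up and undiagnosed at time t.
        # A subject who becomes a case later than t is still eligible here.
        risk_set = [
            s for s in cohort
            if s["id"] != case["id"] and s["entry"] <= t < s["exit"]
        ]
        k = min(controls_per_case, len(risk_set))
        controls = rng.sample(risk_set, k)
        matched_sets.append(
            {"case_id": case["id"], "time": t,
             "control_ids": [c["id"] for c in controls]}
        )
    return matched_sets


# Hypothetical cohort: A and B are cases, C is lost early, D enrolls late.
cohort = [
    {"id": "A", "entry": 0, "exit": 2, "is_case": True},
    {"id": "B", "entry": 0, "exit": 5, "is_case": True},
    {"id": "C", "entry": 0, "exit": 1, "is_case": False},
    {"id": "D", "entry": 3, "exit": 6, "is_case": False},
]
matched_sets = incidence_density_sample(cohort)
print(matched_sets)
```

In this toy cohort the only eligible control when A is diagnosed is B, who later becomes a case. That is expected behavior under this design, not an error.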

A somewhat subtler point that has to be considered in practice is what is meant by “at the time a case is diagnosed.” To implement this requires some definition of a time frame that is going to be considered “the same time.” The day of a case’s diagnosis would usually be too narrow and within a year would probably be too broad. This is a practical decision that depends on factors like how frequently cohort subjects are seen by the investigators and how frequently measurements are made.

Slide 5: Case-Control: Sample Baseline

Case-cohort design: sample baseline of cohort

“Case-cohort” is a type of design you may not be acquainted with, as it is relatively new and still has not been used frequently. It was first described by the statistician Ross Prentice in the 1980s. It seems odd at first to realize that you will likely be sampling future cases as well as controls when you take a random sample of a cohort at its baseline. This means that a subject may be included both as a case and a control. But this is also true of incidence density sampling since a subject selected as a control at one time point may later become a case.

This troubles many new to these sampling designs and results in their thinking that the best design must be to wait until the end of follow-up to select controls so that the investigator can be sure they will not be cases. This is not the right way to think about it. For starters, becoming a case is an artifact of the follow-up period of the cohort. The investigator cannot know whether many of the controls will be diagnosed with the study outcome the day after the study ends. This is made even clearer by the example of the cohort study that uses death as an outcome, as some do. Everyone is eventually a case.

In summary, when we are looking for (i.e., sampling) controls, we do not necessarily have to guarantee that these are subjects who will never become cases. All that is needed is to be sure that they are not cases at the time of control sampling.
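A short Python sketch may make the overlap between cases and the baseline sample concrete. It is an illustration under assumed names only: the `id` and `is_case` fields, the function name, and the ten-person cohort are hypothetical, and a real analysis would also need follow-up times and appropriate weighting.

```python
import random


def case_cohort_sample(cohort, subcohort_size, seed=0):
    """Draw a random subcohort at baseline and collect all cases.

    cohort: list of dicts with hypothetical keys `id` and `is_case`.
    The subcohort is drawn without looking at who later becomes a case,
    so it can contain future cases. subcohort_size must not exceed
    the size of the cohort.
    """
    rng = random.Random(seed)
    all_ids = [s["id"] for s in cohort]
    subcohort = set(rng.sample(all_ids, subcohort_size))
    cases = {s["id"] for s in cohort if s["is_case"]}
    return {
        "subcohort": subcohort,       # comparison group sampled at baseline
        "cases": cases,               # every case arising during follow-up
        "both": subcohort & cases,    # subjects who serve in both roles
    }


# Hypothetical cohort of 10 subjects; ids 7, 8 and 9 become cases.
cohort_cc = [{"id": i, "is_case": i >= 7} for i in range(10)]
sample_cc = case_cohort_sample(cohort_cc, subcohort_size=4)
print(sample_cc)
```

Any subject listed under `both` was sampled at baseline and later diagnosed, which is exactly the situation described above.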

Slide 6: Case-Control: Prevalent Controls

Case-control design using prevalent controls at end of follow-up

This is the design that most neophytes are drawn to, as discussed in the notes on the previous slide on case-cohort design. There is an obvious source of potential bias in waiting until the end of follow-up to select controls because factors that influence loss to follow-up will influence the selection of controls. If those factors are associated with both your predictor variable and the outcome, the measure of association will be biased.
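With all three sampling methods now described, the link between each method and a measure of association can be written out. The notes above do not spell out the pairings; what follows is the standard textbook result, sketched under simplifying assumptions that are mine: a closed cohort with no loss to follow-up, all cases included, and controls drawn as a simple random sample. The symbols are also assumptions for this sketch: a and b are exposed and unexposed cases, c and d are exposed and unexposed controls, N_1 and N_0 are the numbers of exposed and unexposed subjects at baseline, and T_1 and T_0 are the exposed and unexposed person-time at risk.

```latex
% Exposure odds ratio computed from the case-control data
% (\text requires the amsmath package)
\[
\mathrm{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}
\]

% Incidence density sampling: controls reflect person-time at risk
\[
\frac{c}{d} \approx \frac{T_1}{T_0}
\quad\Rightarrow\quad
\mathrm{OR} \approx \frac{a/T_1}{b/T_0}
\quad \text{(rate ratio)}
\]

% Case-cohort sampling: controls reflect the cohort at baseline
\[
\frac{c}{d} \approx \frac{N_1}{N_0}
\quad\Rightarrow\quad
\mathrm{OR} \approx \frac{a/N_1}{b/N_0}
\quad \text{(risk ratio)}
\]

% Prevalent controls at end of follow-up: controls reflect the non-cases
\[
\frac{c}{d} \approx \frac{N_1 - a}{N_0 - b}
\quad\Rightarrow\quad
\mathrm{OR} \approx \frac{a/(N_1 - a)}{b/(N_0 - b)}
\quad \text{(odds ratio)}
\]
```

When the disease is rare, N_1 - a is close to N_1 and N_0 - b is close to N_0, so the odds ratio from prevalent controls approximates the risk ratio.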

Slide 7: Primary Study Base

We have been discussing the study base as the population from which the cases arise, and we now introduce an important distinction between a primary and a secondary study base that occurs in attempting to define that population. If the study base can be clearly and explicitly defined as the members of a cohort, or the residents of a geographic area, or the members of a health care delivery system, we call that population a “primary study base.” The advantage of being able to identify a primary study base is that there is no uncertainty about the population from which the controls should be selected. They should be selected from the same primary study base that gave rise to the cases.

Slide 8: Case-Control Key Concept #2

Many cohorts are closed: the investigators recruit a population at baseline and follow them for some period of time, and no new subjects are enrolled. But other cohorts are “open” (also called “dynamic”) in the sense that new members may be recruited as the cohort follow-up progresses (this is always true to some degree since at the beginning of a cohort all the members are almost never recruited on the same day and recruiting may be a lengthy process in some instances). For example, the San Francisco City Clinic Cohort Study of HIV/AIDS recruited additional men in several subsequent years after the baseline year: first to increase the sample size, and then later to increase the representation of minority ethnicities. A population like the members of Kaiser Permanente during some specified time period can be viewed as an open cohort because new members are constantly being added.

A dedicated research cohort, such as the Women’s HIV Study, is a type of primary study base. In showing case-control sampling designs nested within a cohort, we have been illustrating sampling from a primary study base. We now move beyond a research cohort to other types of primary study bases that are available. The most commonly used primary study bases in clinical research are defined by geographic areas for which there is a disease registry that captures nearly all cases, or by institutional membership, especially in institutions such as health care entities that capture the relevant medical information and diagnoses.

Slide 9: Case-Control: Density Sampling in a Dynamic Primary Study Base

It is easy to see the analogy between a study using this design and the graphic showing incidence density sampling nested within a cohort. Residents of a defined geographic area or health care system are treated as members of an open cohort. Some leave during the study time period and others move in. If there is a good disease registry, such as a cancer registry, which captures the diagnoses of interest, the cases are all known as they would be in a dedicated research cohort study. However, many diseases do not have registries and it may be difficult to identify all of the cases, especially if it is a common disease. The HMO setting is better for diseases without registries where the diagnoses can be identified in the medical records of the organization. It may be possible to identify all the cases of a rare disease by accessing records for all of the hospitals in a geographic area (creating the study’s own registry, in effect). But this can be a difficult and expensive process, and residents who seek care outside the area’s hospitals also have to be considered unless the proportion that do so is negligible.

Slide 10: Example of Case-Control Study with Density Sampling

Flick, et al. "Use of Nonsteroidal Antiinflammatory Drugs and Non-Hodgkin Lymphoma: A Population-based Case-Control Study." Am J Epidemiol 2006; Sept 1, 164:497-504.

Incidence density sampling in case-control design has become very popular in cancer epidemiology, and this is a typical example. There are many similar studies in the recent cancer literature. The study base for this research was six SF Bay Area counties. These counties are covered by the Surveillance, Epidemiology, and End Results (SEER) registry, the NCI-funded population-based registry that aims to identify all new cancer diagnoses and follow them for long-term outcomes (End Results). Since the implicit cohort that gave rise to all the NHL diagnoses during the period these data were collected (2001 – 2004) was all the residents of those counties, the controls were randomly selected from the same counties. Incidence density sampling is indicated by selecting the controls by random-digit dialing “at the time of diagnosis” of the case to which the control was matched. Matching on age and sex is common in cancer studies because they are strong confounders of cancer risk and matching increases the study’s efficiency. Matching on county serves to focus further the study base concept by regarding cases and controls in each of the six counties as subcohorts of the overall study.

Because cancers are relatively rare outcomes, large geographic populations are needed to accumulate a useful number of cases. The presence of cancer registries makes the design practical, and survey methodology is used, as in this example, to get an unbiased sample of the study base for the controls. Unfortunately, this approach is not possible for many diseases that do not have population-based disease registries. Beginning researchers are sometimes surprised to learn there is no comparable registry for cardiovascular events, for example.

Slide 11: Case-Control Studies from a Secondary Study Base

The concept of a secondary study base occurs because cases of a given disease may be identified, but they do not come from a clearly defined population such as a cohort, geographic area, or HMO. In other words, you have the cases first and you need to determine what study base gave rise to them. Typically these are cases of a disease seen in a single hospital or other health care facility (or a limited number of hospitals/facilities). They may appear to come from a geographic area since hospitals that are not referral centers draw most of their patients from persons living in their geographical vicinity, but the difference is that the boundaries of the geographic area are difficult to determine, and there is no guarantee that many of the cases from the hospital’s catchment area are not being seen at other hospitals.

Think of taking the cases of a given disease in one San Francisco hospital and trying to decide what geographic area they represent. All of the patient addresses for persons with a diagnosis during some time period could be mapped and a boundary drawn around them, but many other cases not seen at the study hospital were probably diagnosed within that boundary and seen at other hospitals. Without nearly complete case ascertainment, there is no way to know how the characteristics of the patients who chose to come to the study hospital differ from those who went elsewhere.

For the controls to come from the same study base as the cases, they need to be those persons who would come to the study hospital if they did have the disease of interest.

Since trying to identify a geographic catchment area to define the study base population does not work in most instances, it is necessary to give careful thought to what characteristics of the cases are causing them to show up at the study hospital. If those characteristics are associated with the study’s main predictor variable, substantial bias can occur in the measure of association by choosing inappropriate controls.

Slide 12: Primary vs. Secondary Base

Since there is no ambiguity about the population that gave rise to the cases when a primary study base can be identified, the problem for case-control studies is likely to be ascertaining all the cases for diseases without comprehensive registries. There may also be logistical problems with enrolling cases with poor survival because of the lag time between diagnosis and appearance in the registry. For example, a case-control study of glioma is likely to have difficulty contacting the cases quickly enough.

With a secondary study base, all of the cases are available since such designs usually start with new cases arriving at a hospital. Determining the population to sample for controls is the challenge. The threats to validity are generally much greater with a secondary study base.

This is a crucial distinction that explains a lot of weak case-control studies that have reached erroneous conclusions and given case-control design a bad name in many circles. As can be seen in the example of the nested case-control study within a cohort, the results from a well-designed case-control study with a primary study base can be just as valid as a cohort study.

Slide 13: Secondary Study Base

Since UCSF is a major referral center for brain tumors, it is very difficult to determine the secondary study base for glioma cases. If patients with a different neurologic disease for which UCSF is also a referral center are patients who would also have come to UCSF had they been diagnosed with a glioma instead, they may represent a sample of the study base population that gave rise to the gliomas. But are all referral populations the same?

Another approach that has been used often for hospital case-control studies is to select neighborhood controls for the cases. This approach uses geography to identify possible controls, but it chooses a very narrow boundary for each case, such as the block across the street from the case’s address, from which a random address is chosen. The assumption is that someone else with a glioma in that immediate neighborhood would also have come to UCSF. This may not be a valid assumption.

Slide 14: Case-Control Key Concept #3

A type of prevalent sampling is to sample only prevalent cases. One of the advantages of doing a case-control study in a single hospital is the ready access to the cases, so in general it should be possible to pick up incident cases in the hospital setting. But even with the advantage of access, it may be difficult to include all the incident cases for diseases with poor survival. We give the example of glioma patients. It is possible that using prevalent cases may not bias the findings but there is usually no way to determine that and the possible bias could go in either direction with respect to the study hypothesis.

Q2: The effect of alcohol intake on the risk of breast cancer was investigated in the Netherlands. Caucasian women of Dutch nationality were included in the study when they lived (defined as being listed in the municipal population registry) in the geographic catchment areas of 17 hospitals during any part of the study period. During the study period, 168 cases of breast cancer in women who resided in this area were diagnosed. 548 controls were sampled from the municipal population register for the catchment areas of the hospitals. [van’t Veer et al. Int. J. Epidemiol. 18(1989):511-517]

Slide 15: Case-Control Key Concept #4

One of the major criticisms of the case-control design is that it is “retrospective.” This is correct in that the study is carried out after the disease experience of the study base has already occurred. But cohorts can also be retrospective. Looking back in time, a group of individuals is identified as a cohort (a typical example is a group of workers, such as shipyard workers in WWII) and then their disease experience over a period of time is investigated. In terms of study validity, though, the key question is not when is the study being carried out, but when were the measurements made and how good are they? The weakness in measuring predictors in a case-control study comes when subject recall is relied on. The strongest case-control studies look for measurements that were made in the past before the disease diagnosis; for example, measurements captured in medical records in an HMO. Retrospective cohort studies always depend on measurements made in the past. This is a limitation, but it can be overcome if the necessary data exists or, if the measurement is of a biological specimen, if biological samples have been stored and can be accessed.

Many important findings have come from well-designed case-control studies. This Kaiser study, headed by Joe Selby, now the director of Kaiser’s research division, was the first to show strong evidence that screening sigmoidoscopy prevented colon cancer deaths. It had a substantial influence on clinical practice, yet it wasn’t a randomized trial, a study that would have required huge numbers and many years of follow-up. The study was feasible because Kaiser has a large membership, has been in existence for a long time, and has an excellent record-keeping system.

Case-Control Design

The design is based on the primary study base of Kaiser membership over an 18-year period. All colon cancer deaths were identified, and then the subset of those deaths that could have been detected by sigmoidoscopy. Using incidence-density sampling meant that at the date of each eligible colon cancer death, 4 Kaiser members were selected at random as controls from the membership enrolled at that time (note that any one of these controls could theoretically have later become a case; I don’t know if this in fact happened). The medical records of all cases and controls were reviewed to determine if they had undergone a prior sigmoidoscopy, and, if so, whether it was done for an indication or just as a screening test. Screening sigmoidoscopy was the primary predictor and the results showed an approximately 3-fold increase in risk of colon cancer death among those not screened.
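To show how a fold-increase like this is read off case-control data, here is a minimal Python sketch of an unmatched odds ratio with a Woolf (log-based) 95% confidence interval. The counts are invented for illustration and are not the Kaiser study’s data. The function name is hypothetical, and a matched study like this one would normally use a matched analysis rather than this crude calculation.

```python
import math


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Crude odds ratio with a Woolf 95% confidence interval.

    Every cell count must be greater than zero.
    """
    a, b, c, d = exposed_cases, unexposed_cases, exposed_controls, unexposed_controls
    or_estimate = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_estimate) - 1.96 * se_log_or)
    upper = math.exp(math.log(or_estimate) + 1.96 * se_log_or)
    return or_estimate, lower, upper


# Invented counts, with "exposure" defined as NOT having been screened:
# 90 of 100 cases unscreened, 300 of 400 controls unscreened.
or_estimate, ci_lower, ci_upper = odds_ratio(
    exposed_cases=90, unexposed_cases=10,
    exposed_controls=300, unexposed_controls=100,
)
print(or_estimate, ci_lower, ci_upper)
```

With these made-up counts the odds ratio is 3.0, meaning the odds of being unscreened are three times higher among cases than among controls.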

Slide 16: Critical Features of Good Case-Control Design

Case-control studies with all of these design features are a strong and valid study design that can produce results as convincing as any other type of observational study.

Q3: This abstract describes a case-control study carried out in Africa.

Ziegler, JL; Newton, R; Katongole-Mbidde, E; Mbulataiye, S; De Cock, K; Wabinga, H; Mugerwa, J; Katabira, E; Jaffe, H; Parkin, DM; Reeves, G; Weiss, R; Beral, V. Risk factors for Kaposi's sarcoma in HIV-positive subjects in Uganda. AIDS 1997 Nov, 11(13):1619-26.

BACKGROUND: Kaposi's sarcoma (KS) is associated epidemiologically with HIV infection and with human herpesvirus 8 (HHV-8 or KSHV). Both KS and HIV infection are common in Uganda. We conducted a case-control study of 458 HIV-seropositive Ugandan adults with KS and 568 HIV-seropositive subjects without KS to examine risk factors for HIV-associated KS.

METHODS: We recruited newly diagnosed adult KS cases from KS clinics at five hospitals in Kampala, Uganda and adult controls from the general clinics for HIV infection at these 5 hospitals. All cases and controls were counselled and tested for HIV and answered an interviewer-administered questionnaire about their home, socio-economic conditions, lifestyle and sexual behaviour before they became ill. Only HIV-seropositive subjects were included in the analysis.

RESULTS: There were 295 males and 163 females with KS and 227 male and 341 female controls. KS cases were more likely than controls to have a higher level of education (P = 0.03) and to have occupations associated with affluence (P = 0.004). Cases were more likely than controls to have high household income (P < 0.001) and other markers of urban or rural wealth such as owning several cows (P = 0.002).

CONCLUSIONS: Among HIV-infected subjects, KS cases are characterized by better education and greater affluence, compared with controls. The higher socio-economic status of persons with HIV and KS may be a marker for enhanced exposure to a possibly sexually transmitted agent, or for a delayed exposure to a childhood infection.