
## Slide 1: Cumulative Incidence

• Perhaps most intuitive measure of incidence since it is just proportion of those observed who got the disease
• Proportion = probability = risk
• Basis for Survival Analysis

Two primary methods for calculating
1. Kaplan-Meier method
2. Life table method

Both measures of incidence, cumulative incidence and the incidence rate, are valid. They make some different assumptions and are associated with different analytic techniques, but both are very useful. Cumulative incidence is probably the most intuitive because it is just the proportion of persons who got the disease. It is easy to understand but, again, it only has meaning when a time period is attached to it. The probability of cancer in a one-year period is quite different from the lifetime probability of cancer.

Incidence based on a person-time denominator is a little less intuitive but it is actually the more fundamental measure of disease occurrence.

Of the two methods for calculating cumulative incidence, the life table method is older but is now seldom used except in actuarial tables of life expectancy and a few similar settings with large numbers of persons. The Kaplan-Meier method is very similar and has become the usual method for estimating cumulative incidence, so we will focus on its methodology.

## Slide 2: Calculating Cumulative Incidence

• With complete follow-up cumulative incidence is just number of events (E) divided by the number of persons (N) = E/N
• Outbreak investigations, such as of gastrointestinal illness, typically calculate “attack rates” with complete follow-up on a “cohort” of persons who were exposed at the beginning of the epidemic.

The simplest situation in which to calculate cumulative incidence is if all of the persons are followed for the same length of time. In that case the cumulative incidence is simply the total number of events divided by the total number of persons. In long term cohort studies this never happens, but in the time limited outbreak investigations typical of a CDC investigation of gastroenteritis, it may well happen. And, although technically one still needs to attach a time period to the analysis (at one week, or some such), an outbreak of gastrointestinal illness is usually understood to be a matter of a few days, so even that element will probably be omitted.

Unfortunately, the term “attack rate” has traditionally been used to describe the proportion of persons who develop illness. As we have been arguing, this is an incorrect use of “rate,” since the denominator is just the number of persons investigated. Another example of how terminology in the literature can be confusing.

## Slide 3: Example of using denominator with complete follow-up

On June 24, 1996, the Livingston County (New York) Department of Health (LCDOH) was notified of a cluster of diarrheal illness following a party on June 22, at which approximately 30 persons had become ill …. Plesiomonas shigelloides and Salmonella serotype Hartford were identified as the cause of the outbreak… 98 attendees were interviewed. 56 (57%) of 98 respondents had illnesses meeting the case definition.

MMWR, May 22, 1998

Here is an example of such a CDC outbreak investigation. So the cumulative incidence of diarrheal illness in their cohort of 98 following the party was 57% (56/98). No time period is explicitly mentioned, but technically this is cumulative incidence over a few days following the party. Complete follow-up is possible because the time period is very short.
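The arithmetic for this complete follow-up case can be sketched in a few lines of Python. The function name `cumulative_incidence` is ours, chosen only for illustration; the counts are the ones quoted from the MMWR report.

```python
def cumulative_incidence(events: int, persons: int) -> float:
    """Cumulative incidence with complete follow-up: E / N."""
    if persons <= 0:
        raise ValueError("persons must be positive")
    return events / persons


# Counts from the outbreak report: 56 cases among 98 respondents.
attack_proportion = cumulative_incidence(56, 98)
print(f"{attack_proportion:.1%}")  # prints 57.1%
```

The result is a proportion, so it still needs its time period ("over a few days following the party") to be interpretable.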

A similar situation could be found in a clinical trial, which you will recall is formally a type of cohort study. A very well run clinical trial of a non-fatal condition with relatively short follow-up time might achieve the same amount of follow-up time on everyone enrolled if everyone were enrolled on the same day (as in the gastroenteritis example everyone was exposed on the same day). But it is a rare study that enrolls everyone on one day, yet nearly all studies stop on a given day, resulting in some difference in follow-up time even if no one was lost or dropped out.

## Slide 4: Cumulative incidence with differing follow-up times

• Calculating cumulative incidence in a cohort
• Subjects have different starting dates
• Subjects have different follow-up after enrollment
• Most cohorts have a single ending date but different starting dates for participants because of the recruitment process
• Guarantees there will be unequal follow-up time
• In addition, very rare not to have drop outs

Having identical and complete follow-up on all subjects is the exception to the norm. A typical cohort study may take months to years to recruit all of the participants. Since most studies have a single ending date, subjects enrolled at the end of the recruiting process are going to have less follow-up time in the study than those enrolled at the beginning, even without having any subjects drop out, die, or be lost to follow-up. When those latter categories are also taken into account, there is usually great variation in how long different persons are observed in a cohort study. Aside from the variation in time followed, having individuals who are lost to follow-up is a major threat to the validity of conclusions about incidence. As we stressed last time, losses to follow-up are the primary problem in the validity of cohort studies.

Since cumulative incidence has to be defined as per some time period, it is incorrect to say, when follow-up times vary by individual, that the cumulative incidence in a cohort study that ran for 3 years was the proportion with the diagnosis divided by the number enrolled in the study. For some subjects who were only in the study for 6 months or a year, it is the incidence for those time periods. So the problem is how to assign a time period to the cumulative incidence when persons are followed for different times.
• The Problem: Since rarely have equal follow-up on everyone, can’t just divide number of events by the number who were initially at risk
• The Solution: Kaplan-Meier and life tables are two methods devised to calculate cumulative incidence among persons with differing amounts of follow-up time

The two methods of solving this problem of calculating the cumulative incidence for different amounts of follow-up are called the Kaplan-Meier and the life table method. The life table approach is much older but is seldom seen in the medical literature these days because the Kaplan-Meier method has become the standard. We will focus on the Kaplan-Meier method although for large datasets life tables give the same answer.

## Slide 5: Cumulative incidence with Kaplan-Meier estimate

• Requires date last observed or date outcome occurred on each individual (end of study can be the last date observed)
• Analysis is performed by dividing the follow-up time into discrete pieces

calculate probability of survival at each event (survival = probability of no event)

The essence of the Kaplan-Meier (KM) method is having the date each outcome in the cohort occurred. Those dates divide the follow-up time of the cohort into a number of discrete pieces. The proportion surviving (probability) is calculated for each discrete piece and the overall cumulative probability of surviving is calculated by multiplying together the individual probabilities.

Every member of the cohort has to be assigned a date first seen and a date last seen or a date diagnosed.

3 Ways Censoring Occurs

Death (if death is not the study outcome)

Loss to follow-up (refused, moved, can’t be found)

End of study observation (if still alive and haven’t experienced outcome)

Each subject either experiences the outcome or is censored

To censoring caused by death we could also add diagnosis of another disease that makes the subject ineligible for further follow-up for the main disease under study. Such diagnoses would be specified in advance as censoring events.

## Slide 6: Example of Censoring - Calendar

http://twiki.library.ucsf.edu/twiki/bin/viewfile/CTSpedia/TICRDisOccurCumInc?rev=1;filename=KM1.JPG

This is the hypothetical example of a 10 person cohort study. It introduces the concept of censoring by illustrating how the individuals in the study started at different times, dropped out before the end of the study, or experienced the outcome (death in their example) before the end of the study. Here the times persons enter and leave the cohort are being shown on a calendar time scale. The solid black bars represent the length of time each person was followed in the study. D = death and C = censored observation, which means the subject either was lost to follow-up, refused to participate further in the study, or continued until the study was ended.

Person number 5 was enrolled at the very beginning and was still in the study at the end. Because this person was the only one with the full 24 months of study follow-up the text does not show this person as censored. Jeff Martine added a “C” to the figure because when these data are coded for analysis, each person must be coded as either dead or censored—there is no other possibility. For the analysis there is a single variable that indicates whether each person died or didn’t die. The “C” for persons still in the study at 24 months, the end of the study in calendar time, is often called administrative censoring. In terms of the analysis it doesn’t matter why a person is censored.

In this example, death is the outcome, but in studies with a disease diagnosis outcome death becomes a censoring event since it removes that individual from further follow-up.

## Slide 7: Example of Censoring - Time Zero

http://twiki.library.ucsf.edu/twiki/bin/viewfile/CTSpedia/TICRDisOccurCumInc?rev=1;filename=KM2.JPG

Assumption: No temporal/secular trends affecting incidence

To resolve the problem of different starting times, the analyst “shifts all the starting time to the left.” For the analysis, each person is going to be started at the same time zero. This graph shows how the data look when all of the different calendar starting times are reassigned the same follow-up starting time. Note that the time axis is now follow-up time rather than calendar time.
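A minimal sketch of this "shift to the left," assuming each subject is stored as an enrollment date, a date last seen (or date of outcome), and an event flag. The three records are hypothetical and are not the data in the figure; the helper name `follow_up_days` is also ours.

```python
from datetime import date

# Hypothetical records (illustrative only, not the data in the figure):
# (enrollment date, date last seen or date of outcome, outcome occurred?)
records = [
    (date(2000, 1, 1), date(2001, 12, 31), False),  # enrolled early, censored at study end
    (date(2000, 7, 1), date(2001, 1, 1), True),     # enrolled later, had the outcome
    (date(2001, 3, 1), date(2001, 12, 31), False),  # enrolled late, censored at study end
]


def follow_up_days(start: date, end: date) -> int:
    """Follow-up time measured from each person's own time zero."""
    return (end - start).days


# Each person is reduced to (follow-up time, event flag); calendar dates drop out.
shifted = [(follow_up_days(start, end), event) for start, end, event in records]
print(shifted)
```

Once the records are in this form the analysis no longer knows when in calendar time anyone was enrolled, which is exactly why the no-secular-trend assumption is needed.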

It should be clear that by assigning everyone the same starting time you are making the assumption that there are no calendar time trends that will affect your estimate of incidence. You have dropped calendar time, and hence trends associated with calendar time (often called “secular trends”), out of the analysis. For many studies this is probably a perfectly reasonable assumption. For one thing the difference in calendar time may be just a matter of months rather than years, so it would take a very rapidly changing trend to be important in a time frame of months. For some situations, though, this may be a dubious assumption. For example, if one were enrolling subjects to study a new infectious disease during its introductory epidemic period (HIV, SARS, ebola, etc.), temporal trends might affect incidence estimates significantly.

## Slide 8: Cumulative Incidence Key Concept #1

Calculating cumulative incidence with different follow-up times assumes the probability of the outcome is not changing during the study period

= no temporal/secular trends affecting the outcome.

This assumption does not mean that the probability of the outcome is the same during all of the follow-up time. Remember that by moving everyone to the left to start at study time zero, you are guaranteeing that one year of follow-up (study) time will fall at different calendar times for individuals who were enrolled in the study on different dates. So this concept requires that the probability isn’t changing over calendar time. It may well change over study time, as for many diseases risk increases with age.

Calculating Cumulative Incidence
• Probability of two independent events occurring is the product of the two probabilities for each occurring alone
• eg, if event 1 occurs with probability 1/6 and event 2 with probability 1/2, then the probability of both event 1 and 2 occurring = 1/6 x 1/2 = 1/12
• Probability of living to time 2 given that one has already lived to time 1 is independent of the probability of living to time 1

In order to calculate cumulative incidence, you need to understand, or at least accept on faith, the following. It is a fundamental theorem of probability that the joint probability of two independent events is the product of their individual probabilities. So the probability of flipping two heads in a row with a fair coin is ½ x ½ = ¼. The Kaplan-Meier method of calculating the cumulative probability of surviving is to treat each separate discrete piece of time as an independent trial. There was some probability of surviving the first time period; there was another probability of surviving the second time period. The probability of surviving both time periods together is the product of the individual survival probabilities.

Students sometimes balk at treating the two time periods as independent events. They say, “How can they be considered independent when it is many of the same persons in each time period?” The answer is that the probability in the second time period is conditional on a given person already having lived through the first time period. So the probability of the outcome in the second period is the probability conditional on not having experienced the outcome up until that point in time. A similar mistake is made by gamblers who think that because a coin has come up tails four times in a row the probability of heads on the next toss is better than ½. It isn’t.
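Written out, the conditional-probability argument looks like this. The symbols are our own notation, not taken from the slides: $T$ is the time of the outcome, $t_1 < t_2$ are two event times, and $d_j$ and $n_j$ are the number of events and the number still in follow-up at the $j$-th event time.

```latex
\[
P(T > t_2) \;=\; P(T > t_1)\,\times\,P(T > t_2 \mid T > t_1)
\]
\[
\hat{S}(t_k) \;=\; \prod_{j=1}^{k} \left(1 - \frac{d_j}{n_j}\right)
\]
```

The first line is the two-period case described above; the second extends it to any number of event times and is the usual way the Kaplan-Meier estimate is written.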

http://twiki.library.ucsf.edu/twiki/bin/viewfile/CTSpedia/TICRDisOccurCumInc?rev=1;filename=KM_table.JPG

Cumulative survival calculated by multiplying probabilities for

each prior failure time: e.g., 0.9 x 0.875 x 0.857 = 0.675 and

0.9 x 0.875 x 0.857 x 0.800 x 0.667 x 0.500 = 0.180

Deaths occurred at 6 different times during follow-up, so there are 6 discrete pieces of time (in this example D = death whereas in the cohort graphic D = disease diagnosis). The probability of the event is the number of deaths at each point in time (just 1 here, but it is possible to have more than 1 at the same time) divided by the number in the cohort at that time. So at 1 month of follow-up there was a death and at that time all 10 original members of the cohort were still in follow-up. The probability of death was 1/10 and the probability of survival was 1 – 1/10 = 9/10. When the second death occurs at 3 months of follow-up, only 8 persons are still in follow-up because 1 person was lost to follow-up at 2 months of follow-up. The probability of death was 1/8, of survival was 7/8, and the cumulative probability of survival was 9/10 x 7/8 = 0.788. Why not calculate a probability of survival when the 1 person was lost at 2 months? Because the probability of survival for the 9 would be 9/9 = 1 and 1 times the previous cumulative survival leaves it unchanged.
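The running product can be checked with a short sketch. The at-risk counts for the first two death times (10 and 8) are stated in the text; the remaining four are our inference from the tabulated probabilities (0.143 = 1/7, 0.200 = 1/5, 0.333 = 1/3, 0.500 = 1/2), and the function name `kaplan_meier` is illustrative.

```python
from fractions import Fraction

# (number at risk, deaths) at each death time, in follow-up order.
# Counts after the first two are inferred from the probabilities in the table.
death_times = [(10, 1), (8, 1), (7, 1), (5, 1), (3, 1), (2, 1)]


def kaplan_meier(steps):
    """Return the cumulative survival after each event time."""
    survival = Fraction(1)
    curve = []
    for at_risk, deaths in steps:
        # Conditional probability of surviving this event time.
        survival *= 1 - Fraction(deaths, at_risk)
        curve.append(survival)
    return curve


km_curve = kaplan_meier(death_times)
for (at_risk, deaths), s in zip(death_times, km_curve):
    print(f"at risk={at_risk:2d}  deaths={deaths}  cumulative survival={float(s):.4f}")
```

Exact fractions are used so that the third and sixth values come out as exactly 0.675 and 0.18, matching the table. Censored subjects never appear as rows; they only lower the at-risk count at the next death time.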

## Slide 9: Kaplan-Meier Cumulative Incidence of the Outcome

Cannot calculate by multiplying each event probability (=probability of repeating event)

(in our example, 0.100 x 0.125 x 0.143 x 0.200 x 0.333 x 0.500 = 0.0000595)

Obtain by subtracting cumulative probability of surviving from 1

eg, (1 - 0.180) = 0.82

Since it is a proportion, it has no time unit connected to it, so time period has to be added

e.g., 2-year cumulative incidence

The cumulative probability is calculated with the survival probabilities because it is only survival that happens repeatedly. If you used the probability of the event each time, you would be calculating a probability of repeated events, which is not what you want. After multiplying together all of the individual survival probabilities to get the cumulative survival probability of 0.18, the cumulative probability of death is obtained by subtracting from 1: 1 – 0.18 = 0.82.
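The contrast between the two calculations can be sketched directly from the rounded event probabilities listed on the slide. The variable names are ours.

```python
from math import prod

# Probability of death at each of the six death times, as listed on the slide.
event_probs = [0.100, 0.125, 0.143, 0.200, 0.333, 0.500]

# Wrong: multiplying event probabilities gives the chance of the event
# "repeating" at every death time, which is not a meaningful quantity here.
wrong_product = prod(event_probs)

# Right: multiply the survival probabilities, then subtract from 1.
cumulative_survival = prod(1 - q for q in event_probs)
cum_incidence_2yr = 1 - cumulative_survival

print(f"product of event probabilities: {wrong_product:.7f}")
print(f"cumulative survival:            {cumulative_survival:.3f}")
print(f"2-year cumulative incidence:    {cum_incidence_2yr:.2f}")
```

Because the inputs are rounded to three decimals, the survival product agrees with 0.180 only to about three decimal places.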

NB: Cumulative incidence cannot be interpreted without specifying the time period. The cumulative incidence of death for the whole U.S. population at 1 year is about 0.8% but at 100 years it is greater than 99.9%. For our example it is cumulative incidence at 2 years.

## Slide 10: Kaplan-Meier: Survival after Breast Cancer

Survival After Breast Cancer in Ashkenazi Jewish BRCA1 and BRCA2 Mutation Carriers

http://twiki.library.ucsf.edu/twiki/bin/viewfile/CTSpedia/TICRDisOccurCumInc?rev=1;filename=BRCA_KM.JPG

Lee et al., JNCI 1999

Here is a graph showing a Kaplan-Meier analysis of cumulative survival after breast cancer among patients grouped by whether they carry either the BRCA1 or the BRCA2 breast cancer gene mutation (N=58) versus patients without either mutation (N=979). Notice that the lines are graphed in a stepwise fashion. This is because there is a discrete jump in the cumulative incidence each time a death occurred. Note also that the two curves lie on top of one another for about two years, but there is a suggestion that the mutation carriers have better survival beyond two years or so. This observation should be viewed skeptically, though, as the numbers have become very small among both groups by 40 months and especially among the carriers (N=3). In a Kaplan-Meier graphic large steps indicate big jumps in probability due to small numbers at risk. Hence, the tail of the curve does not give precise information.

To read cumulative survival for a group from the graph, pick a time point, such as 24 months, draw a line straight up to intersect the survival curve and then a horizontal line that intersects the y-axis. Where it intersects the y-axis is the estimate of the proportion surviving at 24 months of follow-up (about 44% in these data for either group).

(See Stata Do Files: Kaplan Meier for more information on doing Kaplan-Meier Analysis in STATA 10.)

## Slide 11: Cumulative Incidence Key Concept #2

Censoring is unrelated to the probability of experiencing the outcome (unrelated to survival)

This concept takes us back to the point about the threat to validity of a cohort study coming from losses to follow-up (setting aside for the moment other issues such as measurement and confounding). Those persons lost to follow-up are the persons who are censored in the data analysis, so if the goal is to get an unbiased measure of incidence, those losses to follow-up cannot be related to the probability of the outcome. If members of the cohort who are leaving are either more or less likely to experience the outcome than those remaining, the incidence estimate will be either too high or too low.

Informative Censoring Among Patients Lost to follow-up After Initiation of Antiretroviral Therapy in Developing Countries

http://twiki.library.ucsf.edu/twiki/bin/viewfile/CTSpedia/TICRDisOccurCumInc?rev=1;filename=Cum_Inc_2.JPG Braitstein Lancet 2006
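To see how informative censoring distorts the estimate, here is a small arithmetic sketch. Every number in it is invented for illustration (none comes from the cited study), and it assumes the simplest case: a single follow-up period in which the losses occur before any outcome could be observed.

```python
# Hypothetical cohort, single follow-up period (all numbers invented).
enrolled = 100
lost = 20                     # censored before any outcome could be observed
retained = enrolled - lost    # 80 persons actually observed

events_retained = 8           # observed events: a 10% risk among those retained
events_lost = 10              # unobserved events: a 50% risk among those lost

# What the analysis reports: only retained persons contribute.
observed_incidence = events_retained / retained

# What would have been seen with complete follow-up.
true_incidence = (events_retained + events_lost) / enrolled

print(f"observed: {observed_incidence:.0%}, true: {true_incidence:.0%}")
```

Here the losses carry a higher risk than those who remain, so the reported incidence is too low; reversing the two risks would make it too high.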

## Slide 12: Life table method of estimating cumulative incidence

• Key difference from Kaplan-Meier is that probabilities are calculated for fixed time intervals, not at the exact time of each event
• Unlike Kaplan-Meier, don’t need to know date of each event
• For large data sets the life table and the Kaplan-Meier method produce nearly the same results

Life tables, as the name implies, were first constructed to look at the cumulative survival (mortality) of large populations over a lifetime. So they used both large numbers of persons and long time periods. They differ from a Kaplan-Meier analysis in that they don’t require knowing the exact time of each death (event). A life table also proceeds by dividing time up into discrete pieces, calculating survival probabilities for each piece, and obtaining a cumulative incidence by multiplying the individual probabilities together. But the time pieces are arbitrary, not determined by the time of each death (event), hence the lack of a need to know the time of each death (event). The time intervals are set by the investigator. For a lifetime life table they are typically 5-year intervals. For a cohort study, 1-year or 6-month intervals might be more typical.
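A minimal sketch of a life table calculation over fixed intervals follows. The interval counts are hypothetical. The sketch also assumes the conventional actuarial adjustment, which the text above does not describe: persons censored during an interval are counted as at risk for half of it.

```python
# Hypothetical interval data (invented for illustration):
# (number entering the interval, events in the interval, censored in the interval)
intervals = [
    (100, 5, 10),
    (85, 8, 10),
]


def life_table(rows):
    """Cumulative survival at the end of each fixed interval (actuarial method)."""
    survival = 1.0
    curve = []
    for entering, events, withdrawn in rows:
        # Assumption: censored persons are at risk for half the interval on average.
        effective_at_risk = entering - withdrawn / 2
        survival *= 1 - events / effective_at_risk
        curve.append(survival)
    return curve


lt_curve = life_table(intervals)
cumulative_incidence_lt = 1 - lt_curve[-1]
print([round(s, 3) for s in lt_curve], round(cumulative_incidence_lt, 3))
```

Only counts per interval are needed, not the date of each event, which is the practical difference from Kaplan-Meier.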

## Slide 13: Summary Points

Prevalence counts existing disease and incidence counts new diagnoses of disease

Word “rate” is often used incorrectly

Two main types of incidence
• incidence based on proportion of persons = cumulative incidence
• incidence based on person-time = incidence rate
• Kaplan-Meier or life table estimates cumulative incidence assuming losses unrelated to outcome and no temporal trends in outcome incidence

## Slide 14: Problem Set - Disease Occurrence - Cumulative Incidence

Q4: The probability that a participant in the Framingham Heart Disease study, which followed residents of Framingham, Mass. for many years for heart disease outcomes, experienced a myocardial infarction after 10 years of follow-up was estimated at 0.12. State the measure of disease occurrence (e.g., point prevalence, cumulative incidence, etc.) that is given or can be calculated from the information provided about this study.

Q5: Follow-up of 8 persons diagnosed with lung cancer.

A - Using the Kaplan-Meier method of estimating cumulative incidence, display in a table the data and the probabilities for the information below. In the figure below, C means censored and D means death.

Person #
1 ------------------C
2 ------------------------------------------------------------C
3 -----------------------------------D
4 ------------------------------------------C
5 ------------D
6 ------------------------------------------------------------C
7 -----------------------------------------------D
8 ------------------------------------------------------------C
0......1…. 2…..3…..4…..5…..6…...7…..8…..9…..10
Follow-up Times (months)

B - Calculate the cumulative incidence of death at 8 months.

C - Draw a graph of the survival function.

Q6: Consider the following abstract from a study of the risk of type 1 diabetes in siblings of type 1 diabetic patients (Harjutsalo et al. Diabetes 54:563-569, 2005)

The aims of our analysis were to obtain the empirical risk estimates for type 1 diabetes in the siblings of a Finnish population-based cohort of childhood-onset diabetic patients and search for demographic and other factors predicting the risk of type 1 diabetes in siblings. We defined the diabetes status of all siblings of all probands who are included in the nationwide register of Finnish cases for whom type 1 diabetes was diagnosed before age 18 years between 1965 and 1979. Siblings’ diabetes status was ascertained by a record search of nationwide registries through 2001, and the type of diabetes and date of its manifestation were obtained from medical records. The total number of person-years during the follow-up was 405,685. Of the 10,168 siblings at risk, 647 (6.4%) had been diagnosed with type 1 diabetes by 2001. The cumulative incidence of type 1 diabetes by ages 10, 20, 30, 40, and 50 years in all siblings was 1.5, 4.1, 5.5, 6.4, and 6.9%, respectively. A young age at diagnosis in the index case, paternal young-onset diabetes, male sex, and older parental age at delivery considerably increased the risk of type 1 diabetes for siblings. This large prospective family study of type 1 diabetes in siblings of childhood-onset diabetic patients provides reliable empirical estimates for the sibling recurrence risk.

Note: Cumulative incidence was calculated using the Kaplan-Meier technique.

A - The abstract states that 647 (6.4%) of the siblings at risk developed type 1 diabetes. Interpret what this 6.4% means. Is it a useful estimate of incidence?

B - Based on the information provided in the abstract, would a sibling born in 1991 contribute data to the cumulative incidence to age 10 years? To 50 years? When do you believe was the earliest birth date of any sibling contributing to this analysis? Explain your answers.

C - The cumulative incidence of type 1 diabetes by age 20, among siblings of individuals with type 1 diabetes, is stated to be 4.1% in this article. However, in current reports (2001) the prevalence of type 1 diabetes among 20-year-old siblings of individuals with type 1 diabetes in this region is 12%. What could explain the discrepancy in these reports, other than chance? (Clue: we are not looking for a specific biological explanation but rather a general phenomenon.)

Q7: The following abstract and results are from a study of nursing home placements (Wang et al. Medical Journal of Australia 2001; 174: 271-275).

Objective: To assess cumulative incidence and non-cognitive factors predicting nursing home placement in a defined older population.
Design and setting: Six-year follow-up of a population-based cohort living west of Sydney.
Participants: 3654 non-institutionalised residents aged 49 years or older (82.4% of those eligible) participated in baseline examinations during 1992 to 1994.
Main outcome measures: Permanent nursing home admission for long-term institutionalised aged care in New South Wales, confirmed by records of approvals by the regional Aged Care Assessment Team and subsidy payments by government.
Results: After excluding 384 participants who moved from the area or were lost to follow-up, 162 participants (5.0%) had been admitted to nursing homes on a permanent basis by October 1999. Six-year cumulative incidences for nursing home placement were 0.7%, 1.1%, 2.4%, 3.9%, 9.0%, 18.3% and 34.9% for people aged 55-59, 60-64, 65-69, 70-74, 75-79, 80-84 and 85 years or older, respectively. …
Conclusions: Incidence rates of institutional aged care doubled for each five-year interval from the age of 60 years. A range of non-cognitive factors predict nursing home placement.

Describe the possible effect of these losses to follow-up (384 participants) on the reported cumulative incidence estimates. First, conceive of and describe circumstances (scenarios) where the reported cumulative incidence is an underestimate of truth. Second, conceive of and describe circumstances where the reported cumulative incidence is an overestimate of truth.