Geocoding: What is it and why is it necessary?

This article is derived from materials created by Dr. David Rehkopf.

Geocoding involves assigning an exact location (a numeric code) to participants in a study.

Why is geocoding necessary?
How do I geocode my data?
How do I link my geocoded data to census SES data?

Why is geocoding necessary?

Geocoding is necessary because of the lack of socioeconomic data in most US public health surveillance systems. This is a problem because, absent this data, we cannot:

monitor socioeconomic inequalities in US health;
ascertain their contribution to racial/ethnic and gender inequalities in health; and
galvanize public concern, debate, and action concerning how we, as a nation, can achieve the vital goal of eliminating social disparities in health.

Geocoding public health surveillance data and using census-derived area-based socioeconomic measures (ABSMs) to characterize both the cases and population in the catchment area enables computation of rates stratified by the area-based measure of socioeconomic position. This can serve to close knowledge gaps which presently exist concerning which ABSMs, at which level of geography, would be most apt for monitoring US socioeconomic inequalities in health, overall and within diverse racial/ethnic-gender groups.

The Public Health Disparities Geocoding Project was launched to ascertain which ABSMs, at which geographic level (census block group [BG], census tract [CT], or ZIP code [ZC]), would be suitable for monitoring US socioeconomic inequalities in health. This project resulted in the recommendation that US public health surveillance data should be geocoded and routinely analyzed using the CT-level measure "percent of persons below poverty," thereby enhancing efforts to track, and improve accountability by addressing, social disparities in health in the United States. For details on this project, see http://www.hsph.harvard.edu/thegeocodingproject/webpage/monograph/execsummary.htm, from which this information was drawn.

How do I geocode my data?

A census-tract number can be obtained by going to the following website:

http://factfinder.census.gov/servlet/AGSGeoAddressServlet?_lang=en&_programYear=50&_treeId=420

An address can be typed in, a year and program (such as Census 2000) can be selected, and census geographies, including CT, will be generated.

Issues that may arise in the data to be geocoded include P.O. boxes, which cannot be mapped, and addresses that need cleaning (particularly if the data set is larger than a few hundred observations). In the latter case, address-cleaning macros may be obtained.
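As an illustration of the kind of screening that address cleaning involves, here is a minimal Python sketch that trims stray whitespace and sets aside P.O. box addresses before geocoding. The pattern and the function names are hypothetical; this is not one of the address-cleaning macros mentioned above, and real address data will need more thorough rules.

```python
import re

# Hypothetical pattern: matches common spellings such as "PO Box 12",
# "P.O. Box 12", "P O Box 12", and "Post Office Box 12".
PO_BOX = re.compile(r"\b(?:p\.?\s*o\.?|post\s+office)\s*box\b", re.IGNORECASE)


def clean_address(raw):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(raw.split())


def split_mappable(addresses):
    """Return (mappable, po_boxes) after basic whitespace cleaning."""
    mappable, po_boxes = [], []
    for raw in addresses:
        addr = clean_address(raw)
        # P.O. boxes cannot be mapped to a census tract, so set them aside.
        (po_boxes if PO_BOX.search(addr) else mappable).append(addr)
    return mappable, po_boxes
```

Counting the addresses in the second list gives a quick sense of how many records cannot be assigned a census tract from the address alone.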

How do I link my geocoded data to census SES data?

The U.S. Census website is an excellent resource for Census information and documentation, and all the data you need resides somewhere on this site. However, there are some details involved in obtaining and using U.S. Census data that are not immediately clear. Much like geocoding, using the U.S. Census data involves a number of decisions that can have a substantial impact on study results.

To put census data in the context of the process of geocoding cases and creating rates for the outcomes of interest, three distinct types of data are needed:

Geocoded Case Data, which provides data for the numerator,
Population Count Data, which provides data for the denominator, and
Area-Based Socioeconomic Measure Data.
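To show how these three types of data fit together, the following Python sketch computes rates stratified by an area-based measure. It assumes each data set has already been reduced to census-tract identifiers; the function name, argument names, and data layout are illustrative assumptions, not part of any census product.

```python
from collections import defaultdict


def rates_by_stratum(case_tracts, population, absm_stratum, per=100_000):
    """Compute rates per `per` persons for each ABSM stratum.

    case_tracts  -- one census-tract ID per geocoded case (numerator)
    population   -- dict: tract ID -> population count (denominator)
    absm_stratum -- dict: tract ID -> stratum label of the area-based measure
    """
    cases = defaultdict(int)
    persons = defaultdict(int)
    # Every tract in the catchment area contributes to the denominator,
    # including tracts with no cases.
    for tract, count in population.items():
        persons[absm_stratum[tract]] += count
    for tract in case_tracts:
        cases[absm_stratum[tract]] += 1
    return {s: per * cases[s] / persons[s] for s in persons if persons[s] > 0}
```

A tract that appears in the case data but has no stratum label raises a KeyError in this sketch, which is a useful prompt to check the geocoding or the ABSM file rather than silently dropping the case.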

The focus here is on year 2000 census data. Census data from prior years are for the most part similar, but there are some differences.

Year 2000 U.S. Census data can be obtained at http://www.census.gov. The data consist of:

I. 2000 U.S. Census data: an overview

The 2000 census data is organized into 4 summary files (SF). These summary files are from 2 different sources:

Short form data, which is based on data collected on every person in the U.S., and
Long form data, which is based on more detailed forms sent to a sample of the U.S. population, approximately 1 in 6.

Two summary files, SF1 and SF2, are from complete count data. The data in SF1 and SF2 are tables reporting counts by age, gender, race, and ethnicity. The difference is that SF2 contains much more detailed information on race/ethnicity, for example, specific country of origin, Native American tribe, and whether multiple races/ethnicities were reported. An important note is that these detailed tables in SF2 are subject to data suppression. Specifically, for tables with fewer than 100 individuals, data is not released.

The other two summary files, SF3 and SF4, contain data from the long form, which is taken from the 1 in 6 sample of the population. SF3 includes detailed tables with socioeconomic and housing data. This is the source of data for more detailed population information such as poverty, crowding, and occupation, and it is this SF3 data that is used to create ABSMs.

A. Summary File Data (http://www.census.gov/main/www/cen2000.html)

SF1: short form, complete count data (age, gender, race, ethnicity)
SF2: short form, complete count data (detailed race and ethnicity)
SF3: long form, data based on sample (socioeconomic and housing data)
Technical note on SF1 vs. SF3: (http://www.census.gov/Press-Release/www/2002/sf3compnote.html)

Differences between SF1 (taken from a complete enumeration of the population) and SF3 (taken from a sample of the population) counts:

SF3 counts are weighted estimates for an area, and there are wider confidence limits for smaller areas;
In the 2000 Census, the weighting areas used were those where 200 or more long forms were completed, giving more accurate estimates;
However, this results in a mismatch between SF1 and SF3 estimates for small areas or strata.

To be clear, this weighting is in general useful: for the more detailed data from the long forms, the sampling and weighting result in more accurate estimates of detailed SF3 data. The mismatch occurs for small areas, and particularly for variables that result in small strata.

B. Recommendations

For Denominator data, use SF1 data
For creating ABSMs, use SF3 data (socioeconomic and housing data)
Remember that "the official values for items reported on the short form come from SF1 and SF2" (U.S. Census Bureau Documentation)

II. Sources of U.S. Census data

A. American Factfinder: These data come directly from the U.S. Census. Factfinder is best for using data from a small number of areas (http://factfinder.census.gov/home/saff/main.html?_lang=en).

Advantages:

easy to use
all census data available
data from official source
output to html, Excel, comma delimited text
free

Disadvantages:

time consuming if downloading data from more than a few areas

B. FTP raw data: Best for using data from a large number of areas

C. DataFerrett (http://dataferrett.census.gov) is a data mining tool (free, downloadable software) that accesses data stored in TheDataWeb through the internet

D. Commercial Vendor

Advantages:

may be easier to manipulate data for a large number of areas
special packages allow for built-in mapping and longitudinal analysis
Windows-based

Disadvantages:

cost ($1500 for year 2000, $4500 for 1970-2000 for the whole United States for a single user; the cost for one state is roughly half that price. For 20-25 users, full package is ~$20,000 or $10,000 for one state)
all census data may not be available
data manipulation and compression may occur in order to fit data on CDs

III. Using raw FTP U.S. Census data

Considering these options, when doing an analysis at the CT level for a large area such as a state, it is recommended to obtain the raw U.S. Census data by FTP download from the census. [This discussion relates to SAS, but the same principles and process apply for other database programs, and code is available from the census for entering data. Census documentation also notes that some state data centers have code for inputting data into SPSS, but this varies by state.]

There are three basic steps to this process:

A. Step 1. Figure out what data are needed and where they are located

SF1 Technical documentation, FTP read me, FTP download at http://www.census.gov/Press-Release/www/2001/sumfile1.html
SF3 Technical documentation, FTP read me, FTP download at http://www.census.gov/Press-Release/www/2002/sumfile3.html

N.B. The Summary File document is over 600 pages in length, so it is best to read it on your computer rather than try to print it. The computer version also has a very handy text search feature in Adobe Acrobat that works very well.

Census data are organized into tables, and Chapter 6 of this document contains documentation of what information is in each table.

B. Step 2. Download raw data and geographic header file

Use "FTP read me" doc to determine file that has table of interest
Download raw data file and geographic header file

At the FTP site, in the file for whatever state you are interested in (data is listed separately by state), you can now download the zipped file (e.g., ma0002_uf1.zip is the file for 02 Massachusetts data, with the naming convention being state, file name, and f1 for summary file 1).

Very importantly, you will also need to download the geographic header file, with the naming convention (in this case) "mageo_uf1.zip".
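Because the geographic header file is easy to forget, a quick check of the download folder can help. The Python sketch below is hypothetical: it assumes the file names follow the pattern of the two examples above (state abbreviation, digits, "_uf" plus the summary file number for data files, and state abbreviation plus "geo_uf" for the header), and that SF3 files use the same pattern with "uf3". Verify the actual names against the "FTP read me" document.

```python
import re


def check_downloads(filenames, state, sf=1):
    """Return (data_files, has_geo_header) for one state's downloads.

    filenames -- names of the files present in the download folder
    state     -- two-letter postal abbreviation in lower case, e.g. "ma"
    sf        -- summary file number (assumed to be 1 or 3)
    """
    # Data files: state abbreviation, a run of digits, then "_uf<sf>.zip".
    data_pattern = re.compile(rf"^{re.escape(state)}\d+_uf{sf}\.zip$")
    # Geographic header: state abbreviation followed by "geo_uf<sf>.zip".
    geo_name = f"{state}geo_uf{sf}.zip"
    data_files = sorted(f for f in filenames if data_pattern.match(f))
    return data_files, geo_name in filenames
```

Passing the result of os.listdir for the download folder as filenames is one way to use it; a False in the second position means the geographic header still needs to be downloaded.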

C. Step 3. Download appropriate SAS program and read in raw data

SF1: http://www.census.gov/support/SF1ASCII.html
SF3: http://www.census.gov/support/SF3ASCII.html
Go to "Using SAS," click download to get zipped SAS files
Modify SAS programs as indicated (path to file, merge, SUMLEV=; e.g., census tract for 2000 is sumlev "080")

After unzipping this file, choose the file that corresponds to the data you downloaded.
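For readers not using SAS, the merge and SUMLEV steps can be sketched in Python as follows. This is an illustration of the logic only, not a translation of the Census Bureau's SAS programs. It assumes both files have already been parsed into records, and that the geographic header and data files share a record identifier, here called LOGRECNO; confirm the field names and layouts in the technical documentation.

```python
def select_summary_level(geo_records, data_records, sumlev):
    """Join data rows to geographic header rows and keep one summary level.

    geo_records  -- iterable of dicts with at least "LOGRECNO" and "SUMLEV"
    data_records -- iterable of dicts with at least "LOGRECNO"
    sumlev       -- summary level code to keep, as a string
    """
    # Index only the header rows at the requested summary level.
    geo_by_recno = {
        g["LOGRECNO"]: g for g in geo_records if g["SUMLEV"] == sumlev
    }
    merged = []
    for row in data_records:
        geo = geo_by_recno.get(row["LOGRECNO"])
        if geo is not None:
            # Combine geographic identifiers with the table values.
            merged.append({**geo, **row})
    return merged
```

The summary level is passed in by the caller, so the same function serves whichever geographic level the analysis requires.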

The procedure for obtaining ABSM data from SF3 is identical to the procedure for obtaining the denominator data from SF1. For example, to create an ABSM for poverty, you would find table P87 in the SF3 documentation and then go through the same three basic steps: locate the data, download the raw data and geographic header file, and use the SAS code to read the data into SAS.
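Once the poverty counts have been read in, the ABSM itself is a simple percentage. The Python sketch below assumes each census tract record carries two counts taken from the poverty table: the number of persons for whom poverty status was determined and the number below the poverty level. The key names "tract", "total", and "below" are hypothetical; map them to the actual table fields listed in the SF3 documentation.

```python
def percent_below_poverty(total_determined, below_poverty):
    """Percent of persons below poverty among those with a determined status."""
    if total_determined <= 0:
        return None  # no population with a determined poverty status
    return 100.0 * below_poverty / total_determined


def poverty_by_tract(rows):
    """Map each tract ID to its percent of persons below poverty.

    rows -- dicts with hypothetical keys "tract", "total", and "below"
    """
    return {
        r["tract"]: percent_below_poverty(r["total"], r["below"]) for r in rows
    }
```

Tracts with no eligible population return None rather than zero, so that they are not mistaken for areas with no poverty.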

Summary and recommendations for using 2000 Census Data

We recommend obtaining data directly from the U.S. Census, using Factfinder for simple queries and FTP download of raw data for analyses involving many areas.
We recommend using SF1 data for denominator data, and SF3 data for creating ABSMs.
Complete documentation of what 2000 Census data is available in which table is available at www.census.gov.
To use FTP raw data, you need: raw data, geographic header file, and SAS code to read in data.

Obtaining and manipulating the census data is a relatively painless process. Even starting with only a familiarity with SAS, it is quite easy to get and use this data.