Geocoding: What is it and why is it necessary?

This article is derived from materials created by Dr. David Rehkopf.

Geocoding involves assigning an exact location (a numeric code) to participants in a study.


Why is geocoding necessary?

Geocoding is necessary because of the lack of socioeconomic data in most US public health surveillance systems. This is a problem because, absent this data, we cannot:

Geocoding public health surveillance data and using census-derived area-based socioeconomic measures (ABSMs) to characterize both the cases and population in the catchment area enables computation of rates stratified by the area-based measure of socioeconomic position. This can serve to close knowledge gaps which presently exist concerning which ABSMs, at which level of geography, would be most apt for monitoring US socioeconomic inequalities in health, overall and within diverse racial/ethnic-gender groups.
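As a minimal sketch of the computation described above, assume each area record already carries a case count, a population count, and an ABSM value. The field names, cutpoints, and numbers below are hypothetical and are not taken from the project. Rates per stratum are obtained by summing cases and population within each stratum before dividing:

```python
from collections import defaultdict


def stratum_index(value, cutpoints):
    """Return the index of the first cutpoint that value falls below."""
    for i, cut in enumerate(cutpoints):
        if value < cut:
            return i
    return len(cutpoints)


def stratified_rates(areas, cutpoints, per=100_000):
    """Sum cases and population within each ABSM stratum, then divide."""
    cases = defaultdict(int)
    population = defaultdict(int)
    for area in areas:
        s = stratum_index(area["absm"], cutpoints)
        cases[s] += area["cases"]
        population[s] += area["population"]
    return {s: per * cases[s] / population[s]
            for s in population if population[s] > 0}


# Hypothetical area-level records (illustrative numbers only).
areas = [
    {"area": "A", "absm": 3.0, "cases": 2, "population": 4000},
    {"area": "B", "absm": 4.5, "cases": 1, "population": 6000},
    {"area": "C", "absm": 25.0, "cases": 6, "population": 5000},
]
rates = stratified_rates(areas, cutpoints=[10.0, 20.0])
```

Summing within strata before dividing (rather than averaging area-level rates) weights each area by its population, so small areas do not dominate the stratum rate.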

The Public Health Disparities Geocoding Project was launched to ascertain which ABSMs, at which geographic level (census block group [BG], census tract [CT], or ZIP code [ZC]), would be suitable for monitoring US socioeconomic inequalities in health. This project resulted in the recommendation that US public health surveillance data should be geocoded and routinely analyzed using the CT-level measure "percent of persons below poverty," thereby enhancing efforts to track, and improve accountability by addressing, social disparities in health in the United States. For details on this project, see http://www.hsph.harvard.edu/thegeocodingproject/webpage/monograph/execsummary.htm, from which this information was drawn.

How do I geocode my data?

A census-tract number can be obtained by going to the following website:

http://factfinder.census.gov/servlet/AGSGeoAddressServlet?_lang=en&_programYear=50&_treeId=420

An address can be typed in, a year and program (such as Census 2000) can be selected, and census geographies, including CT, will be generated.

Issues that may arise in the data to be geocoded include P.O. boxes, which cannot be mapped, and the need for address cleaning (particularly if the data set is larger than a few hundred observations). In the latter case, address-cleaning macros may be obtained.

The U.S. Census website is an excellent resource for Census information and documentation, and all the data you need resides somewhere on this site. However, some details involved in obtaining and using U.S. Census data are not immediately clear. Much like geocoding, using the U.S. Census data involves a number of decisions that can have a substantial impact on study results.
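For the address problems noted above, a first pass can be scripted before any records are sent for geocoding. The following is a rough sketch, not one of the address-cleaning macros mentioned. The pattern, function names, and sample addresses are all hypothetical, and real address cleaning involves far more than whitespace and case:

```python
import re

# Matches common spellings such as "PO Box 12", "P.O. Box 12",
# and "Post Office Box 12" (case-insensitive).
PO_BOX = re.compile(r"\b(p\.?\s*o\.?|post\s+office)\s*box\b", re.IGNORECASE)


def clean_address(raw):
    """Collapse runs of whitespace and uppercase the address."""
    return " ".join(raw.split()).upper()


def is_po_box(address):
    """True if the address looks like a P.O. box."""
    return PO_BOX.search(address) is not None


# Hypothetical input addresses.
addresses = ["12  Main st ", "P.O. Box 44", "post office box 7", "9 Elm Ave"]
cleaned = [clean_address(a) for a in addresses]
mappable = [a for a in cleaned if not is_po_box(a)]
unmappable = [a for a in cleaned if is_po_box(a)]
```

Setting the unmappable records aside, rather than dropping them silently, makes it possible to report how many cases could not be geocoded.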

To put census data in the context of geocoding cases and creating rates for the outcomes of interest, three distinct types of data are needed:
  1. Geocoded Case Data, which provides data for the numerator
  2. Population Count Data, which provides data for the denominator, and
  3. Area Based Socioeconomic Measure Data.
The focus here is on year 2000 census data; data from prior censuses are for the most part similar, but there are some differences.
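To illustrate how the three types of data fit together, the sketch below joins them on a shared census tract identifier. It assumes the cases have already been geocoded to tract identifiers. The function name, field names, identifiers, and counts are hypothetical:

```python
from collections import Counter


def build_tract_table(case_tracts, population_by_tract, absm_by_tract):
    """Join numerator, denominator, and ABSM data on the tract identifier."""
    case_counts = Counter(case_tracts)
    table = {}
    for tract, pop in population_by_tract.items():
        table[tract] = {
            "cases": case_counts.get(tract, 0),   # numerator
            "population": pop,                     # denominator
            "absm": absm_by_tract.get(tract),      # None if no ABSM available
        }
    # Cases geocoded to a tract with no denominator cannot enter a rate.
    unmatched = sorted(t for t in case_counts if t not in population_by_tract)
    return table, unmatched


# Hypothetical tract identifiers and values.
case_tracts = ["25025000100", "25025000100", "25025000200", "25025999999"]
population = {"25025000100": 4000, "25025000200": 5000, "25025000300": 3000}
absm = {"25025000100": 7.5, "25025000200": 25.0}

table, unmatched = build_tract_table(case_tracts, population, absm)
```

Driving the join from the denominator data keeps tracts with zero cases in the table, which matters because their population still belongs in the rate denominator.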

Year 2000 U.S. Census data can be obtained at http://www.census.gov. The data consist of:

I. 2000 U.S. Census data: an overview

The 2000 census data are organized into four summary files (SF). These summary files come from two different sources:
  1. Short form data, which is based on data collected on every person in the U.S., and
  2. Long form data, which is based on more detailed forms sent to a sample of the U.S. population, approximately 1 in 6.
Two summary files, SF1 and SF2, are from complete count data. The data in SF1 and SF2 are tables reporting counts by age, gender, race, and ethnicity. The difference is that SF2 has much more detailed information on race/ethnicity, for example, specific country of origin, Native American tribe, and whether multiple races/ethnicities were reported. An important note is that the detailed tables in SF2 are subject to data suppression: for tables with fewer than 100 individuals, data are not released.

The other two summary files, SF3 and SF4, contain data from the long form, which is taken from the 1 in 6 sample of the population. SF3 includes detailed tables with socioeconomic and housing data. This is the source for more detailed population information, such as poverty, crowding, and occupation, and it is the SF3 data that are used to create ABSMs.

A. Summary File Data (http://www.census.gov/main/www/cen2000.html)
  1. SF1: short form, complete count data (age, gender, race, ethnicity)
  2. SF2: short form, complete count data (detailed race and ethnicity)
  3. SF3: long form, data based on sample (socioeconomic and housing data)
  4. Technical note on SF1 vs. SF3: (http://www.census.gov/Press-Release/www/2002/sf3compnote.html)
Differences between SF1 (taken from a complete enumeration of the population) and SF3 (taken from a sample of the population) counts:

To be clear, this sampling and weighting is in general useful, because it results in more accurate estimates of the more detailed long form data in SF3. The mismatch occurs for small areas, and particularly for variables that result in small strata.
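One way to see where the mismatch matters is to compare the SF1 and SF3 counts for the same areas and flag large relative differences. This is only a sketch: the 5% tolerance, the function names, and the counts are hypothetical and are not Census Bureau guidance:

```python
def relative_difference(sf1_count, sf3_count):
    """Difference of the SF3 count from the SF1 count, relative to SF1."""
    if sf1_count == 0:
        return None
    return (sf3_count - sf1_count) / sf1_count


def flag_mismatches(sf1, sf3, tolerance=0.05):
    """Return areas whose SF3 count differs from SF1 by more than tolerance."""
    flagged = {}
    for area, n1 in sf1.items():
        if area not in sf3:
            continue
        diff = relative_difference(n1, sf3[area])
        if diff is not None and abs(diff) > tolerance:
            flagged[area] = diff
    return flagged


# Hypothetical counts for one large and one small area.
sf1_counts = {"A": 4000, "B": 200}
sf3_counts = {"A": 4020, "B": 230}
flagged = flag_mismatches(sf1_counts, sf3_counts)
```

In this made-up example only the small area is flagged, which is the pattern the text describes.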

B. Recommendations
  1. For denominator data, use SF1 data
  2. For creating ABSMs, use SF3 data (socioeconomic and housing data)
  3. Remember that "the official values for items reported on the short form come from SF1 and SF2" (U.S. Census Bureau Documentation)

II. Sources of U.S. Census data

A. American Factfinder: These data come directly from the U.S. Census. Factfinder is best for using data from a small number of areas (http://factfinder.census.gov/home/saff/main.html?_lang=en).

B. FTP raw data: Best for using data from a large number of areas

C. DataFerrett (http://dataferrett.census.gov) is a data mining tool (free, downloadable software) that accesses data stored in TheDataWeb through the internet

D. Commercial Vendor

III. Using raw FTP U.S. Census data

Considering these options, when doing an analysis at the CT level for a large area such as a state, it is recommended to download the raw U.S. Census data by FTP. [This discussion relates to SAS, but the same principles and process apply to other database programs, and code is available from the census for entering data. Census documentation also notes that some state data centers have code for inputting data into SPSS, but this varies by state.]

There are three basic steps to this process:

A. Step 1. Figure out what data are needed and where they are located
  1. SF1 Technical documentation, FTP read me, FTP download at http://www.census.gov/Press-Release/www/2001/sumfile1.html
  2. SF3 Technical documentation, FTP read me, FTP download at http://www.census.gov/Press-Release/www/2002/sumfile3.html
N.B. The Summary File document is over 600 pages in length, so it is best to read it on your computer rather than try to print it. The electronic version can also be searched with the text search feature in Adobe Acrobat, which is very handy.

Census data are organized into tables, and Chapter 6 of this document contains documentation of what information is in each table.

B. Step 2. Download raw data and geographic header file
  1. Use "FTP read me" doc to determine file that has table of interest
  2. Download raw data file and geographic header file
At the FTP site, in the directory for the state you are interested in (data are listed separately by state), you can now download the zipped file (e.g., ma0002_uf1.zip is the file for 02 Massachusetts data, with the naming convention being state, file name, and uf1 for Summary File 1).

Very importantly, you will also need to download the geographic header file, with the naming convention (in this case) "mageo_uf1.zip".

C. Step 3. Download appropriate SAS program and read in raw data
  1. SF1: http://www.census.gov/support/SF1ASCII.html
  2. SF3: http://www.census.gov/support/SF3ASCII.html
  3. Go to "Using SAS," click download to get zipped SAS files
  4. Modify SAS programs as indicated (path to file, merge, SUMLEV=; e.g., census tract for 2000 is sumlev "080")
After unzipping this file, choose the file that corresponds to the data you downloaded.
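The logic of the SAS step (read the geographic header, keep one summary level, merge with the data file) can be sketched in Python as follows. This is not the Census Bureau's SAS program. It assumes the geographic header is a fixed-width text file and the data file is comma-delimited, with the two linked by a logical record number (LOGRECNO). The column positions below are assumptions that must be checked against the technical documentation, and the sample records are synthetic. The summary level is passed in as a parameter; "080" is used only because it is the value given in the text:

```python
import csv

# Assumed 0-based slices into each fixed-width geographic header line;
# verify these against the data dictionary in the technical documentation.
GEO_FIELDS = {"SUMLEV": slice(8, 11), "LOGRECNO": slice(18, 25)}
LOGRECNO_COLUMN = 4  # assumed position of LOGRECNO in each data row


def read_geo(lines, sumlev):
    """Return the LOGRECNO values whose summary level equals sumlev."""
    keep = set()
    for line in lines:
        if line[GEO_FIELDS["SUMLEV"]] == sumlev:
            keep.add(line[GEO_FIELDS["LOGRECNO"]])
    return keep


def read_data(lines, keep):
    """Return {LOGRECNO: [table cells]} for rows whose LOGRECNO is in keep."""
    rows = {}
    for row in csv.reader(lines):
        logrecno = row[LOGRECNO_COLUMN]
        if logrecno in keep:
            rows[logrecno] = [int(x) for x in row[LOGRECNO_COLUMN + 1:]]
    return rows


def fake_geo_line(sumlev, logrecno):
    """Build a synthetic header line laid out as assumed in GEO_FIELDS."""
    return "uSF1  " + "MA" + sumlev + "00" + "000" + "00" + logrecno


# Synthetic records standing in for the unzipped files.
geo_lines = [fake_geo_line("040", "0000001"),
             fake_geo_line("080", "0000012"),
             fake_geo_line("080", "0000013")]
data_lines = ["uSF1,MA,000,01,0000001,6349097",
              "uSF1,MA,000,01,0000012,3500",
              "uSF1,MA,000,01,0000013,4200"]

keep = read_geo(geo_lines, sumlev="080")
tract_rows = read_data(data_lines, keep)
```

With real files, the lists would be replaced by open file handles for the unzipped geographic header and data files.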

The procedure for obtaining ABSM data from SF3 is identical to the process for obtaining the denominator data from SF1. For example, if you wanted to create an ABSM for poverty, you could go into the SF3 documentation, and find table P87, and go through the same three basic steps of then downloading the data and the SAS code and reading the data into SAS.
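Once the poverty table has been read in, the ABSM itself is a simple proportion. The sketch below assumes two counts have been extracted for each tract: the number of persons for whom poverty status was determined, and the number of those below the poverty level. The function name, tract labels, and counts are hypothetical, and the specific table cells holding these counts should be confirmed in the SF3 documentation:

```python
def percent_below_poverty(total_determined, below_poverty):
    """Percent of persons below poverty, or None when the denominator is 0."""
    if total_determined == 0:
        return None
    return 100.0 * below_poverty / total_determined


# Hypothetical counts per tract: (poverty status determined, below poverty).
poverty_counts = {
    "tract_1": (4000, 300),
    "tract_2": (5000, 1250),
    "tract_3": (0, 0),
}
absm = {tract: percent_below_poverty(total, below)
        for tract, (total, below) in poverty_counts.items()}
```

Returning None for a zero denominator keeps tracts without a measurable poverty level from being treated as 0% poverty.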

Summary and recommendations for using 2000 Census Data

Obtaining and manipulating the census data is a relatively painless process. Even starting with only a familiarity with SAS, it is quite easy to get and use these data.