-- BarbaraGrimes - 13 Mar 2009

Table1 Preliminary Checklist

First, check "N=". Is this the number of observations you were expecting?

Continuous Variables:
  1. Are the mean and median values reasonable?
  2. Are the mean and median values similar? If not, you may need to think about transforming this variable or creating a categorical version. If the mean is much larger than the median, the variable may be right-skewed. For a predictor variable, this may suggest that it will not meet the linearity assumption if modeled as an untransformed continuous predictor in regression models; in that case, consider a logarithmic transformation or modeling it as a categorical predictor (e.g., quartiles). For an outcome variable, skewness may suggest that the untransformed numeric scale is not the most meaningful one, for example, if a 1-point difference is more important at low values of the outcome than at high values. A logarithmic transformation may again be worth considering; it is appropriate if, for example, a 50% difference is equally important at both low and high values of the outcome.
  3. For variables that can only take on non-negative values, is the SD more than half as large as the mean? This can also indicate right-skewness.
  4. Are the min and max values (that are shown with the median) consistent with the possible values for this variable? If not, there is probably a data problem that should be investigated and systematically corrected.
  5. Are the min or max more than 3 SDs from the mean? If so, these may be overly influential outliers that need to be considered when presenting summaries and performing analyses. The median is often a better summary than the mean when outliers are present. Logarithmic transformation often helps with positive outliers. DFBETAs can be calculated for regression models to assess the impact of each observation on the regression estimates. Sensitivity analyses can be performed with the outliers deleted.
  6. If dates are shown, consider the min and max dates. Are they reasonable? Date values may be corrupted when moving from one source to another, e.g., from an Excel spreadsheet to SAS.
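
The numeric checks in items 2 through 5 can be scripted so that every continuous variable is screened the same way. The following is a minimal sketch in Python using only the standard library. The function name `check_continuous`, its `valid_min` and `valid_max` parameters, and the 1.25 ratio used for "much larger" are illustrative assumptions, not part of the checklist itself.

```python
import statistics


def check_continuous(values, valid_min=None, valid_max=None):
    """Summarize one continuous variable and flag checklist items 2-5.

    valid_min / valid_max are the smallest and largest values the
    variable can possibly take (hypothetical parameters, supplied by you).
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample SD; needs at least 2 values
    lo, hi = min(values), max(values)
    flags = []

    # Item 2: mean much larger than median suggests right skew.
    # The 1.25 ratio is an arbitrary illustrative cutoff.
    if median > 0 and mean > 1.25 * median:
        flags.append("mean much larger than median")

    # Item 3: only applied when no negative values are observed.
    if lo >= 0 and sd > mean / 2:
        flags.append("SD more than half the mean")

    # Item 4: observed min or max outside the possible range.
    below = valid_min is not None and lo < valid_min
    above = valid_max is not None and hi > valid_max
    if below or above:
        flags.append("value outside possible range")

    # Item 5: min or max more than 3 SDs from the mean.
    if sd > 0 and (hi - mean > 3 * sd or mean - lo > 3 * sd):
        flags.append("min or max more than 3 SDs from mean")

    return {"n": len(values), "mean": mean, "median": median, "sd": sd,
            "min": lo, "max": hi, "flags": flags}


# Hypothetical example: length of stay in days, with one suspicious value.
stay = [2] * 5 + [3] * 5 + [4] * 5 + [5] * 4 + [60]
report = check_continuous(stay, valid_min=0, valid_max=30)
for flag in report["flags"]:
    print(flag)
```

A flag is only a prompt to look at the data; whether to transform, categorize, or correct the variable remains a judgment call as described above.
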
Categorical variables:
  1. Are the categories labeled correctly? Look for misspelled words. Misspellings often occur in data, e.g., "Mlae" for "Male". The computer treats these as two different categories.
  2. Do some variables have a large number of missing values? Was this expected or is there a problem with the source data?
  3. Do some categories need to be combined due to small counts? Especially consider categories with counts less than 5.
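
The categorical checks above can be sketched in the same way. This example, again in Python with only the standard library, tabulates the categories, counts missing values, lists categories with counts below 5, and compares each rare label against the common ones to suggest possible misspellings. The function name `check_categorical`, the treatment of `None` and empty strings as missing, and the 0.7 similarity cutoff are assumptions made for illustration.

```python
from collections import Counter
from difflib import get_close_matches


def check_categorical(values, small=5, cutoff=0.7):
    """Tabulate one categorical variable and flag checklist items 1-3."""
    # Assumption: None and empty strings represent missing values.
    present = [v for v in values if v not in (None, "")]
    n_missing = len(values) - len(present)
    counts = Counter(present)

    # Item 3: categories with counts below the threshold (default 5).
    small_cats = {label: n for label, n in counts.items() if n < small}
    common = [label for label, n in counts.items() if n >= small]

    # Item 1: a rare label that closely resembles a common label
    # may be a misspelling of it.
    possible_typos = {}
    for label in small_cats:
        matches = get_close_matches(label, common, n=3, cutoff=cutoff)
        if matches:
            possible_typos[label] = matches

    return {"counts": dict(counts), "n_missing": n_missing,
            "small": small_cats, "possible_typos": possible_typos}


# Hypothetical example data containing the "Mlae" misspelling.
sex = ["Male"] * 12 + ["Female"] * 10 + ["Mlae"] + ["Other"] * 2 + [None] * 3
report = check_categorical(sex)
print(report["counts"])
print("missing:", report["n_missing"])
print("small categories:", report["small"])
print("possible misspellings:", report["possible_typos"])
```

String similarity only suggests candidates. A rare label such as "Other" may be a legitimate category that should be kept or combined, so each suggestion still needs to be checked against the source data.
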