# Title: UCSF - Data Exploration

## Slide 1: Multiple Predictor Regression

• Assess the relationship between an outcome and multiple predictors
• Powerful tool for:
  • understanding complex relationships
  • controlling confounding
  • prediction / risk stratification
• Regression models differ by outcome type, but all have much in common

## Slide 3: More on Data Type

• Data type implies a plausible probability distribution
• Different data types call for different summaries: the sample mean is not always interpretable
• Distinctions between different types can be flexible

## Slide 4: Model depends on outcome type

• Continuous -- linear or gamma model
• Discrete (counts) -- Poisson/negative binomial
• Binary -- logistic, relative risk models, survival models when follow-up varies
• All easily implemented in Stata
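As a rough sketch of how these choices map onto Stata estimation commands, the lines below pair each outcome type with one standard command. All variable names (y, cost, visits, chd, time, x1, x2) are hypothetical placeholders, not variables from the course data.

```stata
* Hypothetical variables: x1 and x2 are predictors; outcome names are placeholders

* Continuous outcome: linear model
regress y x1 x2

* Continuous, positive, right-skewed outcome: gamma model with log link
glm cost x1 x2, family(gamma) link(log)

* Counts: Poisson model, or negative binomial if overdispersed
poisson visits x1 x2
nbreg visits x1 x2

* Binary outcome: logistic model (reports odds ratios)
logistic chd x1 x2

* Binary outcome: relative risk model (log-binomial; may fail to converge)
glm chd x1 x2, family(binomial) link(log) eform

* Binary event with varying follow-up: declare survival data, then fit a Cox model
stset time, failure(chd)
stcox x1 x2
```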

## Slide 5: Data Exploration

• Find data errors
• Assess missingness
• Detect anomalous observations and outlying data values
• Select appropriate analysis methods
• Support a formal data analysis
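A first pass at these goals might look like the sketch below. It assumes a dataset is already loaded and that systolic blood pressure is stored in a variable named sbp (an assumed name); the cutoff of 200 is an arbitrary illustration, not a recommended threshold.

```stata
* Assumes data are in memory; sbp is an assumed variable name

* Variable names, storage types, and labels
describe

* One-line summary per variable: n, mean, min, max, number of unique values
codebook, compact

* Count missing values for every variable that has any
misstable summarize

* Percentiles, extreme values, skewness, and kurtosis for one variable
summarize sbp, detail

* Inspect implausibly large values; Stata treats missing as larger than any number,
* so exclude missing values explicitly
list sbp if sbp > 200 & !missing(sbp)
```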

## Slide 6: Data Example

• Western Collaborative Group Study
• Large early observational study (n=3154)
• Association between "type A" behavior and coronary heart disease (CHD)
• Example variable: systolic blood pressure

## Slide 9: Histograms

• Shows location, spread, and shape of the distribution
• Horizontal axis: intervals or "bins" in which data values are grouped
• Vertical axis: number, fraction, or percent of the observations in each bin

## Interpreting Histograms

• Pattern of bar heights conveys shape of distribution:
  • number of modes
  • skewness
  • long or short tails
• Usefulness depends on number of bins
  • too many defeats goal of summarization
  • too few obscures shape of distribution

## Slide 10: Stata Commands

• histogram varname: graph a histogram
• histogram varname, bin(x): histogram with x bars
• histogram varname, freq: histogram showing frequencies, not fractions
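To see how the number of bins changes the picture, the same variable can be drawn several times. This is a sketch assuming systolic blood pressure is stored as sbp (an assumed name); the bin counts are arbitrary choices for illustration.

```stata
* sbp is an assumed variable name for systolic blood pressure

* Too few bins: shape of the distribution is obscured
histogram sbp, bin(5) frequency

* A moderate number of bins
histogram sbp, bin(15) frequency

* Too many bins: little summarization
histogram sbp, bin(100) frequency

* Density scale with a Normal curve overlaid for comparison
histogram sbp, normal
```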

## Slide 11: Boxplot

• Box with whiskers; whisker limits set by upper & lower fences
• Box: 25th percentile, median, 75th percentile
• Length of box: interquartile range (IQR)
• Lower fence: 25th percentile minus 1.5*IQR
• Upper fence: 75th percentile plus 1.5*IQR
• Values outside the fences: outliers
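The 1.5*IQR limits can also be computed by hand, which makes the outlier rule concrete. This is a minimal sketch assuming a variable named sbp (an assumed name); the scalar names are illustrative, and the percentiles from summarize may differ slightly from those used to draw the plot.

```stata
* sbp is an assumed variable name; scalar names are illustrative

* Get the quartiles without printing the full summary
quietly summarize sbp, detail
scalar q1 = r(p25)
scalar q3 = r(p75)

* Interquartile range and the 1.5*IQR limits
scalar iqrval  = scalar(q3) - scalar(q1)
scalar lofence = scalar(q1) - 1.5*scalar(iqrval)
scalar hifence = scalar(q3) + 1.5*scalar(iqrval)
display "IQR = " scalar(iqrval) ", limits = " scalar(lofence) " to " scalar(hifence)

* Number of observations flagged as outliers (missing values excluded)
count if (sbp < scalar(lofence) | sbp > scalar(hifence)) & !missing(sbp)
```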

## Slide 12: Using a Boxplot

• Location: given by lines in box, median
• Spread: given by size of box, IQR
• Skewness: relative distances between the median line and the two ends of the box
• Outliers are clearly marked; you can usually tell how many there are and their values

## Slide 13: Stata Command

• graph box varname: graph a boxplot
• graph box varname, over(grpvar): side-by-side boxplots based on grpvar
• graph box varname1 varname2, over(grpvar): side-by-side boxplots for two variables
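Applied to concrete variables, the three forms might look like this. The names sbp, dbp, and behpat (a behavior-pattern grouping variable) are assumptions about how the data are stored.

```stata
* sbp, dbp, and behpat are assumed variable names

* Single boxplot
graph box sbp

* Side-by-side boxplots, one per level of the grouping variable
graph box sbp, over(behpat)

* Two variables shown within each group
graph box sbp dbp, over(behpat)
```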

## Slide 14: qq Normal Plot

• Graphical approach to assessing Normality
• Vertical axis (y axis): sorted data values
• Horizontal axis (x axis): expected data values if the data were Normal
• If the plot is straight, the data are approximately Normal
• Shape indicates the nature of the violation, if any

## Slide 15: Using qqNormal Plot

• Right skew: plot curved up
• Left skew: plot curved down
• Outlier: values far off line
• Stata: qnorm varname
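These patterns can be checked on simulated data where the shape is known in advance. The sketch below is self-contained; the sample size, seed, and variable names are arbitrary choices. Note that clear discards any data currently in memory.

```stata
* Simulated data with known shapes; clear drops the dataset in memory
clear
set obs 500
set seed 12345

* Normal, right-skewed, and left-skewed variables
generate z     = rnormal()
generate rskew = exp(z)
generate lskew = -exp(z)

* Approximately straight line
qnorm z

* Curved up
qnorm rskew

* Curved down
qnorm lskew
```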

## Slide 16: Transforming variables

Rationale:

• Make outcome more normally distributed
• Linearize predictor effects, remove interactions, equalize outcome variance

Drawbacks:

• Untransformed variable is more credible and interpretable
• Natural scale may be more meaningful: cost vs log cost
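A log transformation is one common candidate. As a sketch, assuming a positive variable named sbp (an assumed name), the transformed variable can be checked with the same plots used for the original; gladder is an optional shortcut that draws histograms for a range of power transformations.

```stata
* sbp is an assumed variable name; ln() requires positive values

* Natural log transformation
generate lnsbp = ln(sbp)

* Compare shape before and after
histogram sbp
histogram lnsbp
qnorm lnsbp

* Histograms for a ladder of power transformations of the original variable
gladder sbp
```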

## Slide 17: Frequency Tables

• Used for categorical data: loses no information
• Display raw numbers or percentages
• Can be used for continuous data
  • discards lots of information
  • may create relevant groups
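In Stata, frequency tables come from tabulate. The sketch below assumes variables named behpat, chd69, and age (all assumed names), and the age cutpoints are illustrative only; values outside the cutpoint range are set to missing by egen cut.

```stata
* behpat, chd69, and age are assumed variable names; cutpoints are illustrative

* One-way table of counts and percentages
tabulate behpat

* Include missing values as their own category
tabulate behpat, missing

* Two-way table with row percentages
tabulate behpat chd69, row

* Group a continuous variable, then tabulate the groups
egen agegrp = cut(age), at(35 40 45 50 55 60)
tabulate agegrp
```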

## Slide 18: Summary

• Types of Data: Numerical v. Categorical
• Numerical summaries: mean, SD, five-number summary
• Numerical graphs: histogram, boxplot, qq Normal plot
• Categorical: Tables
• Transformations: potentially useful