Title: UCSF - Data Exploration
Slide 1: Multiple Predictor Regression
Slide 2: Data Types
Slide 3: More on Data Type
Slide 4: Model depends on outcome type
Slide 5: Data Exploration
Slide 6: Data Example
Slide 7: Descriptive Output
Slide 8: Cholesterol Data with Outlier
Slide 9: Histograms
Slide 10: Stata Commands
Slide 11: Boxplot
Slide 12: Using a Boxplot
Slide 13: Stata Command
Slide 14: qq Normal Plot
Slide 15: Using qqNormal Plot
Slide 16: Transforming variables
Slide 17: Frequency Tables
Slide 18: Summary
Slide 1: Multiple Predictor Regression
Assess the relationship between an outcome and multiple predictors
Powerful tool for:
understanding complex relationships
controlling confounding
prediction / risk stratification
Regression models differ by outcome type, but all have much in common
Slide 2: Data Types
Slide 3: More on Data Type
Data type implies a plausible probability distribution
Different data types call for different summaries: the sample mean is not always interpretable
Distinctions between different types can be flexible
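The point about the sample mean can be illustrated with a small sketch. It is written in Python rather than the Stata used in these slides, and the variable names, integer codes, and values are all made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical data: systolic blood pressure (continuous) and a nominal
# variable stored as arbitrary integer codes (1=A, 2=B, 3=AB, 4=O).
sbp = [118, 126, 130, 142, 154]
blood_type_code = [1, 4, 4, 2, 4]

# The mean of a continuous variable is a meaningful summary.
sbp_mean = mean(sbp)

# The mean of nominal codes can be computed, but it has no interpretation:
# it changes if the categories are simply renumbered.
code_mean = mean(blood_type_code)

# A frequency count is the appropriate summary for a nominal variable.
code_counts = Counter(blood_type_code)

print("mean SBP:", sbp_mean)
print("mean of codes (not interpretable):", code_mean)
print("counts by code:", dict(code_counts))
```

The software will happily average the codes, which is why the data type has to be considered before choosing a summary.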
Slide 4: Model depends on outcome type
Continuous -- linear or gamma model
Discrete (counts) -- Poisson or negative binomial model
Binary -- logistic, relative risk models, survival models when follow-up varies
All easily implemented in Stata
Slide 5: Data Exploration
Find data errors
Assess missingness
Detect anomalous observations and outlying data values
Select appropriate analysis methods
Support a formal data analysis
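The first three goals above can be sketched as simple checks. This is a minimal Python illustration, not part of the course's Stata workflow; the records, variable names, and plausible ranges are assumptions chosen only for the example.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "sbp": 120, "chol": 210},
    {"id": 2, "sbp": None, "chol": 645},
    {"id": 3, "sbp": 1300, "chol": 190},  # likely a data-entry error
    {"id": 4, "sbp": 138, "chol": None},
]

# Assumed plausible ranges, chosen for illustration only.
plausible = {"sbp": (60, 300), "chol": (80, 500)}


def count_missing(rows, var):
    """Number of rows in which var is missing."""
    return sum(1 for r in rows if r[var] is None)


def out_of_range(rows, var, low, high):
    """IDs of rows whose non-missing value of var falls outside [low, high]."""
    return [
        r["id"]
        for r in rows
        if r[var] is not None and not (low <= r[var] <= high)
    ]


missing = {v: count_missing(records, v) for v in plausible}
flagged = {v: out_of_range(records, v, *plausible[v]) for v in plausible}

print("missing values per variable:", missing)
print("record IDs flagged for review:", flagged)
```

A flagged value is only a candidate for review: it may be a recording error or a genuine but unusual observation, and that distinction has to be settled from the source records.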
Slide 6: Data Example
Western Collaborative Group Study
Large early observational study (n=3154)
Association between "type A" behavior and coronary heart disease (CHD)
Example variable: systolic blood pressure
Slide 7: Descriptive Output
Slide 8: Cholesterol Data with Outlier
Slide 9: Histograms
Shows location, spread, and shape of the distribution
Horizontal axis: intervals or "bins" in which data values are grouped
Vertical axis: number, fraction, or percent of the observations in each bin
Interpreting Histograms
Pattern of bar heights conveys shape of distribution:
number of modes
skewness
long or short tails
Usefulness depends on number of bins
too many defeats goal of summarization
too few obscures shape of distribution
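The effect of the number of bins can be seen in a small sketch. It is written in Python with equal-width bins and a made-up right-skewed sample; the function name `bin_counts` is hypothetical, and the sketch assumes the data are not all equal.

```python
def bin_counts(values, n_bins):
    """Count observations in n_bins equal-width bins spanning the data range."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum is placed in the last bin rather than a bin of its own.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical right-skewed sample.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12]

for n_bins in (2, 4, 11):
    counts = bin_counts(data, n_bins)
    print(f"{n_bins} bins: {counts}")
    # A crude text histogram: one '*' per observation in the bin.
    for c in counts:
        print("  " + "*" * c)
```

With 2 bins the long right tail is hidden inside one bar, with 11 bins the display is nearly a listing of the data, and 4 bins shows the single mode and the right skew.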
Slide 10: Stata Commands
histogram varname -- graph a histogram
histogram varname, bin(x) -- histogram with x bars
histogram varname, freq -- histogram with frequencies rather than densities on the vertical axis
Slide 11: Boxplot
Box with upper & lower fences
Box: 25th percentile, median, 75th percentile
Length of box: interquartile range (IQR)
Lower fence: 25th percentile minus 1.5*IQR
Upper fence: 75th percentile plus 1.5*IQR
Values outside fences: outliers
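The quantities behind a boxplot can be computed directly. The sketch below uses Python's standard library and made-up cholesterol-like values. Quartile definitions differ slightly between software packages, so the numbers need not match Stata exactly; the `inclusive` method is an assumption made here.

```python
from statistics import quantiles


def box_summary(values):
    """Quartiles, IQR, 1.5*IQR limits, and the values falling outside them."""
    # method="inclusive" is one of several quartile conventions.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": outliers,
    }


# Hypothetical values with one extreme observation.
chol = [180, 195, 200, 210, 215, 220, 230, 240, 645]

summary = box_summary(chol)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The median and IQR are barely affected by the single extreme value, which is why the boxplot can flag it without the rest of the display being distorted.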
Slide 12: Using a Boxplot
Location: given by the line inside the box (median)
Spread: given by the size of the box (IQR)
Skewness: given by the relative distances between the median line and the two ends of the box
Outliers are clearly marked
can usually tell how many and their values
Slide 13: Stata Command
graph box varname -- graph a boxplot
graph box varname, over(grpvar) -- side-by-side boxplots based on grpvar
graph box varname1 varname2, over(grpvar) -- side-by-side boxplots for two variables
Slide 14: qq Normal Plot
Graphical approach to assessing Normality
Vertical axis (y axis): sorted data values
Horizontal axis (x axis): expected data values if the data were Normal
If the plot is straight, the data are nearly Normal
Shape indicates the nature of the violation, if any
Slide 15: Using qqNormal Plot
Right skew: plot curved up
Left skew: plot curved down
Outlier: values far off line
Stata: qnorm varname
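The numbers behind a qq Normal plot can be sketched without any plotting. This Python example pairs each sorted data value with the value expected under a Normal distribution with the same mean and SD. The plotting positions (i - 0.5)/n are one common convention and an assumption here; they are not necessarily what Stata's qnorm uses. The sample is made up.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pair each sorted data value with the value expected under Normality."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))


# Hypothetical right-skewed sample.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12]

points = qq_points(sample)
for expected, observed in points:
    print(f"expected {expected:6.2f}   observed {observed:3d}")
```

For this right-skewed sample the smallest and largest observations lie above their expected values while the middle observations lie below, so the points bend away from a straight line instead of following it.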
Slide 16: Transforming variables
Rationale:
Make outcome more normally distributed
Linearize predictor effects, remove interactions, equalize outcome variance
Drawbacks:
Untransformed variable is more credible and interpretable
Natural scale may be more meaningful:
cost vs log cost
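Both the rationale and the drawback can be seen in a short sketch. It is written in Python with made-up cost values; `skewness` is a hypothetical helper implementing a simple moment-based measure.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Moment-based skewness: the average cubed standardized deviation."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])


# Hypothetical, strongly right-skewed costs.
costs = [100, 150, 200, 300, 500, 1000, 5000]
log_costs = [math.log(c) for c in costs]

print("skewness of cost:    ", round(skewness(costs), 2))
print("skewness of log cost:", round(skewness(log_costs), 2))

# Back-transforming the mean of log cost gives the geometric mean,
# which is not the mean cost on the natural scale.
print("mean cost:                 ", round(mean(costs), 1))
print("exp(mean of log cost):     ", round(math.exp(mean(log_costs)), 1))
```

The log transformation reduces the skewness, but a summary computed on the log scale no longer answers a question about average cost, which is the interpretability drawback noted above.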
Slide 17: Frequency Tables
Used for categorical data
loses no information
Display raw numbers or percentages
Can be used for continuous data
discards lots of information
may create relevant groups
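A frequency table for a categorical variable, and for a continuous variable after grouping, can be sketched as follows. This is Python, not Stata; the data, the function names, and the blood pressure cut points are illustrative assumptions.

```python
from collections import Counter


def frequency_table(values):
    """Return (category, count, percent) rows, ordered by category label."""
    counts = Counter(values)
    n = len(values)
    return [(k, counts[k], 100 * counts[k] / n) for k in sorted(counts)]


def sbp_group(x):
    """Assign a systolic blood pressure to a group; cut points are assumed."""
    if x < 120:
        return "1: <120"
    if x < 140:
        return "2: 120-139"
    return "3: >=140"


# Hypothetical data.
behavior = ["A", "B", "A", "A", "B", "A", "B", "B"]
sbp = [112, 118, 126, 134, 139, 142, 151, 165]

behavior_table = frequency_table(behavior)
sbp_table = frequency_table([sbp_group(x) for x in sbp])

for label, table in (("behavior type", behavior_table), ("SBP group", sbp_table)):
    print(label)
    for category, count, percent in table:
        print(f"  {category:12s} {count:3d} {percent:6.1f}%")
```

The table for behavior type can be expanded back into the original data, while the table for blood pressure cannot: values such as 126 and 139 become indistinguishable once grouped.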
Slide 18: Summary
Types of Data: Numerical v. Categorical
Numerical summaries: mean, SD, five-number summary
Numerical graphics: histogram, boxplot, qq normal plot
Categorical: Tables
Transformations: potentially useful