Disclaimer: The opinions expressed in this document are those of the author and may not represent the opinions of the U.S. Food and Drug Administration.
Type of data: continuous
Type of analysis: bivariate
Description and purpose:
Scatterplots are used to represent and explore the relationship between two continuous variables. The two axes represent the continuous variables, and each data point is plotted at the x,y coordinates corresponding to its values for the two variables. Scatterplots can help to identify whether two variables have a linear or non-linear relationship or whether they are not related to one another. Unusual observations, such as extreme values, possible coding errors, or possible outliers, can often be identified in scatterplots. Clusters of observations that do not conform to the general pattern may also be visible in a scatterplot. Exploring these divergent clusters of observations can lead to important insights.
Examples:
The following plots are based on the ADLBC data set available on the CTSPedia website. The scatterplots examine the relationship between blood glucose levels (mmol/L) at baseline and week 26 in the clinical trial.
Basic Scatterplots
Example 1
The scatterplot suggests a linear relationship between baseline blood glucose levels and the levels measured at week 26 in the trial. There is one extreme value at a baseline of about 22 mmol/L and a week 26 result of about 15 mmol/L. While extreme, the value does not appear to be discrepant from the overall pattern in the data. There is a large cluster of observations in the general vicinity of 5 mmol/L at baseline and 5 mmol/L at week 26, and the large number of observations in this area of the graph has produced some overstrikes (two or more data points overlapping).
Scatterplots with overlays
A variety of overlays can be added to scatterplots to improve the visualization of the relationship between the two variables.
Confidence Ellipses
Confidence ellipses [i], [ii], [iii] enclose a specified percentage of all the observations. Confidence ellipses can help to identify the nature of the relationship between the two variables. A narrow ellipse that is tilted away from the horizontal axis suggests a strong linear relationship between the two variables. An ellipse that is closer in shape to a circle suggests that there is little or no linear relationship between the two variables.
The confidence ellipse plots for the blood glucose lab results suggest a linear relationship between baseline and week 26 results.
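The narrowness of a confidence ellipse reflects the strength of the correlation between the two variables, so a quick numerical companion to the plot is the Pearson correlation. The following is a minimal sketch only; it assumes Stata's bundled `auto` example data (with its variables `mpg` and `weight`) rather than the ADLBC data, so that it runs without any download.

```stata
* Illustrative sketch: uses Stata's bundled auto data, not the ADLBC data
sysuse auto, clear

* Pearson correlation between the two plotted variables
correlate mpg weight
scalar rho = r(rho)              // save the coefficient before drawing graphs
display "Pearson r = " %6.3f rho

* Scatterplot to compare against the correlation
twoway scatter mpg weight, msize(small)
```

An absolute correlation near 1 goes with a narrow, tilted ellipse, and a correlation near 0 goes with a rounder one. The sign indicates whether the ellipse tilts upward or downward.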
Confidence Ellipses Overlays
Linear fit overlays
It may be helpful to overlay a scatterplot with a statistical model fit to the data. Here we add a line for the linear regression estimates and a pair of curves for the 95% confidence interval of the prediction.
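It can be worth checking which interval the overlay actually shows. In Stata, `twoway lfitci` by default draws the confidence interval for the fitted mean (option `stdp`); the `stdf` option instead draws the wider interval for an individual forecast. The sketch below contrasts the two. It is illustrative only and assumes Stata's bundled `auto` data rather than the ADLBC data; the graph names `ci_mean` and `ci_forecast` are arbitrary.

```stata
* Illustrative sketch: uses Stata's bundled auto data, not the ADLBC data
sysuse auto, clear

* Fit the same straight line that lfit and lfitci draw
regress mpg weight
scalar slope = _b[weight]        // keep the slope for later inspection

* Default lfitci band (stdp): uncertainty in the fitted mean
twoway (lfitci mpg weight) (scatter mpg weight, msize(small)), ///
    title("Interval for the fitted mean") name(ci_mean, replace)

* stdf band: wider, because it also reflects scatter of individual values
twoway (lfitci mpg weight, stdf) (scatter mpg weight, msize(small)), ///
    title("Interval for an individual forecast") name(ci_forecast, replace)
```

Most points should fall inside the `stdf` band, while many will fall outside the narrower default band.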
Linear Fit Overlays
Fractional polynomial fit overlays
It is also possible to include a non-linear model that has been fit to the data. Here we use a fractional polynomial model. The results are very similar to the linear model for these data.
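One way to judge whether the non-linear fit adds anything is to draw both fits on the same axes. The sketch below is a hedged illustration that assumes Stata's bundled `auto` data rather than the ADLBC data; it overlays `lfit` and `fpfit` on one scatterplot.

```stata
* Illustrative sketch: uses Stata's bundled auto data, not the ADLBC data
sysuse auto, clear

* Overlay a straight-line fit and a fractional polynomial fit
twoway (scatter mpg weight, msize(small))        ///
       (lfit mpg weight, lpattern(dash))         ///
       (fpfit mpg weight),                       ///
    legend(order(2 "Linear fit" 3 "Fractional polynomial fit"))
```

If the two curves nearly coincide over the range of the data, as reported for the glucose results, the simpler linear fit is probably adequate.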
Fractional fit Overlays
Plotting by sub-groups
Contrasting symbols
Scatterplots are often used to explore possible differences between subgroups. Men and women, for example, may respond differently to a medical treatment. The scatterplot below uses squares as the marker symbol for females and circles as the marker symbol for males. It is difficult to see any patterns by sex in the scatterplot because of the large cluster of overlapping points in the general vicinity of baseline 5 mmol/L and week 26 5 mmol/L.
Subgroup Overlays
Plotting by subgroups in separate panels
An alternative method to compare sub-groups is to plot the data in side-by-side panels. We have also added a linear fit overlay to these plots for this example. The differences between men and women are clearer in this scatterplot. The relationship between baseline and week 26 glucose values appears to be stronger for women, as indicated by the steeper slope of the fitted regression line. This may be influenced by the extreme value at about baseline 22 mmol/L and week 26 15 mmol/L.
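Side-by-side panels can also be produced in a single command with the `by()` option, and the visual impression of different slopes can be checked with an interaction term in a regression. The sketch below is illustrative only and assumes Stata's bundled `auto` data, with the variable `foreign` standing in for sex; the factor-variable notation requires Stata 11 or later.

```stata
* Illustrative sketch: auto data, with foreign standing in for the subgroup
sysuse auto, clear

* One panel per subgroup, each with its own linear fit and interval
twoway (lfitci mpg weight) (scatter mpg weight, msize(small)), by(foreign)

* The weight-by-foreign interaction estimates the difference in slopes
regress mpg c.weight##i.foreign
```

Panels drawn with `by()` share the same axis scales by default, which makes slopes easier to compare across panels.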
Subgroups - Separate Panels
Sunflower density plots
As mentioned previously, there is a cluster of values in the vicinity of baseline 5 mmol/L and week 26 5 mmol/L. The density of this cluster causes some overstrikes and makes it difficult to see the details of the distribution in a standard scatterplot. In these situations, an alternative approach is to use a bivariate density plot, such as a sunflower density plot. [iv] In a sunflower density plot, the plot area is divided into bins.
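In the sunflower plots described by Dupont and Plummer [iv], the bins are hexagonal. Sparse bins show individual points, and denser bins show a sunflower whose petals count the observations in the bin. The sketch below is a minimal illustration on simulated data (the variables `x` and `y` and their distributions are invented for this example and do not come from ADLBC). It uses Stata's `sunflower` command with its `binwidth()` option, which sets the bin width in units of the x variable.

```stata
* Illustrative sketch on simulated data (not the ADLBC data)
clear
set seed 12345
set obs 2000

* Hypothetical variables with a dense central cluster
generate x = rnormal(5, 1)
generate y = 0.8*x + rnormal(1, 0.6)

* Standard scatterplot: heavy overstriking in the centre
twoway scatter y x, msize(small) name(plain, replace)

* Sunflower density plot of the same data
sunflower y x, binwidth(0.5) name(sunfl, replace)
```

A smaller `binwidth()` shows more detail but leaves fewer observations per bin.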
Sunflower
Scatterplots for Change versus Average Measurement
[i] Alexandersson, A. 1998. gr32: Confidence ellipses. Stata Technical Bulletin 46: 10-13. In Stata Technical Bulletin Reprints, vol. 8, 54-57. College Station, TX: Stata Press.
[ii] Alexandersson, A. 2004. Graphing confidence ellipses: An update of ellip for Stata 8. Stata Journal 4(3): 242-256.
[iii] McCartin, B. 2003. A Geometric Characterization of Linear Regression. Statistics: A Journal of Theoretical and Applied Statistics 37(2): 101-117.
[iv] Dupont, W. D., and W. D. Plummer Jr. 2003. Density distribution sunflower plots. Journal of Statistical Software 8: 1-11.
Reference:
Code ( ):
%CODE{lang="java"}%
* Code for Examples: Stata 11 Software
*********************************************************
* Scatterplot Examples for FDA-Industry-Academia Safety Graphics WG
* Richard Forshee, FDA/CBER/OBE
* Last updated April 12, 2010
*********************************************************
clear
cd "C:\Documents and Settings\forshee\My Documents\CTSPedia"
use ADLBC // This data set is available on CTSPedia in SAS Transport format
datasignature confirm // Confirming that the data have not changed
keep if lbtestcd=="GLUC" // Selecting only glucose tests
** Checking summary statistics for quality control
summarize blstresn lbstresn
bysort visit: summarize blstresn lbstresn
stem blstresn if visit=="WEEK 26"
stem lbstresn if visit=="WEEK 26"
** Creating basic scatterplot comparing baseline and week 26 results
twoway (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(off)
graph export scatterbasic.eps, replace
* Generating data for a 50% and 90% confidence ellipse
* Requires --ellip-- from the SSC
* Use --ssc install ellip-- to install
ellip lbstresn blstresn if visit=="WEEK 26", c(f) level(90) g(ey90 ex90)
ellip lbstresn blstresn if visit=="WEEK 26", c(f) level(50) g(ey50 ex50)
** Creating a scatterplot with a 50% and 90% confidence ellipse overlay
twoway ///
    (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)) ///
    (scatter ey90 ex90, connect(l) msymbol(none)) ///
    (scatter ey50 ex50, connect(l) msymbol(none) lpattern(dash)), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) xlabel(0(5)25) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(order(2 3) rows(2) label(2 "90% of observations") ///
    label(3 "50% of observations"))
graph export scatterellip.eps, replace
** Creating a scatterplot with a linear regression overlay
twoway ///
    (lfitci lbstresn blstresn if visit=="WEEK 26", clpattern(blank) ciplot(rline) alpattern(dash)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)) ///
    (lfit lbstresn blstresn if visit=="WEEK 26"), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(row(2) order(4 1) label(1 "95% Confidence Interval of Prediction") ///
    label(4 "Linear Prediction"))
graph export scatterlfit.eps, replace
** Creating a scatterplot with a fractional polynomial overlay
twoway ///
    (fpfitci lbstresn blstresn if visit=="WEEK 26", clpattern(solid) ciplot(rline) alpattern(dash)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(rows(2) order(2 1) label(1 "95% Confidence Interval of Prediction") ///
    label(2 "Fractional Polynomial Prediction")) // key 2 is the fpfitci fitted curve, not a linear fit
graph export scatterfpfit.eps, replace
** Scatterplot comparing men and women (single plot)
twoway ///
    (scatter lbstresn blstresn if visit=="WEEK 26" & sex=="M", msize(small) msymbol(smcircle)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26" & sex=="F", msize(small) msymbol(smsquare)), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(label(1 "Males") label(2 "Females"))
graph export scattermf.eps, replace
** Scatterplot comparing men and women in two panels with lfit overlay
twoway ///
    (lfitci lbstresn blstresn if visit=="WEEK 26" & sex=="M", clpattern(solid) ciplot(rline) alpattern(dash)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26" & sex=="M", msize(small) msymbol(smcircle)), ///
    title("Blood Glucose Lab Results for Males") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)", margin(medium)) ///
    ytitle("Week 26 Result (mmol/L)", margin(medium)) ///
    xscale(range(0 25)) xlabel(0(5)25) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(order(2 1) label(1 "95% CI") label(2 "Prediction"))
graph save male, replace
twoway ///
    (lfitci lbstresn blstresn if visit=="WEEK 26" & sex=="F", clpattern(solid) ciplot(rline) alpattern(dash)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26" & sex=="F", msize(small) msymbol(smcircle)), ///
    title("Blood Glucose Lab Results for Females") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)", margin(medium)) ///
    ytitle("Week 26 Result (mmol/L)", margin(medium)) ///
    xscale(range(0 25)) xlabel(0(5)25) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(order(2 1) label(1 "95% CI") label(2 "Prediction"))
graph save female, replace
graph combine male.gph female.gph
graph export scattermf2panels.eps, replace
** Creating sunflower density plot
twoway sunflower lbstresn blstresn if visit=="WEEK 26", ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) xlabel(0(5)25) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(rows(3) label(1 "Single Observation"))
graph export sunflower.eps, replace
*
* Plotting Change vs Average
* Bland-Altman or Tukey mean-difference plot
*
gen avg = (lbstresn+blstresn)/2
label var avg "Average glucose at baseline and visit"
gen diff = lbstresn-blstresn
label var diff "Difference in glucose between visit and baseline"
twoway (scatter diff avg if visit=="WEEK 26", msize(small)), ///
    title("Blood Glucose at Baseline and Week 26") ///
    subtitle("Difference versus Average") ///
    xtitle("Average Blood Glucose at Baseline and Week 26 (mmol/L)") ///
    ytitle("Difference Between Blood Glucose" "at Week 26 and Baseline (mmol/L)") ///
    yscale(range(-10(5)10)) ylabel(-10(5)10)
graph export scatterblandaltman.eps, replace
%ENDCODE%