
Scatterplots and Bivariate Density Plots

Last updated by Richard Forshee on April 14, 2011

Disclaimer: The opinions expressed in this document are those of the author and may not represent the opinions of the U.S. Food and Drug Administration.

Type of data: continuous

Type of analysis: bivariate

Description and purpose:

Scatterplots are used to represent and explore the relationship between two continuous variables. The two axes represent the continuous variables, and each data point is plotted at the (x, y) coordinates corresponding to its values for the two variables. Scatterplots can help to identify whether two variables have a linear or non-linear relationship or whether they are not related to one another. Unusual observations, such as extreme values, possible coding errors, or possible outliers, can often be identified in scatterplots. Clusters of observations that do not conform to the general pattern may also be visible in a scatterplot. Exploring these divergent clusters can lead to important insights.

Examples:

The following plots are based on the ADLBC data set available on the CTSPedia website. The scatterplots examine the relationship between blood glucose levels (mmol/L) at baseline and week 26 in the clinical trial.

Basic Scatterplots

Example 1

The scatterplot suggests a linear relationship between baseline blood glucose levels and the levels measured at week 26 in the trial. There is one extreme value at a baseline of about 22 mmol/L and a week 26 value of about 15 mmol/L. While extreme, the value does not appear to be discrepant from the overall pattern in the data. There is a large cluster of observations in the general vicinity of baseline 5 mmol/L and week 26 5 mmol/L, and the large number of observations in this area of the graph has produced some overstrikes (two or more data points overlapping).

Scatterplots with overlays

A variety of overlays can be added to scatterplots to improve the visualization of the relationship between the two variables.

Confidence Ellipses

Confidence ellipses [i], [ii], [iii] draw an ellipse that contains a specified percentage of all the observations. Confidence ellipses can help to identify the nature of the relationship between the two variables. A narrow ellipse that is tilted away from the horizontal axis suggests a strong linear relationship between the two variables. An ellipse that is closer in shape to a circle indicates that there is not a linear relationship between the two variables.

The confidence ellipse plots for the blood glucose lab results suggest a linear relationship between baseline and week 26 results.
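As a numerical companion to the ellipse, the strength of the linear relationship can be summarized with a Pearson correlation. The following is a minimal sketch, assuming the ADLBC glucose records are already in memory as prepared in the code at the end of this page; it is not part of the original analysis.

```stata
* Assumes ADLBC is loaded and restricted to glucose tests (lbtestcd=="GLUC")
* Pearson correlation between week 26 and baseline results
correlate lbstresn blstresn if visit=="WEEK 26"

* -correlate- leaves the coefficient in r(rho)
display "Pearson correlation = " %5.3f r(rho)
```

A correlation near 1 or -1 would correspond to a narrow, tilted ellipse, while a value near 0 would correspond to a more circular one.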

Confidence Ellipses Overlays

Linear fit overlays

It may be helpful to overlay a scatterplot with a statistical model fit to the data. Here we add a line for the linear regression estimates and a pair of curves for the 95% confidence interval of the prediction.
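The estimates behind the fitted line can also be inspected directly. The following is a minimal sketch, assuming the ADLBC glucose records are in memory as prepared in the code at the end of this page; it fits the same simple regression of week 26 results on baseline results that the overlay draws.

```stata
* Assumes ADLBC is loaded and restricted to glucose tests (lbtestcd=="GLUC")
* Simple linear regression of the week 26 result on the baseline result
regress lbstresn blstresn if visit=="WEEK 26"

* Show the estimated slope and intercept of the fitted line
display "Slope = " %6.3f _b[blstresn] "  Intercept = " %6.3f _b[_cons]
```

Note that lfitci bases its band on the standard error of the mean prediction (option stdp) unless told otherwise; specifying stdf instead would give the wider band for an individual forecast.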

Linear Fit Overlays

Fractional polynomial fit overlays

It is also possible to include a non-linear model that has been fit to the data. Here we use a fractional polynomial model. The results are very similar to the linear model for these data.

Fractional fit Overlays

Plotting by sub-groups

Contrasting symbols

Scatterplots are often used to explore possible differences between subgroups. Men and women, for example, may respond differently to a medical treatment. The scatterplot below uses squares as the marker symbol for females and circles as the marker symbol for males. It is difficult to see any patterns by sex in the scatterplot because of the large cluster of overlapping points in the general vicinity of baseline 5 mmol/L and week 26 5 mmol/L.

Subgroup Overlays

Plotting by subgroups in separate panels

An alternative method to compare sub-groups is to plot the data in side-by-side panels. We have also added a linear fit overlay to these plots for this example. The differences between men and women are clearer in this scatterplot. The relationship between baseline and week 26 glucose values appears to be stronger for women, as indicated by the steeper slope of the fitted regression line. This apparent difference may be influenced by the extreme value at about baseline 22 mmol/L and week 26 15 mmol/L.
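The panels on this page were built as two separate graphs and then combined. A shorter route, sketched below under the assumption that the ADLBC glucose records are in memory, is to let the by() option create one panel per value of sex, and to fit the regression separately by sex to compare the slopes numerically. This is an illustrative alternative, not the code used for the figure.

```stata
* Assumes ADLBC is loaded and restricted to glucose tests (lbtestcd=="GLUC")
* One panel per sex, each with its own linear fit
twoway (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)) ///
       (lfit lbstresn blstresn if visit=="WEEK 26"), ///
       by(sex, title("Blood Glucose Lab Results")) ///
       xtitle("Baseline Result (mmol/L)") ///
       ytitle("Week 26 Result (mmol/L)")

* Separate regressions by sex to compare the estimated slopes
bysort sex: regress lbstresn blstresn if visit=="WEEK 26"
```

Because by() uses common axis scales across panels by default, the slopes in the two panels can be compared visually.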

Subgroups - Separate Panels

Sunflower density plots

As mentioned previously, there is a cluster of values in the vicinity of baseline 5 mmol/L and week 26 5 mmol/L. The density of this cluster causes some overstrikes and makes it difficult to see the details of the distribution in a standard scatterplot. In these situations, an alternative approach is to use a bivariate density plot, such as a sunflower density plot [iv]. In a sunflower density plot, the plot area is divided into bins.
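The bin size and the counts at which a bin is drawn as a light or dark sunflower, rather than as individual points, can be set explicitly. The following is a minimal sketch using Stata's sunflower command, assuming the ADLBC glucose records are in memory; the values given to binwidth(), light(), and dark() are illustrative choices, not the settings used for the figure on this page.

```stata
* Assumes ADLBC is loaded and restricted to glucose tests (lbtestcd=="GLUC")
* binwidth(): width of each bin, in units of the x variable
* light():    minimum number of observations for a light sunflower
* dark():     minimum number of observations for a dark sunflower
sunflower lbstresn blstresn if visit=="WEEK 26", ///
    binwidth(1) light(3) dark(13) ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)")
```

Bins with fewer observations than the light() threshold are shown as ordinary points, so sparse regions of the plot still look like a standard scatterplot.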

Sunflower

Scatterplots for Change versus Average Measurement

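This plot, also known as a Bland-Altman or Tukey mean-difference plot, places the difference between the week 26 and baseline glucose values on the vertical axis and the average of the two values on the horizontal axis.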

Change vs Average


[i] Alexandersson, A. 1998. gr32: Confidence ellipses. Stata Technical Bulletin 46: 10-13. In Stata Technical Bulletin Reprints, vol. 8, 54-57. College Station, TX: Stata Press.

[ii] Alexandersson, A. 2004. Graphing confidence ellipses: An update of ellip for Stata 8. Stata Journal 4(3): 242-256.

[iii] McCartin, B. 2003. A Geometric Characterization of Linear Regression. Statistics: A Journal of Theoretical and Applied Statistics 37(2): 101-117.

[iv] Dupont, W. D., and W. D. Plummer Jr. 2003. Density distribution sunflower plots. Journal of Statistical Software 8: 1-11.

Reference:

Code ( ):

%CODE{lang="java"}%
* Code for Examples: Stata 11 Software

********************************************************
*
* Scatterplot Examples for FDA-Industry-Academia Safety Graphics WG
* Richard Forshee, FDA/CBER/OBE
*
* Last updated April 12, 2010
*
********************************************************

clear

cd "C:\Documents and Settings\forshee\My Documents\CTSPedia"

use ADLBC // This data set is available on CTSPedia in SAS Transport format
datasignature confirm // Confirming that the data have not changed

keep if lbtestcd=="GLUC" // Selecting only glucose tests

** Checking summary statistics for quality control
summarize blstresn lbstresn
bysort visit: summarize blstresn lbstresn

stem blstresn if visit=="WEEK 26"
stem lbstresn if visit=="WEEK 26"

** Creating basic scatterplot comparing baseline and week 26 results

twoway (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(off)

graph export scatterbasic.eps, replace

* Generating data for a 50% and 90% confidence ellipse
* Requires --ellip-- from the SSC
** Use --ssc install ellip-- to install

ellip lbstresn blstresn if visit=="WEEK 26", c(f) level(90) g(ey90 ex90)
ellip lbstresn blstresn if visit=="WEEK 26", c(f) level(50) g(ey50 ex50)

** Creating a scatterplot with a 50% and 90% confidence ellipse overlay
twoway ///
    (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)) ///
    (scatter ey90 ex90, connect(l) msymbol(none)) ///
    (scatter ey50 ex50, connect(l) msymbol(none) lpattern(dash)), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) xlabel(0(5)25) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(order(2 3) rows(2) label(2 "90% of observations") ///
    label(3 "50% of observations"))

graph export scatterellip.eps, replace

** Creating a scatterplot with a linear regression overlay
twoway ///
    (lfitci lbstresn blstresn if visit=="WEEK 26", clpattern(blank) ciplot(rline) alpattern(dash)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)) ///
    (lfit lbstresn blstresn if visit=="WEEK 26"), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(row(2) order(4 1) label(1 "95% Confidence Interval of Prediction") ///
    label(4 "Linear Prediction"))

graph export scatterlfit.eps, replace

** Creating a scatterplot with a fractional polynomial overlay
twoway ///
    (fpfitci lbstresn blstresn if visit=="WEEK 26", clpattern(solid) ciplot(rline) alpattern(dash)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26", msize(small)), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(rows(2) order(2 1) label(1 "95% Confidence Interval of Prediction") ///
    label(2 "Fractional Polynomial Prediction"))

graph export scatterfpfit.eps, replace

** Scatterplot comparing men and women (single plot)
twoway ///
    (scatter lbstresn blstresn if visit=="WEEK 26" & sex=="M", msize(small) msymbol(smcircle)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26" & sex=="F", msize(small) msymbol(smsquare)), ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(label(1 "Males") label(2 "Females"))

graph export scattermf.eps, replace

** Scatterplot comparing men and women in two panels with lfit overlay

twoway ///
    (lfitci lbstresn blstresn if visit=="WEEK 26" & sex=="M", clpattern(solid) ciplot(rline) alpattern(dash)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26" & sex=="M", msize(small) msymbol(smcircle)), ///
    title("Blood Glucose Lab Results for Males") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)", margin(medium)) ///
    ytitle("Week 26 Result (mmol/L)", margin(medium)) ///
    xscale(range(0 25)) xlabel(0(5)25) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(order(2 1) label(1 "95% CI") label(2 "Prediction"))

graph save male, replace

twoway ///
    (lfitci lbstresn blstresn if visit=="WEEK 26" & sex=="F", clpattern(solid) ciplot(rline) alpattern(dash)) ///
    (scatter lbstresn blstresn if visit=="WEEK 26" & sex=="F", msize(small) msymbol(smcircle)), ///
    title("Blood Glucose Lab Results for Females") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)", margin(medium)) ///
    ytitle("Week 26 Result (mmol/L)", margin(medium)) ///
    xscale(range(0 25)) xlabel(0(5)25) yscale(range(0 25)) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(order(2 1) label(1 "95% CI") label(2 "Prediction"))

graph save female, replace

graph combine male.gph female.gph
graph export scattermf2panels.eps, replace

** Creating sunflower density plot
twoway sunflower lbstresn blstresn if visit=="WEEK 26", ///
    title("Blood Glucose Lab Results") ///
    subtitle("Baseline Compared to Week 26") ///
    xtitle("Baseline Result (mmol/L)") ///
    ytitle("Week 26 Result (mmol/L)") ///
    xscale(range(0 25)) yscale(range(0 25)) xlabel(0(5)25) ylabel(0(5)25) ///
    aspectratio(1) xsize(3) ysize(4) ///
    legend(rows(3) label(1 "Single Observation"))

graph export sunflower.eps, replace

*
* Plotting Change vs Average
* Bland-Altman or Tukey mean-difference plot
*

gen avg = (lbstresn+blstresn)/2
label var avg "Average glucose at baseline and visit"

gen diff = lbstresn-blstresn
label var diff "Difference in glucose between visit and baseline"

twoway (scatter diff avg if visit=="WEEK 26", msize(small)), ///
    title("Blood Glucose at Baseline and Week 26") ///
    subtitle("Difference versus Average") ///
    xtitle("Average Blood Glucose at Baseline and Week 26 (mmol/L)") ///
    ytitle("Difference Between Blood Glucose" "at Week 26 and Baseline (mmol/L)") ///
    yscale(range(-10(5)10)) ylabel(-10(5)10)

graph export scatterblandaltman.eps, replace

%ENDCODE%