Linearity Assumption

Lead Author(s): Peter Bacchetti, PhD

In regression models, numeric predictors are assumed to have a linear association with the outcome. In mathematical terms, with a predictor X and an outcome Y, this means a model like:

Y = a + bX + <other predictors' effects>, where a and b are constants.

(For some types of models, a function of Y is used instead of Y itself, like the log odds for logistic regression or log hazard for Cox proportional hazards regression.)
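
As a concrete illustration, here is a minimal Python sketch of fitting such a model by ordinary least squares. The variable names and simulated data are hypothetical, chosen only to show the form of the model, and statsmodels is used purely for convenience.

```python
# Minimal sketch (hypothetical variable names and simulated data, not from any
# real study): fitting a model of the form Y = a + b*X + c*Z, where the effect
# of X on Y is assumed to be linear.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 100, n)          # numeric predictor of interest
Z = rng.normal(size=n)              # another predictor in the model
Y = 2.0 + 0.5 * X + 1.0 * Z + rng.normal(scale=5.0, size=n)  # simulated outcome

design = sm.add_constant(np.column_stack([X, Z]))  # columns: intercept, X, Z
fit = sm.OLS(Y, design).fit()
print(fit.params)   # estimates of a, b, and the coefficient for Z
```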

A less mathematical way to think about this assumption is that each 1-unit difference in the predictor is associated with the same effect on the outcome, regardless of whether it is a difference between 0 and 1 or between 100 and 101. Similarly, each 10-point difference is associated with the same effect on the outcome, regardless of whether it is a difference between 5 and 15, between 150 and 160, or between 2345 and 2355.
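
A small worked sketch of this point, using hypothetical values of a and b: under Y = a + bX, the predicted difference in Y depends only on the size of the difference in X, not on where along the X scale it occurs.

```python
# Minimal sketch (hypothetical constants): under Y = a + b*X, the predicted
# difference in Y for any 1-unit difference in X is always b, no matter
# where on the X scale the difference occurs.
a, b = 2.0, 0.5   # illustrative constants

def predicted_y(x):
    return a + b * x

print(predicted_y(1) - predicted_y(0))      # 0.5
print(predicted_y(101) - predicted_y(100))  # 0.5
print(predicted_y(15) - predicted_y(5))     # 5.0 (a 10-point difference)
print(predicted_y(160) - predicted_y(150))  # 5.0
```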

A simple way to relax this assumption is to add a quadratic term for X to the model, Xsquared = X times X. If the P-value for Xsquared is <0.05, this is usually considered strong evidence that the linearity assumption is inaccurate. In that case, a more flexible model can be fitted by categorizing X, using linear splines, or simply retaining Xsquared in the model. Categorizing X or using linear splines provides interpretable regression coefficients, whereas retaining Xsquared or fitting a higher-order polynomial will usually require a graphical depiction of the effect of X to aid interpretation.
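
Below is a minimal Python sketch of this check and of one way to relax the assumption. The simulated data, knot locations, and variable names are hypothetical choices for illustration, not a recommended recipe.

```python
# Minimal sketch (hypothetical simulated data): checking the linearity
# assumption by adding a quadratic term for X, then relaxing it with linear
# splines if the quadratic term is significant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(0, 100, n)
Y = 2.0 + 0.5 * X - 0.003 * X**2 + rng.normal(scale=5.0, size=n)  # curved truth

# Check: add Xsquared to the model and examine its P-value.
quad = sm.add_constant(np.column_stack([X, X**2]))
quad_fit = sm.OLS(Y, quad).fit()
print(quad_fit.pvalues[2])   # P-value for the Xsquared coefficient

# One flexible alternative: linear splines with knots at (for illustration) 33 and 66.
knots = [33, 66]
spline_terms = np.column_stack([X] + [np.maximum(X - k, 0) for k in knots])
spline_fit = sm.OLS(Y, sm.add_constant(spline_terms)).fit()
print(spline_fit.params)     # slope below the first knot, then changes in slope at each knot
```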