Linear Splines

Lead Author(s): Peter Bacchetti, PhD

A linear spline is used in regression models to allow a predictor to have a non-linear effect on the outcome. This is useful when there is evidence against the linearity assumption or when the predictor's effect is of enough interest to warrant more flexible modeling. Instead of a single slope, the model fits a line whose slope is allowed to change at specified points, called knots, so that V-shaped, U-shaped, S-shaped, and other non-linear relationships can be modeled. An advantage of this approach is that there is still an interpretable coefficient within each range of the predictor between knots. To fit the model, we create a new predictor variable for each range of the original predictor between knots, and each fitted regression coefficient then estimates the effect of the predictor within the corresponding range.

For example, if we are modeling the effect of age on systolic blood pressure in a linear regression model, we can use the three predictors:

agePre40 = min(age, 40)
age40to60 = max(0, min(age-40, 20))
age60up = max(0, age-60)

This will fit a linear spline with knots at 40 and 60. The variable agePre40 ranges from the minimum observed age up to 40 and is equal to 40 for anyone who is over 40. Its coefficient estimates the effect per year of age within the under-40 age range. The variable age40to60 ranges from 0 to 20; it is 0 for anyone aged 40 or less, increases from 0 to 20 within the 40-to-60 age range, and is equal to 20 for anyone aged 60 or more. Its coefficient estimates the effect per year of age in the 40-to-60 age range. The variable age60up is equal to 0 for anyone aged 60 or less and is equal to age minus 60 for anyone over age 60. Its coefficient estimates the effect per year of age within the over-60 age range.
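The construction of the three predictors can be sketched in a few lines of Python. This is only an illustration of the definitions above; the function name age_spline_terms is made up for this example and is not part of any package.

```python
def age_spline_terms(age):
    """Return (agePre40, age40to60, age60up) for a spline with knots at 40 and 60."""
    age_pre40 = min(age, 40)                 # rises with age, then stays at 40
    age_40to60 = max(0, min(age - 40, 20))   # 0 below 40, rises to 20, then stays at 20
    age_60up = max(0, age - 60)              # 0 up to 60, then age minus 60
    return age_pre40, age_40to60, age_60up


for age in (25, 40, 50, 60, 75):
    print(age, age_spline_terms(age))
# 25 (25, 0, 0)
# 40 (40, 0, 0)
# 50 (40, 10, 0)
# 60 (40, 20, 0)
# 75 (40, 20, 15)
```

Note that the three variables always add up to age itself, so a model in which all three coefficients are equal is just the ordinary straight-line model.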

Suppose we obtain the following estimated coefficients:

0.10 for agePre40
0.45 for age40to60
0.92 for age60up

The interpretation is that systolic blood pressure is estimated to increase by 0.10 for each 1-year increase in age up to age 40, by 0.45 for each 1-year increase in age from 40 to 60, and by 0.92 for each 1-year increase in age after age 60. So the estimated difference between age 30 and age 45 would be 10 × 0.10 + 5 × 0.45 = 3.25, and the estimated difference between age 25 and age 75 would be 15 × 0.10 + 20 × 0.45 + 15 × 0.92 = 24.3.
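The arithmetic for these estimated differences can be checked with a short Python sketch. It uses the example coefficients and omits the intercept and any other predictors, since these cancel when two ages are compared. The helper names (age_contribution, estimated_difference) are hypothetical and chosen only for this illustration.

```python
# Example coefficients from the text (effect per year of age within each range)
COEF = {"agePre40": 0.10, "age40to60": 0.45, "age60up": 0.92}


def spline_terms(age):
    """Linear spline predictors for knots at 40 and 60."""
    return {
        "agePre40": min(age, 40),
        "age40to60": max(0, min(age - 40, 20)),
        "age60up": max(0, age - 60),
    }


def age_contribution(age):
    """Age-related part of the prediction (intercept and other predictors omitted)."""
    return sum(COEF[name] * value for name, value in spline_terms(age).items())


def estimated_difference(age_from, age_to):
    """Estimated difference in the outcome between two ages."""
    return age_contribution(age_to) - age_contribution(age_from)


print(round(estimated_difference(30, 45), 2))  # 3.25
print(round(estimated_difference(25, 75), 2))  # 24.3
```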

Using linear splines requires deciding how many knots to have and where to put them. Models with different numbers of knots can be compared in terms of how well they fit the data. Statistical criteria such as the Akaike Information Criterion can be used to decide between simpler (fewer knots) and more complex (more knots) models. Knots are typically placed at natural break points (e.g., decades of age), are evenly spaced in terms of the predictor's values (e.g., 15, 30, 45, 60), or are evenly spaced in terms of quantiles in the data set (e.g., 3 knots at the quartiles of the predictor's distribution).
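As a rough sketch of how such a comparison might look, the Python code below fits linear splines with different knot choices by least squares and compares them by AIC. Several things here are assumptions made for illustration and do not come from a real study: the data are synthetic, the intercept of 110 is arbitrary, a small sine wave stands in for random noise so that the example is reproducible, and the helper names are hypothetical. The AIC is computed with the usual formula for a linear model with normal errors, n ln(RSS/n) + 2k, which omits an additive constant that is the same for all models fitted to the same data. The code assumes NumPy is available.

```python
import numpy as np


def spline_design(x, knots):
    """Design matrix: intercept plus one linear spline term per range between knots."""
    x = np.asarray(x, dtype=float)
    edges = [-np.inf, *sorted(knots), np.inf]
    cols = [np.ones_like(x)]
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        capped = np.minimum(x, hi)
        # The first term keeps the raw scale; later terms start at 0 at their lower knot.
        cols.append(capped if i == 0 else np.maximum(0.0, capped - lo))
    return np.column_stack(cols)


def gaussian_aic(x, y, knots):
    """AIC, up to an additive constant, for a least-squares fit with the given knots."""
    X = spline_design(x, knots)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, k = X.shape
    return n * np.log(rss / n) + 2 * (k + 1)  # +1 counts the error variance


# Synthetic illustration data (assumed, not from any real data set)
age = np.arange(20, 81, dtype=float)
true_mean = (110 + 0.10 * np.minimum(age, 40)
             + 0.45 * np.clip(age - 40, 0, 20)
             + 0.92 * np.maximum(0, age - 60))
sbp = true_mean + 0.1 * np.sin(age)  # deterministic wiggle standing in for noise

candidates = {
    "no knots": [],
    "knots at 40, 60": [40, 60],
    "knots at quartiles": list(np.quantile(age, [0.25, 0.5, 0.75])),
}
aic = {name: gaussian_aic(age, sbp, knots) for name, knots in candidates.items()}
best = min(aic, key=aic.get)  # lower AIC is better
print(best)
```

Because the synthetic data were built with changes of slope at 40 and 60, the model with those knots has the lowest AIC here. With real data the preferred model will depend on the data, and AIC differences are a guide rather than a definitive rule.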

Alternative approaches to modeling non-linear effects of a predictor include fitting polynomial models and breaking the predictor into categories.
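For comparison, the predictors that these two alternatives would use can be sketched as follows. The function names, the choice of a quadratic polynomial, and the cut points at 40 and 60 are assumptions made only to mirror the age example above.

```python
def polynomial_terms(age, degree=2):
    """Predictors age, age^2, ..., age^degree for a polynomial model."""
    return [age ** d for d in range(1, degree + 1)]


def category_indicators(age, cutpoints=(40, 60)):
    """0/1 indicators for each category above the lowest (reference) category."""
    uppers = (*cutpoints[1:], float("inf"))
    return [int(lo < age <= hi) for lo, hi in zip(cutpoints, uppers)]


print(polynomial_terms(50))      # [50, 2500]
print(category_indicators(30))   # [0, 0]  (reference category: 40 or less)
print(category_indicators(50))   # [1, 0]  (over 40, up to 60)
print(category_indicators(75))   # [0, 1]  (over 60)
```

Unlike the linear spline, the categorical version implies a jump in the predicted outcome at each cut point and no effect of age within a category, while the polynomial coefficients do not correspond to an effect within any particular age range.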

There are also higher order splines, notably cubic splines. These produce smooth fits to the data (no abrupt change of direction), but the coefficients do not have any simple interpretation and the fit usually must be illustrated by graphing.