So far we have conducted this analysis without checking whether the data meet the assumptions underlying an ordinary least squares (OLS) linear regression. We will now briefly explore three main assumptions: normality, homogeneity of variance (homoscedasticity) and independence. Normality of residuals is only required for valid hypothesis testing, where we need to ensure the p-values are valid; it is not required to obtain unbiased estimates of the regression coefficients. OLS requires that the residuals are independently and identically distributed, i.e. the observed error (the residual) is random.
First, we will formally test the normality of the residuals to determine whether our analysis can be used for valid hypothesis testing. After running our final regression analysis, we can use the ‘predict’ command with the ‘resid’ option to calculate the residuals. We can store these residual values as a variable, which in this case we will call bmi_iq2, and then use this variable to check the residuals’ normality.
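As a minimal sketch (assuming the final model has already been fitted with ‘regress’; the outcome and predictor names in the comment are placeholders, not the actual variable names in the dataset):

    * fit the final model first, e.g.  regress <outcome> <predictors>
    predict bmi_iq2, resid
    summarize bmi_iq2, detail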
We can plot the residuals against a normal distribution, using either the ‘pnorm’ command (which is sensitive to non-normality in the middle range of the data) or the ‘qnorm’ command (which is sensitive to non-normality near the tails). We are going to use the ‘qnorm’ method, as we suspect that BMI is non-normal at the tails of the distribution. Previous research indicates that the distribution of BMI is not symmetrical but is skewed to the right, towards a higher ratio of body mass to height squared (i.e. higher BMI).
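A hedged sketch of this graphical check, using the residual variable created above:

    qnorm bmi_iq2    // quantiles of the residuals against quantiles of a normal distribution
    * pnorm bmi_iq2  // alternative: standardised normal probability plot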
In the resulting plot, the ‘qnorm’ command has plotted the quantiles of the residuals of BMI at age 42 (the thicker dotted line) against the quantiles of a normal distribution (the thin diagonal line). If the two lines were exactly the same, the residuals of BMI at age 42 would be normally distributed. The plot shows that the residuals of BMI at age 42 deviate from the normal distribution, particularly at the upper tail, and are therefore not normally distributed.
To test for normality numerically, we can use the ‘swilk’ command. This performs the Shapiro-Wilk test, whose null hypothesis is that the data are normally distributed.
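A minimal sketch, again applied to the stored residuals:

    swilk bmi_iq2    // Shapiro-Wilk W test for normality of the residuals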
In the ‘swilk’ output, we can see that the test’s p-value is <.001, so we can reject the null hypothesis that the residuals in the model are normally distributed. Therefore, our general linear regression model is not appropriate for valid hypothesis testing. Regression models that categorise the outcome variable, BMI at age 42, into the top and/or bottom tails may better reflect the distribution of the data. For example, the top tail of the distribution represents higher BMI, so transforming our continuous variable into a dichotomous variable (such as ‘obese’ versus ‘not obese’) would capture this feature of the distribution. Likewise, if we were interested in lower BMI, transforming the bottom tail of the distribution into an ‘underweight’ versus ‘not underweight’ dichotomous variable would capture the opposite end of the distribution.
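A hedged sketch of this dichotomisation, assuming a hypothetical outcome variable named bmi42 and the conventional WHO cut-points (obese: BMI of 30 or above; underweight: BMI below 18.5), neither of which is specified above:

    generate obese = (bmi42 >= 30) if !missing(bmi42)
    generate underweight = (bmi42 < 18.5) if !missing(bmi42)
    * each binary outcome could then be modelled with logistic regression, e.g.
    * logistic obese <predictors>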
A commonly used graphical method for evaluating the model fit is to plot the residuals against the predicted values. If the model is well-fitted, there should be no pattern evident in the plot. We can create such a plot by using the ‘rvfplot’ command.
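After the final ‘regress’ command, this plot can be produced directly with the postestimation command:

    rvfplot, yline(0)    // residuals versus fitted values, with a reference line at zero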
We can see that the spread of the data points widens towards the right of the plot, which indicates that the model is not well fitted: the variance of the residuals is not constant across the fitted values (i.e. the homoscedasticity assumption is violated). This implies that our linear regression model would not predict BMI at age 42 with consistent accuracy across both low and high values of BMI.
The assumption of independence states that the errors associated with one observation are not correlated with the errors of any other observation. This assumption is often violated when measures of the same variable, such as an individual’s BMI, are collected over time; measurements taken closer together in time are especially likely to be more highly correlated. However, in this example we note that an individual’s BMI at age 11 may be very different from their BMI at age 42, some 31 years later.