A Beginner’s Guide to Residuals
Residuals are a powerful tool for assessing a regression model’s performance and goodness of fit and for identifying areas for improvement. In this article, we will explore residuals within the context of linear regression, how to code and interpret residual scatter plots, and how residuals can help us improve our understanding of the data.
While there are several types of residuals, including studentized and standardized residuals (along with related diagnostics such as Cook’s distance, which measures a point’s influence rather than being a residual itself), this article will focus solely on the “ordinary” or “unstandardized” residuals. This is the most basic form of residual: the difference between an actual value from your test set and the value your model predicts for it, conventionally computed as actual minus predicted.
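To make the definition concrete, here is a minimal sketch in plain Python. The data is made up for illustration, `fit_line` and `ordinary_residuals` are hypothetical helper names, and the sign convention (actual minus predicted) is the common one rather than anything specific to this article.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def ordinary_residuals(x, y, intercept, slope):
    """One residual per observation: actual value minus predicted value."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up data (illustrative only): the training points lie exactly on y = 1 + 2x.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [3.0, 5.0, 7.0, 9.0, 11.0]
x_test = [6.0, 7.0, 8.0]
y_test = [13.5, 14.0, 17.0]

intercept, slope = fit_line(x_train, y_train)
residuals = ordinary_residuals(x_test, y_test, intercept, slope)
print(residuals)  # [0.5, -1.0, 0.0]
```

A positive residual means the model predicted too low for that point, a negative residual means it predicted too high, and a residual of zero means the prediction matched the actual value.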
By examining these residuals, we can gain insights into the overall accuracy of our model’s predictions and identify any systematic errors in the model. In the regression plot below, the residuals are the vertical distances between the test set data points and the model’s red regression line.
Note: the code for the graphs can be found at the bottom of the article.
While there are many tools that can be used to assess the performance of a regression model, residuals can be particularly helpful. Residuals provide a clear and intuitive measure of a model’s performance and residual plots are a useful way to visualize the model’s goodness of fit and identify any patterns or trends in the data that were not accounted for.
Avoid a common misunderstanding: residuals are not errors. The two terms are commonly used interchangeably, and, while residuals and errors are related concepts, there is a subtle difference between the two.
Residuals are the differences between the actual values from your test set and the values your fitted model predicts for them. Errors are the differences between those actual values and the values given by the true (and unknown) relationship you are trying to model. Because the true relationship is never observed, errors cannot be computed directly. Simply put, residuals can be understood as estimates of the model’s errors.
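The distinction is easiest to see in a simulation, where we know the true relationship because we choose it ourselves. The sketch below assumes a true line of y = 1 + 2x with Gaussian noise (both are assumptions made for illustration). For simplicity it computes residuals on the same data the line was fitted to, rather than on a separate test set.

```python
import random

random.seed(0)

# Assumed "true" relationship, known only because we simulate the data.
TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 2.0

x = [i / 2 for i in range(1, 21)]
errors = [random.gauss(0, 1) for _ in x]  # unobservable in real data
y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + e for xi, e in zip(x, errors)]

# Fit a line by ordinary least squares.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals are measured against the fitted line, not the true one.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"fitted line: y = {intercept:.2f} + {slope:.2f}x")
print(f"sum of errors:    {sum(errors):.3f}")
print(f"sum of residuals: {sum(residuals):.3f}")
```

The fitted intercept and slope will be close to, but not exactly, the true values, so each residual differs slightly from the corresponding error. Residuals from a least squares fit with an intercept also sum to zero (up to rounding) on the data used for fitting, which the errors generally do not.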
While histograms and Q-Q plots can also visualize residuals, scatter plots of residuals offer a simple and intuitive way to interpret the performance of a model. Like linear regression plots, residual scatter plots have the independent variable on the horizontal axis, but where regression plots have the dependent variable on the vertical axis, residual plots have the residual values. Notice that the orange line below (where the residuals equal zero) marks where the model’s predicted value equals the actual value from the test set.
The plot above suggests that the relationship between the independent and dependent variables is linear and that the linear regression model captures this relationship well. This is evident from three key aspects of the plot.
- Homoscedasticity: the residuals have roughly equal variance around the horizontal line at zero across the entire x-axis. If your scatter plot has a cone shape or otherwise shows unequal variance along the x-axis (known as heteroscedasticity), the model has different levels of error for different values of the independent variable.
- No outliers: outliers can heavily impact a linear regression model. If the residual plot shows outliers, they warrant further investigation and, if they turn out to be data errors or otherwise unrepresentative, possible removal.
- No pattern or clusters of points: this indicates that the data does not significantly deviate from the assumptions of a linear regression model.
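Visual inspection is the main tool here, but two of these checks can be roughed out numerically. The sketch below is a heuristic, not a formal statistical test: `spread_ratio` and `flag_outliers` are hypothetical helpers, the 3-standard-deviation threshold is an assumed rule of thumb, and the residual values are made up.

```python
import statistics


def spread_ratio(x, residuals):
    """Residual standard deviation in the upper half of x divided by the lower half.

    Values far from 1 hint at heteroscedasticity (for example, a cone shape).
    """
    pairs = sorted(zip(x, residuals))
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[half:]]
    lower_sd = statistics.pstdev(lower)
    if lower_sd == 0:
        return float("inf")
    return statistics.pstdev(upper) / lower_sd


def flag_outliers(residuals, threshold=3.0):
    """Indices of residuals more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - mean) > threshold * sd]


x = list(range(1, 11))

# Made-up residuals with even spread across x.
even = [0.2, -0.2] * 5
# Made-up residuals whose spread grows with x (a cone shape).
cone = [((-1) ** i) * 0.1 * xi for i, xi in enumerate(x)]
# Made-up residuals with one extreme value at index 7.
with_outlier = [0.2, -0.2] * 10
with_outlier[7] = 5.0

print(spread_ratio(x, even))        # close to 1
print(spread_ratio(x, cone))        # well above 1
print(flag_outliers(with_outlier))  # [7]
```

Splitting the x-axis into only two halves is a deliberately crude choice. It will miss spread that widens and then narrows again, so it is best treated as a prompt to look at the plot rather than as a verdict.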
Overall, the residual scatter plot above provides evidence that the linear regression model is a good fit for the data. Residual scatter plots can be a valuable tool in understanding your data and communicating your findings to others. It’s important to keep in mind, however, that they represent only one aspect of model analysis, and it’s essential to use other evaluation metrics to provide a comprehensive view of model performance.
If your residual scatter plot does not meet these three criteria, consider returning to your model. The patterns in the residuals may reveal areas for improvement in your model — perhaps you need to drop outliers, perform hyperparameter tuning, or transform your data. You may also want to consider alternative regression models, including non-linear models, or even a completely different type of model. By identifying areas where your model is performing well and where it is falling short, you can make the necessary adjustments to improve its overall accuracy and predictive power.
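As one illustration of the “transform your data” option, the sketch below fits a straight line to made-up data that actually follows a curve, then refits after squaring the independent variable. The quadratic relationship, the noise level, and the helper names are all assumptions chosen for the example.

```python
import random


def residuals_for(x, y):
    """Fit y = intercept + slope * x by least squares and return the residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def rms(values):
    """Root mean square: a rough measure of typical residual size."""
    return (sum(v * v for v in values) / len(values)) ** 0.5


random.seed(42)

# Made-up data following a curve: y = 2 + 0.5 * x^2 plus noise.
x = [0.5 * i for i in range(1, 21)]
y = [2.0 + 0.5 * xi ** 2 + random.gauss(0, 0.5) for xi in x]

# Straight-line fit on the raw x values.
res_raw = residuals_for(x, y)

# Same model after transforming the independent variable to x squared.
x_squared = [xi ** 2 for xi in x]
res_transformed = residuals_for(x_squared, y)

print(f"typical residual size, raw x:     {rms(res_raw):.2f}")
print(f"typical residual size, x squared: {rms(res_transformed):.2f}")
print("raw residuals at low, middle, high x:",
      [round(res_raw[i], 2) for i in (0, 10, 19)])
```

With the raw fit, the residuals are positive at both ends of the x-axis and negative in the middle, which is the U-shaped pattern that signals a missed curve. After the transformation the residuals are much smaller and that systematic shape is gone.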
Besides linear regression, residuals can be used to evaluate a variety of machine-learning models that perform regression analysis. Residuals are commonly used in the following types of regression models:
- Polynomial regression: Polynomial regression is an extension of linear regression where the relationship between the predictor variables and the outcome variable is modeled using a polynomial function. Residuals can be used to assess the fit of the polynomial function and identify any patterns or trends in the data that were not accounted for in the model.
- Time series models: Time series models are used to analyze data that changes over time, such as stock prices or weather patterns. Residuals can be used to assess the accuracy of time series models and identify any patterns or trends in the data that were not accounted for in the model.
- Neural networks: Neural networks are a type of machine learning model inspired by the structure and function of the human brain. When a neural network is used for regression, residuals can be used to evaluate its performance, identify issues with the model, and guide hyperparameter tuning to improve the model’s accuracy.
In conclusion, residuals are a powerful tool in regression analysis that can help us evaluate the accuracy of a model. They allow us to detect outliers and influential points as well as assess the overall fit of the model. We can use residuals to improve our models by identifying areas where our model may be underperforming and taking corrective actions such as dropping outliers, transforming data, or using a different type of model altogether. By understanding and utilizing residuals, we can develop more accurate and reliable regression models that can provide insights and predictions for a wide range of applications.
Code for the plots found above.
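What follows is a minimal sketch of how plots like the ones described in this article could be produced. It is not the original listing. It assumes matplotlib is installed, and the function names and the sample data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(x, intercept, slope):
    """Predicted value for each x."""
    return [intercept + slope * xi for xi in x]


def compute_residuals(y_actual, y_pred):
    """Residual = actual minus predicted."""
    return [a - p for a, p in zip(y_actual, y_pred)]


def plot_regression_and_residuals(x_train, y_train, x_test, y_test):
    """Draw a regression plot and a residual scatter plot side by side."""
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it

    intercept, slope = fit_line(x_train, y_train)
    y_pred = predict(x_test, intercept, slope)
    residuals = compute_residuals(y_test, y_pred)

    fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(10, 4))

    # Regression plot: test points, red fitted line, residuals as vertical segments.
    ax_fit.scatter(x_test, y_test, label="Test data")
    line_x = [min(x_test), max(x_test)]
    ax_fit.plot(line_x, predict(line_x, intercept, slope), color="red",
                label="Regression line")
    ax_fit.vlines(x_test, y_pred, y_test, colors="gray", linestyles="dotted")
    ax_fit.set_xlabel("Independent variable")
    ax_fit.set_ylabel("Dependent variable")
    ax_fit.legend()

    # Residual plot: residuals against x, with an orange line at zero.
    ax_res.scatter(x_test, residuals)
    ax_res.axhline(0, color="orange")
    ax_res.set_xlabel("Independent variable")
    ax_res.set_ylabel("Residual (actual - predicted)")

    fig.tight_layout()
    return fig


# Made-up sample data (illustrative only).
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
x_test = [1.5, 2.5, 3.5, 4.5, 5.5]
y_test = [4.2, 5.8, 8.1, 9.9, 12.2]
```

Calling `fig = plot_regression_and_residuals(x_train, y_train, x_test, y_test)` and then `fig.savefig("residuals.png")` writes both plots to a single image file.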