Appendix C — Miscellaneous questions

Q1

Why is boosting inappropriate for linear regression, but appropriate for decision trees?

The question has been well answered in the post. The intuitive explanation is that a weighted sum of linear regression models is itself a single linear regression model. Now suppose boosting produced a combined linear model (call it boosted_linear_regression) that differed from the model obtained by fitting linear regression directly to the data (call it regular_linear_regression). Since regular_linear_regression minimizes the sum of squared errors (SSE) over all linear models, boosted_linear_regression could only have an equal or higher training SSE. Hence, with optimally tuned hyperparameters, the boosting algorithm can at best reproduce regular_linear_regression. All the hard work of tuning the boosting model leads, at best, to the same linear regression model that could have been obtained by fitting directly to the training data! The sketch below illustrates this collapse numerically.
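As a rough illustration (not part of the original argument), here is a minimal sketch of gradient boosting with linear base learners, assuming NumPy and scikit-learn are available. Each stage fits a linear model to the current residuals; the accumulated coefficients converge to the ordinary least-squares solution, so the ensemble collapses into the directly fitted model. The variable names and hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Direct OLS fit on the data.
direct = LinearRegression().fit(X, y)

# Hand-rolled boosting loop with linear base learners:
# repeatedly fit a linear model to the current residuals
# and add a damped copy of it to the ensemble.
learning_rate = 0.1
residual = y.copy()
coef = np.zeros(X.shape[1])
intercept = 0.0
for _ in range(500):
    stage = LinearRegression().fit(X, residual)
    coef += learning_rate * stage.coef_
    intercept += learning_rate * stage.intercept_
    residual = y - (X @ coef + intercept)

# The boosted ensemble is itself a single linear model, and its
# coefficients converge to the direct OLS solution.
print(np.allclose(coef, direct.coef_, atol=1e-6))           # True
print(np.isclose(intercept, direct.intercept_, atol=1e-6))  # True
```

The convergence is expected: each stage's fit is a projection of the residuals onto the column space of X, so the part of the residual that a linear model can explain shrinks geometrically while the orthogonal part is untouched.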

A sequence of shallow regression trees, by contrast, does not reduce to a single tree that could be grown directly. Each boosting stage reduces the remaining bias with relatively little added variance, because each shallow tree is a high-bias, low-variance weak learner. A single fully grown decision tree, on the other hand, has near-zero bias at the cost of high variance. Boosting with shallow trees may therefore generalize better, as the comparison below illustrates.
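As a quick sanity check (again, not from the original post), the following sketch compares a single unrestricted tree against boosted shallow trees on scikit-learn's synthetic Friedman #1 regression problem. The dataset choice and hyperparameter values are assumptions for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single fully grown tree: very low bias on the training data,
# but high variance.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Boosted shallow trees: each stage is a high-bias, low-variance
# weak learner; the sequence of stages drives the bias down.
boosted = GradientBoostingRegressor(
    max_depth=2, n_estimators=500, learning_rate=0.1, random_state=0
).fit(X_train, y_train)

print("single deep tree MSE:", mean_squared_error(y_test, deep_tree.predict(X_test)))
print("boosted shallow  MSE:", mean_squared_error(y_test, boosted.predict(X_test)))
```

On data of this kind the boosted shallow trees should show a noticeably lower test error than the single deep tree, consistent with the bias-variance argument above.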

The second response in the post provides a mathematical version of this argument, which is more convincing.