14  Ensemble modeling

Ensembling models can help reduce error by leveraging the diversity and collective wisdom of multiple models. When ensembling, several individual models are trained independently and their predictions are combined to make the final prediction.

We have already seen examples of ensemble models in chapters 5-13. Those ensembles may reduce error by reducing the bias (boosting) and/or the variance (bagging, random forests, boosting).

However, in this chapter we’ll ensemble different types of models, instead of the same type of model. We may ensemble a linear regression model, a random forest, a gradient boosting model, and as many different types of models as we wish.

Below are a couple of reasons why ensembling models can be effective in reducing error:

  1. Bias reduction: Different models may have different biases and the ensemble can help mitigate the individual biases, leading to a more generalized and accurate prediction. For example, consider that one model has a positive bias, and another model has a negative bias for the same instance. By averaging or combining the predictions of the two models, the biases may cancel out.

  2. Variance reduction: As seen in the case of random forests and bagged trees, by averaging or combining the predictions of multiple models, the ensemble can reduce the overall variance and improve the accuracy of the final prediction. Note that for variance reduction, the models should have a low correlation (recall the variance reduction formula of random forests).

We can also show the effectiveness of an ensemble model mathematically. Let's consider the case of regression, and let the predictors be denoted as \(X\), and the response as \(Y\). Let \(f_1, ..., f_m\) be the individual models. The expected MSE of an ensemble that averages their predictions can be written as:

\[ E(MSE_{Ensemble}) = E\bigg[\bigg( \frac{1}{m} \sum_{i = 1}^{m} f_i(X) - Y \bigg)^2 \bigg] = \frac{1}{m^2} \sum_{i = 1}^{m} E \bigg[\big(f_i(X) - Y\big)^2 \bigg] + \frac{1}{m^2} \sum_{i \ne j} E\bigg[\big(f_i(X) - Y\big)\big(f_j(X) - Y\big) \bigg]\]

\[ \implies E(MSE_{Ensemble}) = \frac{1}{m}\bigg(\frac{1}{m} \sum_{i=1}^m E \bigg[\big(f_i(X) - Y\big)^2 \bigg]\bigg) + \frac{1}{m^2} \sum_{i \ne j} E\bigg[\big(f_i(X) - Y\big)\big(f_j(X) - Y\big) \bigg]\]

\[ \implies E(MSE_{Ensemble}) = \frac{1}{m}\bigg(\frac{1}{m} \sum_{i=1}^m E(MSE_{f_i})\bigg) + \frac{1}{m^2} \sum_{i \ne j} E\bigg[\big(f_i(X) - Y\big)\big(f_j(X) - Y\big) \bigg]\]

If \(f_1, ..., f_m\) are unbiased, then,

\[ E(MSE_{Ensemble}) = \frac{1}{m}\bigg(\frac{1}{m} \sum_{i=1}^m E(MSE_{f_i})\bigg) + \frac{1}{m^2} \sum_{i \ne j} Cov(f_i(X), f_j(X))\]

Assuming the models are uncorrelated (i.e., they have a zero correlation), the second term (covariance of \(f_i(.)\) and \(f_j(.)\)) reduces to zero, and the expected MSE of the ensemble reduces to:

\[ E(MSE_{Ensemble}) = \frac{1}{m}\bigg(\frac{1}{m} \sum_{i=1}^m E(MSE_{f_i})\bigg) \tag{14.1}\]

Thus, the expected MSE of an ensemble of uncorrelated models is much smaller than the average MSE of the individual models. Unless one model is much better than the rest, the MSE of the ensemble is likely to be lower than the MSE of each individual model. However, there is no guarantee of this. Consider the extreme case in which exactly one of the models has zero MSE; that model's MSE will be lower than the expected MSE of the ensemble.
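As a quick, rough check of Equation 14.1, the simulation below (a synthetic example, unrelated to the datasets used in this chapter) creates m unbiased models whose errors are independent, and compares the MSE of their average with the average of their individual MSEs.

# Simulation sketch for Eq. 14.1 (synthetic data): averaging m unbiased models
# with independent errors shrinks the MSE by roughly a factor of m
import numpy as np

rng = np.random.default_rng(1)
n, m = 100000, 5                                  # observations, models
y_true = rng.normal(size = n)                     # true response
errors = rng.normal(scale = 2.0, size = (n, m))   # independent, zero-mean errors
preds = y_true[:, None] + errors                  # each column is one model's predictions

mse_individual = ((preds - y_true[:, None])**2).mean(axis = 0)
mse_ensemble = ((preds.mean(axis = 1) - y_true)**2).mean()
print("Average individual MSE =", mse_individual.mean().round(2))   # close to 4
print("Ensemble MSE =", mse_ensemble.round(2))                      # close to 4 / m = 0.8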

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score,train_test_split, GridSearchCV, ParameterGrid, \
StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import mean_squared_error,r2_score,roc_curve,auc,precision_recall_curve, accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.ensemble import VotingRegressor, VotingClassifier, StackingRegressor, \
StackingClassifier, GradientBoostingRegressor,GradientBoostingClassifier, BaggingRegressor, \
BaggingClassifier,RandomForestRegressor,RandomForestClassifier,AdaBoostRegressor,AdaBoostClassifier
from sklearn.linear_model import LinearRegression,LogisticRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.neighbors import KNeighborsRegressor
import itertools as it
import time as time
import xgboost as xgb
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
#Using the same datasets as used for linear regression in STAT303-2, 
#so that we can compare the non-linear models with linear regression
trainf = pd.read_csv('./Datasets/Car_features_train.csv')
trainp = pd.read_csv('./Datasets/Car_prices_train.csv')
testf = pd.read_csv('./Datasets/Car_features_test.csv')
testp = pd.read_csv('./Datasets/Car_prices_test.csv')
train = pd.merge(trainf,trainp)
test = pd.merge(testf,testp)
train.head()
carID brand model year transmission mileage fuelType tax mpg engineSize price
0 18473 bmw 6 Series 2020 Semi-Auto 11 Diesel 145 53.3282 3.0 37980
1 15064 bmw 6 Series 2019 Semi-Auto 10813 Diesel 145 53.0430 3.0 33980
2 18268 bmw 6 Series 2020 Semi-Auto 6 Diesel 145 53.4379 3.0 36850
3 18480 bmw 6 Series 2017 Semi-Auto 18895 Diesel 145 51.5140 3.0 25998
4 18492 bmw 6 Series 2015 Automatic 62953 Diesel 160 51.4903 3.0 18990
X = train[['mileage','mpg','year','engineSize']]
Xtest = test[['mileage','mpg','year','engineSize']]
y = train['price']
ytest = test['price']

14.1 Ensembling regression models

14.1.1 Voting Regressor

Here, we will combine the predictions of different models. VotingRegressor() averages the predictions of all the models, with equal weights by default.

Below are the individual models tuned in the previous chapters.

#Tuned AdaBoost model from Section 7.2.4
model_ada = AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=10),n_estimators=50,
                    learning_rate=1.0,  random_state=1).fit(X, y)
print("RMSE for AdaBoost = ", np.sqrt(mean_squared_error(model_ada.predict(Xtest), ytest)))

#Tuned Random forest model from Section 6.1.2
model_rf = RandomForestRegressor(n_estimators=300, random_state=1,
                        n_jobs=-1, max_features=2).fit(X, y)
print("RMSE for Random forest = ", np.sqrt(mean_squared_error(model_rf.predict(Xtest), ytest)))

# Tuned XGBoost model from Section 9.2.6
model_xgb = xgb.XGBRegressor(random_state=1,max_depth=8,n_estimators=1000, subsample = 0.75, colsample_bytree = 1.0,
                                         learning_rate = 0.01,reg_lambda=1, gamma = 100).fit(X, y)
print("RMSE for XGBoost = ", np.sqrt(mean_squared_error(model_xgb.predict(Xtest), ytest)))

#Tuned gradient boosting model from Section 8.2.5
model_gb = GradientBoostingRegressor(max_depth=8,n_estimators=100,learning_rate=0.1,
                         random_state=1,loss='huber').fit(X, y)
print("RMSE for Gradient Boosting = ", np.sqrt(mean_squared_error(model_gb.predict(Xtest), ytest)))

# Tuned Light GBM model from Section 13.1.1
model_lgbm = LGBMRegressor(subsample = 0.5, reg_lambda = 0, reg_alpha = 100, boosting_type = 'goss',
            num_leaves = 31, n_estimators = 500, learning_rate = 0.05, colsample_bytree = 1.0,
                          top_rate = 0.5).fit(X, y)
print("RMSE for LightGBM = ", np.sqrt(mean_squared_error(model_lgbm.predict(Xtest), ytest)))

# Tuned CatBoost model from Section 13.2.3
model_cat = CatBoostRegressor(subsample=0.5, num_leaves=40, n_estimators=500, max_depth=10, 
                              verbose = False, learning_rate = 0.05, colsample_bylevel=0.75, 
                              grow_policy='Lossguide', random_state = 1).fit(X, y)
print("RMSE for CatBoost = ", np.sqrt(mean_squared_error(model_cat.predict(Xtest), ytest)))
RMSE for AdaBoost =  5693.165811600585
RMSE for Random forest =  5642.45839697972
RMSE for XGBoost =  5497.553788113875
RMSE for Gradient Boosting =  5405.787029062213
RMSE for LightGBM =  5355.964600884197
RMSE for CatBoost =  5271.104736146779

Note that we don’t need to fit the models individually before combining them in the voting ensemble; the ensemble fits (a clone of) each estimator itself, so fitting the models beforehand unnecessarily wastes time.

Let us ensemble the models using the voting ensemble with equal weights.

#Voting ensemble: Averaging the predictions of all models

#Tuned AdaBoost model from Section 7.2.4
model_ada = AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=10),
                    n_estimators=50,learning_rate=1.0,  random_state=1)

#Tuned Random forest model from Section 6.1.2
model_rf = RandomForestRegressor(n_estimators=300, random_state=1,
                        n_jobs=-1, max_features=2)

# Tuned XGBoost model from Section 9.2.6
model_xgb = xgb.XGBRegressor(random_state=1,max_depth=8,n_estimators=1000, subsample = 0.75, 
                colsample_bytree = 1.0, learning_rate = 0.01,reg_lambda=1, gamma = 100)

#Tuned gradient boosting model from Section 8.2.5
model_gb = GradientBoostingRegressor(max_depth=8,n_estimators=100,learning_rate=0.1,
                         random_state=1,loss='huber')

# Tuned CatBoost model from Section 13.2.3
model_cat = CatBoostRegressor(subsample=0.5, num_leaves=40, n_estimators=500, max_depth=10,
                             learning_rate = 0.05, colsample_bylevel=0.75, grow_policy='Lossguide',
                             random_state=1, verbose = False)

# Tuned Light GBM model from Section 13.1.1
model_lgbm = LGBMRegressor(subsample = 0.5, reg_lambda = 0, reg_alpha = 100, boosting_type = 'goss',
                           num_leaves = 31, n_estimators = 500, learning_rate = 0.05, 
                           colsample_bytree = 1.0, top_rate = 0.5)

start_time = time.time()
en = VotingRegressor(estimators = [('xgb',model_xgb),('ada',model_ada),('rf',model_rf),
                    ('gb',model_gb), ('cat', model_cat), ('lgbm', model_lgbm)], n_jobs = -1)
en.fit(X,y)
print("Ensemble model RMSE = ", np.sqrt(mean_squared_error(en.predict(Xtest),ytest)))
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
Ensemble model RMSE =  5259.899392611916
Time taken =  0.21 minutes

As expected, RMSE of the ensembled model is less than that of each of the individual models.

Note that the RMSE can be further improved by removing the weaker models from the ensemble. Let us remove the three weakest models - XGBoost, Random forest, and AdaBoost.

#Voting ensemble: Averaging the predictions of the three strongest models

start_time = time.time()
en = VotingRegressor(estimators = [('gb',model_gb), ('cat', model_cat), ('lgbm', model_lgbm)], n_jobs = -1)
en.fit(X,y)
print("Ensemble model RMSE = ", np.sqrt(mean_squared_error(en.predict(Xtest),ytest)))
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
Ensemble model RMSE =  5191.814866810768
Time taken =  0.18 minutes
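VotingRegressor() also accepts a weights argument, so the averaging need not be equal. The sketch below is only an illustration: the weights are chosen arbitrarily, upweighting CatBoost and LightGBM relative to gradient boosting. Stacking, discussed next, learns such weights from the data instead of fixing them by hand.

#Voting ensemble with manually chosen (unequal) weights - an illustrative sketch
en_weighted = VotingRegressor(estimators = [('gb',model_gb), ('cat', model_cat), ('lgbm', model_lgbm)],
                              weights = [1, 2, 2], n_jobs = -1)
en_weighted.fit(X,y)
print("Weighted ensemble model RMSE = ", np.sqrt(mean_squared_error(en_weighted.predict(Xtest),ytest)))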

14.1.2 Stacking Regressor

Stacking is a more sophisticated method of ensembling models. The method is as follows (a by-hand sketch follows the list):

  1. The training data is split into K folds. Each of the K folds serves as test data in one of the K iterations, while the remaining folds serve as training data.

  2. Each model is trained on the remaining K-1 folds and used to make predictions on the held-out fold. In this manner, each model predicts the response at every training point without having been trained on that point.

  3. The predictions generated in step 2 are then used as predictors to train a meta-model (specified by the final_estimator argument), with the original response as the response. The meta-model (or final_estimator) learns how to combine the predictions of the different models into a better prediction.
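To make the procedure concrete, here is a rough by-hand sketch of these three steps using cross_val_predict() and two of the tuned models defined above. It mirrors, but is not identical to, what StackingRegressor() does internally, and it relies on the X, y, Xtest, and ytest objects created earlier.

#By-hand sketch of stacking with two base models and a linear regression metamodel
from sklearn.model_selection import cross_val_predict

kf = KFold(n_splits = 5, shuffle = True, random_state = 1)

#Steps 1 and 2: out-of-fold predictions - each training point is predicted by a
#model that was not trained on it
oof_preds = pd.DataFrame({'gb': cross_val_predict(model_gb, X, y, cv = kf),
                          'cat': cross_val_predict(model_cat, X, y, cv = kf)})

#Step 3: train the metamodel on the out-of-fold predictions
meta_model = LinearRegression().fit(oof_preds, y)

#At prediction time, the base models are refit on the full training data and their
#test predictions are passed to the metamodel
test_preds = pd.DataFrame({'gb': model_gb.fit(X, y).predict(Xtest),
                           'cat': model_cat.fit(X, y).predict(Xtest)})
print("By-hand stacking RMSE = ", np.sqrt(mean_squared_error(meta_model.predict(test_preds), ytest)))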

14.1.2.1 Metamodel: Linear regression

#Stacking using LinearRegression as the metamodel
en = StackingRegressor(estimators = [('xgb', model_xgb),('ada', model_ada),('rf', model_rf),
                                     ('gb', model_gb), ('cat', model_cat), ('lgbm', model_lgbm)],
                     final_estimator=LinearRegression(),                                          
                    cv = KFold(n_splits = 5, shuffle = True, random_state=1))
start_time = time.time()
en.fit(X,y)
print("Linear regression metamodel RMSE = ", np.sqrt(mean_squared_error(en.predict(Xtest),ytest)))
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
Linear regression metamodel RMSE =  5220.456280327686
Time taken =  2.03 minutes
#Co-efficients of the meta-model
en.final_estimator_.coef_
array([ 0.05502964,  0.14566665,  0.01093624,  0.30478283,  0.57403909,
       -0.07057344])
sum(en.final_estimator_.coef_)
1.0198810182715363

Note the above coefficients of the meta-model. The model gives the highest weights to the CatBoost model and the gradient boosting model (with Huber loss), a near-zero weight to the relatively weak random forest model, and a small negative weight to LightGBM.

Also, note that the coefficients need not sum to one.

Let us try to improve the RMSE further by removing the weaker models from the ensemble, i.e., the three models with the smallest coefficients (in magnitude) in the linear regression metamodel.

#Stacking using LinearRegression as the metamodel
en = StackingRegressor(estimators = [('gb', model_gb), ('cat', model_cat), ('ada', model_ada)],
                     final_estimator=LinearRegression(),                                          
                    cv = KFold(n_splits = 5, shuffle = True, random_state=1))
start_time = time.time()
en.fit(X,y)
print("Linear regression metamodel RMSE = ", np.sqrt(mean_squared_error(en.predict(Xtest),ytest)))
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
Linear regression metamodel RMSE =  5205.225710180056
Time taken =  1.36 minutes

The accuracy of the stacked model improves further (the RMSE decreases) when only the stronger models are ensembled.

#Co-efficients of the meta-model
en.final_estimator_.coef_
array([0.31824119, 0.54231032, 0.15998634])
sum(en.final_estimator_.coef_)
1.020537847948332

14.1.2.2 Metamodel: Lasso

#Stacking using Lasso as the metamodel
en = StackingRegressor(estimators = [('xgb', model_xgb),('ada', model_ada),('rf', model_rf),
                        ('gb', model_gb),('cat', model_cat), ('lgbm', model_lgbm) ],
                     final_estimator = LassoCV(),                                          
                    cv = KFold(n_splits = 5, shuffle = True, random_state=1))
start_time = time.time()
en.fit(X,y)
print("Lasso metamodel RMSE = ", np.sqrt(mean_squared_error(en.predict(Xtest),ytest)))
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
Lasso metamodel RMSE =  5206.021083501416
Time taken =  2.05 minutes
#Coefficients of the lasso metamodel
en.final_estimator_.coef_
array([ 0.03524446,  0.15077605,  0.        ,  0.30392268,  0.52946243,
       -0.        ])

Note that lasso shrinks the weights of the weak random forest model and of the LightGBM model to zero. Even though LightGBM is a strong model, its predictions may be highly correlated (collinear) with those of XGBoost or the other models, and hence it is not needed.

Also note that, since lasso performs model selection on its own, removing the models with zero weights makes essentially no difference, as shown below.

#Stacking using Lasso as the metamodel
en = StackingRegressor(estimators = [('xgb', model_xgb),('ada', model_ada),
                        ('gb', model_gb),('cat', model_cat) ],
                     final_estimator = LassoCV(),                                          
                    cv = KFold(n_splits = 5, shuffle = True, random_state=1))
start_time = time.time()
en.fit(X,y)
print("Lasso metamodel RMSE = ", np.sqrt(mean_squared_error(en.predict(Xtest),ytest)))
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
Lasso metamodel RMSE =  5205.93233977352
Time taken =  1.79 minutes
#Coefficients of the lasso metamodel
en.final_estimator_.coef_
array([0.03415944, 0.15053122, 0.30464838, 0.53006297])

14.1.2.3 Metamodel: Random forest

A highly flexible model such as a random forest may not be a good choice of metamodel for ensembling correlated models. However, let us tune the random forest metamodel and check its accuracy.

# Tuning hyperparameter of the random forest meta-model
start_time = time.time()
oob_score_i = []
for i in range(1, 7):
    en = StackingRegressor(estimators = [('xgb', model_xgb),('ada', model_ada),('rf', model_rf),
                        ('gb', model_gb),('cat', model_cat), ('lgbm', model_lgbm)],
                     final_estimator = RandomForestRegressor(max_features = i, oob_score = True),                                          
                    cv = KFold(n_splits = 5, shuffle = True, random_state=1)).fit(X,y)
    oob_score_i.append(en.final_estimator_.oob_score_)
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
Time taken =  12.08 minutes
print("Optimal value of max_features =", np.array(oob_score_i).argmax() + 1)
Optimal value of max_features = 1
# Training the tuned random forest metamodel
start_time = time.time()
en = StackingRegressor(estimators = [('xgb', model_xgb),('ada', model_ada),
                    ('rf', model_rf), ('gb', model_gb),('cat', model_cat), 
                    ('lgbm', model_lgbm)],
                final_estimator = RandomForestRegressor(max_features = 1, 
                n_estimators=500), cv = KFold(n_splits = 5, shuffle = True, 
                random_state=1)).fit(X,y)
print("Random Forest metamodel RMSE = ", np.sqrt(mean_squared_error(en.predict(Xtest),ytest)))
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
Random Forest metamodel RMSE =  5441.9155087961
Time taken =  1.71 minutes

Note that a highly flexible metamodel may not be needed when its predictors (the base-model predictions) are already highly correlated with the response. However, in some cases a flexible metamodel can be useful, as in the classification example later in this chapter.
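To see why a simple metamodel suffices here, the quick sketch below refits three of the tuned models defined earlier and prints the correlation matrix of their test predictions with the price; the base-model predictions are, in effect, the metamodel's predictors.

#Correlations of base-model predictions with each other and with the response (sketch)
pred_corr = pd.DataFrame({'gb': model_gb.fit(X, y).predict(Xtest),
                          'cat': model_cat.fit(X, y).predict(Xtest),
                          'lgbm': model_lgbm.fit(X, y).predict(Xtest),
                          'price': ytest.values})
print(pred_corr.corr().round(2))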

14.1.2.4 Metamodel: CatBoost

#Stacking using CatBoost as the meta-model
en = StackingRegressor(estimators = [('xgb', model_xgb),('ada', model_ada),('rf', model_rf),
                        ('gb', model_gb),('cat', model_cat), ('lgbm', model_lgbm)],
                     final_estimator = CatBoostRegressor(verbose = False),
                    cv = KFold(n_splits = 5, shuffle = True, random_state=1))
start_time = time.time()
en.fit(X,y)
print("CatBoost metamodel RMSE = ", np.sqrt(mean_squared_error(en.predict(Xtest),ytest)))
print("Time taken = ", np.round((time.time() - start_time)/60,2), "minutes")
CatBoost metamodel RMSE =  5828.803609683251
Time taken =  1.66 minutes

14.2 Ensembling classification models

We’ll ensemble models to predict whether or not a person has heart disease.

data = pd.read_csv('./Datasets/Heart.csv')
data.dropna(inplace = True)
#Response variable
y = pd.get_dummies(data['AHD'])['Yes']

#Creating a dataframe for predictors with dummy variables replacing the categorical variables
X = data.drop(columns = ['AHD','ChestPain','Thal'])
X = pd.concat([X,pd.get_dummies(data['ChestPain']),pd.get_dummies(data['Thal'])],axis=1)

#Creating train and test datasets
Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,train_size = 0.5,random_state=1)

Let us tune the individual models first.

AdaBoost

# Tuning Adaboost for maximizing accuracy
model = AdaBoostClassifier(random_state=1)
grid = dict()
grid['n_estimators'] = [10, 50, 100,200,500]
grid['learning_rate'] = [0.0001, 0.001, 0.01,0.1, 1.0]
grid['base_estimator'] = [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2), 
                          DecisionTreeClassifier(max_depth=3),DecisionTreeClassifier(max_depth=4)]
# define the evaluation procedure
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# define the grid search procedure
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',refit='accuracy')
# execute the grid search
grid_result = grid_search.fit(Xtrain, ytrain)
# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Best: 0.871494 using {'base_estimator': DecisionTreeClassifier(max_depth=1), 'learning_rate': 0.01, 'n_estimators': 200}

Gradient Boosting

# Tuning gradient boosting for maximizing accuracy
model = GradientBoostingClassifier(random_state=1)
grid = dict()
grid['n_estimators'] = [10, 50, 100,200,500]
grid['learning_rate'] = [0.0001, 0.001, 0.01,0.1, 1.0]
grid['max_depth'] = [1,2,3,4,5]
grid['subsample'] = [0.5,1.0]
# define the evaluation procedure
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# define the grid search procedure
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',refit='accuracy')
# execute the grid search
grid_result = grid_search.fit(Xtrain, ytrain)
# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Best: 0.871954 using {'learning_rate': 1.0, 'max_depth': 4, 'n_estimators': 100, 'subsample': 1.0}

XGBoost

# Tuning XGBoost for maximizing accuracy
start_time = time.time()
param_grid = {'n_estimators':[25, 100,250,500],
                'max_depth': [4, 6 ,8],
              'learning_rate': [0.01,0.1,0.2],
               'gamma': [0, 1, 10, 100],
               'reg_lambda':[0, 10, 100],
               'subsample': [0.5, 0.75, 1.0],
                'scale_pos_weight':[1.25,1.5,1.75] #Controls the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances).
             }

cv = StratifiedKFold(n_splits=5,shuffle=True,random_state=1)
optimal_params = GridSearchCV(estimator=xgb.XGBClassifier(random_state=1),
                             param_grid = param_grid,
                             scoring = 'accuracy',
                             verbose = 1,
                             n_jobs=-1,
                             cv = cv)
optimal_params.fit(Xtrain,ytrain)
print(optimal_params.best_params_,optimal_params.best_score_)
print("Time taken = ", (time.time()-start_time)/60, " minutes")
Fitting 5 folds for each of 972 candidates, totalling 4860 fits
{'gamma': 0, 'learning_rate': 0.2, 'max_depth': 4, 'n_estimators': 25, 'reg_lambda': 0, 'scale_pos_weight': 1.25} 0.872183908045977
Time taken =  0.9524135629336039  minutes
#Tuned Adaboost model
model_ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=200, 
                               random_state=1,learning_rate=0.01).fit(Xtrain, ytrain)    
test_accuracy_ada = model_ada.score(Xtest,ytest) #Returns the classification accuracy of the model on test data
    
#Tuned Random forest model from Section 6.3
model_rf = RandomForestClassifier(n_estimators=500, random_state=1,max_features=3,
                        n_jobs=-1,oob_score=False).fit(Xtrain, ytrain)
test_accuracy_rf = model_rf.score(Xtest,ytest) #Returns the classification accuracy of the model on test data
    
#Tuned gradient boosting model
model_gb = GradientBoostingClassifier(n_estimators=100, random_state=1,max_depth=4,learning_rate=1.0,
                                     subsample = 1.0).fit(Xtrain, ytrain)
test_accuracy_gb = model_gb.score(Xtest,ytest) #Returns the classification accuracy of the model on test data

#Tuned XGBoost model
model_xgb = xgb.XGBClassifier(random_state=1,gamma=0,learning_rate = 0.2,max_depth=4,
                              n_estimators = 25,reg_lambda = 0,scale_pos_weight=1.25).fit(Xtrain,ytrain)
test_accuracy_xgb = model_xgb.score(Xtest,ytest) #Returns the classification accuracy of the model on test data

print("Adaboost accuracy = ",test_accuracy_ada)
print("Random forest accuracy = ",test_accuracy_rf)
print("Gradient boost accuracy = ",test_accuracy_gb)
print("XGBoost model accuracy = ",test_accuracy_xgb)
Adaboost accuracy =  0.7986577181208053
Random forest accuracy =  0.8120805369127517
Gradient boost accuracy =  0.7986577181208053
XGBoost model accuracy =  0.7785234899328859

14.2.1 Voting classifier - hard voting

In this type of ensembling, the predicted class is the one predicted by the majority of the classifiers.

ensemble_model = VotingClassifier(estimators=[('ada',model_ada),('rf',model_rf),('gb',model_gb),('xgb',model_xgb)])
ensemble_model.fit(Xtrain,ytrain)
ensemble_model.score(Xtest, ytest)
0.825503355704698

Note that the prediction accuracy of the ensemble is higher than the prediction accuracy of each of the individual models on unseen data.
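To make the mechanics concrete, the sketch below reproduces hard voting by hand from the base models fitted earlier: with 0/1 labels, the majority class is 1 exactly when the mean of the predicted labels exceeds 0.5, and a 2-2 tie goes to class 0, matching scikit-learn's tie-breaking. The resulting accuracy should match, or come very close to, the VotingClassifier() result above.

#Hard voting by hand (sketch): majority vote over the predicted labels of the fitted base models
label_preds = np.column_stack([m.predict(Xtest) for m in [model_ada, model_rf, model_gb, model_xgb]]).astype(int)
majority_vote = (label_preds.mean(axis = 1) > 0.5).astype(int)
print("By-hand hard voting accuracy = ", accuracy_score(ytest.astype(int), majority_vote))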

14.2.2 Voting classifier - soft voting

In this type of ensembling, the predicted class is the one with the highest average predicted probability across the classifiers. For binary classification, this corresponds to a probability threshold of 0.5.

ensemble_model = VotingClassifier(estimators=[('ada',model_ada),('rf',model_rf),('gb',model_gb),('xgb',model_xgb)],
                                 voting='soft')
ensemble_model.fit(Xtrain,ytrain)
ensemble_model.score(Xtest, ytest)
0.7919463087248322

Note that soft voting works well only for well-calibrated classifiers, i.e., classifiers whose predicted probabilities are on a comparable scale.
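The soft-voting computation itself is easy to reproduce by hand: average the class-1 probabilities of the fitted base models and threshold the average at 0.5, as in the sketch below.

#Soft voting by hand (sketch): average the predicted probabilities of class 1
avg_proba = np.mean([m.predict_proba(Xtest)[:, 1] for m in [model_ada, model_rf, model_gb, model_xgb]], axis = 0)
print("By-hand soft voting accuracy = ", accuracy_score(ytest.astype(int), (avg_proba > 0.5).astype(int)))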

14.2.3 Stacking classifier

Conceptually, the idea is the same as that of the stacking regressor.

#Using Logistic regression as the meta model (final_estimator)
ensemble_model = StackingClassifier(estimators=[('ada',model_ada),('rf',model_rf),('gb',model_gb),('xgb',model_xgb)],
                                   final_estimator=LogisticRegression(random_state=1,max_iter=10000),n_jobs=-1,
                                   cv = StratifiedKFold(n_splits=5,shuffle=True,random_state=1))
ensemble_model.fit(Xtrain,ytrain)
ensemble_model.score(Xtest, ytest)
0.7986577181208053
#Coefficients of the logistic regression metamodel
ensemble_model.final_estimator_.coef_
array([[0.81748051, 1.28663164, 1.64593342, 1.50947087]])
#Using random forests as the meta model (final_estimator). Note that random forest will require tuning
ensemble_model = StackingClassifier(estimators=[('ada',model_ada),('rf',model_rf),('gb',model_gb),('xgb',model_xgb)],
                                   final_estimator=RandomForestClassifier(n_estimators=500, max_features=1,
                                                                          random_state=1,oob_score=True),n_jobs=-1,
                                   cv = StratifiedKFold(n_splits=5,shuffle=True,random_state=1))
ensemble_model.fit(Xtrain,ytrain)
ensemble_model.score(Xtest, ytest)
0.8322147651006712

Note that a complex final_estimator such as random forest will require tuning. In the above case, the max_features argument of random forests has been tuned to obtain the maximum OOB score. The tuning is shown below.

#Tuning the random forest parameters
start_time = time.time()
oob_score = {}

i=0
for pr in range(1,5):
    model = StackingClassifier(estimators=[('ada',model_ada),('rf',model_rf),('gb',model_gb),('xgb',model_xgb)],
                                   final_estimator=RandomForestClassifier(n_estimators=500, max_features=pr,
                                    random_state=1,oob_score=True),n_jobs=-1,
                                   cv = StratifiedKFold(n_splits=5,shuffle=True,random_state=1)).fit(Xtrain, ytrain)
    oob_score[pr] = model.final_estimator_.oob_score_
    
end_time = time.time()
print("time taken = ", (end_time-start_time)/60, " minutes")
print("max accuracy = ", np.max(list(oob_score.values())))
print("Best value of max_features= ", np.argmax(list(oob_score.values()))+1)
time taken =  0.33713538646698  minutes
max accuracy =  0.8445945945945946
Best value of max_features=  1
#The final predictor (metamodel) - random forest obtains the maximum oob_score for max_features = 1
oob_score
{1: 0.8445945945945946,
 2: 0.831081081081081,
 3: 0.8378378378378378,
 4: 0.831081081081081}

14.2.4 Tuning all models simultaneously

Individual model hyperparameters can be tuned simultaneously while ensembling the models with VotingClassifier(), by prefixing each hyperparameter name with the name of its estimator in the parameter grid. However, this approach can be too expensive for even moderately-sized datasets.

# Create the param grid with the names of the models as prefixes

model_ada = AdaBoostClassifier(base_estimator = DecisionTreeClassifier())
model_rf = RandomForestClassifier()
model_gb = GradientBoostingClassifier()
model_xgb = xgb.XGBClassifier()

ensemble_model = VotingClassifier(estimators=[('ada',model_ada),('rf',model_rf),('gb',model_gb),('xgb',model_xgb)])

hp_grid = dict()

# XGBoost
hp_grid['xgb__n_estimators'] = [25, 100,250,50]
hp_grid['xgb__max_depth'] = [4, 6 ,8]
hp_grid['xgb__learning_rate'] = [0.01, 0.1, 1.0]
hp_grid['xgb__gamma'] = [0, 1, 10, 100]
hp_grid['xgb__reg_lambda'] = [0, 1, 10, 100]
hp_grid['xgb__subsample'] = [0.5, 0.75, 1.0]
hp_grid['xgb__scale_pos_weight'] = [1.0, 1.25, 1.5]
hp_grid['xgb__colsample_bytree'] = [0.5, 0.75, 1.0]

# AdaBoost
hp_grid['ada__n_estimators'] = [10, 50, 100,200,500]
hp_grid['ada__base_estimator__max_depth'] = [1, 3, 5]
hp_grid['ada__learning_rate'] = [0.01, 0.1, 0.2]

# Random Forest
hp_grid['rf__n_estimators'] = [100]
hp_grid['rf__max_features'] = [3, 6, 9, 12, 15]

# GradBoost
hp_grid['gb__n_estimators'] = [10, 50, 100,200,500]
hp_grid['gb__max_depth'] = [1, 3, 5]
hp_grid['gb__learning_rate'] = [0.01, 0.1, 0.2, 1.0]
hp_grid['gb__subsample'] = [0.5, 0.75, 1.0]

start_time = time.time()
grid = RandomizedSearchCV(ensemble_model, hp_grid, cv=5, scoring='accuracy', verbose = True,
                         n_iter = 100, n_jobs=-1).fit(Xtrain, ytrain)
print("Time taken = ", round((time.time()-start_time)/60), " minutes")
grid.best_estimator_.score(Xtest, ytest)
0.8120805369127517

14.3 Ensembling models based on different sets of predictors

Generally, boosted tree models such as CatBoost and XGBoost are the most accurate, while other models, such as bagged trees, random forests, KNN, and linear models, may not be as accurate. Thus, the weaker models, despite bringing diversity to the ensemble, can sometimes deteriorate the ensemble accuracy due to their poor individual performance (check slides for technical details). Another way to bring diversity to the ensemble is to develop strong models based on different sets of predictors, and ensemble them.

Different feature selection methods (such as lasso, feature importances returned by tree-based methods, stepwise k-fold feature selection, etc.) may be used to obtain different sets of important features; strong models can then be tuned on each set and ensembled. Even though the models may be of the same type, the different sets of predictors help bring diversity to the ensemble.

trainf = pd.read_csv('./Datasets/Car_features_train.csv')
trainp = pd.read_csv('./Datasets/Car_prices_train.csv')
testf = pd.read_csv('./Datasets/Car_features_test.csv')
testp = pd.read_csv('./Datasets/Car_prices_test.csv')
train = pd.merge(trainf,trainp)
test = pd.merge(testf,testp)
train.head()
carID brand model year transmission mileage fuelType tax mpg engineSize price
0 18473 bmw 6 Series 2020 Semi-Auto 11 Diesel 145 53.3282 3.0 37980
1 15064 bmw 6 Series 2019 Semi-Auto 10813 Diesel 145 53.0430 3.0 33980
2 18268 bmw 6 Series 2020 Semi-Auto 6 Diesel 145 53.4379 3.0 36850
3 18480 bmw 6 Series 2017 Semi-Auto 18895 Diesel 145 51.5140 3.0 25998
4 18492 bmw 6 Series 2015 Automatic 62953 Diesel 160 51.4903 3.0 18990
X = train[['mileage','mpg','year','engineSize']]
Xtest = test[['mileage','mpg','year','engineSize']]
y = train['price']
ytest = test['price']

We will create second-order polynomial and interaction terms so that we have two sets of predictors: the original first-order predictors, and the full set of second-order predictors.

poly_set = PolynomialFeatures(2, include_bias = False)
X_poly = poly_set.fit_transform(X)
X_poly = pd.DataFrame(X_poly, columns=poly_set.get_feature_names_out())
X_poly.columns = X_poly.columns.str.replace("^", "_", regex=True)
Xtest_poly = poly_set.transform(Xtest)
Xtest_poly = pd.DataFrame(Xtest_poly, columns=poly_set.get_feature_names_out())
Xtest_poly.columns = Xtest_poly.columns.str.replace("^", "_", regex=True)

Let us use 2 different sets of predictors to introduce diversity in the ensemble.

col_set1 = ['mileage','mpg', 'year','engineSize']
col_set2 = X_poly.columns

Let us use two types of strong tree-based models.

cat = CatBoostRegressor(verbose=False)
gb = GradientBoostingRegressor(loss = 'huber')

We will use Pipeline() along with ColumnTransformer() to map a predictor set to each model.

cat_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat1_transform', 'passthrough', col_set1)], remainder='drop')),
    ('cat1', cat)
])

cat_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat2_transform', 'passthrough', col_set2)], remainder='drop')),
    ('cat2', cat)
])

gb_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('gb1_transform', 'passthrough', col_set1)], remainder='drop')),
    ('gb1', gb)
])

gb_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('gb2_transform', 'passthrough', col_set2)], remainder='drop')),
    ('gb2', gb)
])

We will use Linear regression to ensemble the models.

en_new = StackingRegressor(estimators = [('cat1', cat_pipe1),('cat2', cat_pipe2),
                                        ('gb1', gb_pipe1), ('gb2', gb_pipe2)],
                     final_estimator=LinearRegression(),
                    cv = KFold(n_splits = 15, shuffle = True, random_state=1))
en_new.fit(X_poly, y)
#Coefficients of the meta-model
en_new.final_estimator_.coef_
array([ 0.30127482,  0.79242981, -0.07168258, -0.01781781])
mean_squared_error(en_new.predict(Xtest_poly), ytest, squared = False)
5185.376722607323

Note that the above model does better on test data than all the models developed so far. Using different sets of predictors introduces diversity in the ensemble, and is an alternative to including “weaker” models solely for the sake of diversity.

See this idea being used in the Spring 2023 prediction problem in the appendix.