Assignment 5

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code separately, not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Tuesday, 4th June 2024 at 11:59 pm.

  5. Five points are allotted to properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering with Quarto in the comments section of Canvas, and submit the .ipynb file.
    • No name or other indicator of the student’s identity may appear in the submission (e.g., printouts of the working directory should not be included). (1 point)
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
  6. The maximum possible score in the assignment is 95 + 5 = 100 out of 100.

1) Conceptual (9 points)

a)

Is it possible for an ensemble model (Voting or Stacking) to perform worse than one or more of its base models? (1 point) Why or why not? (4 points)

b)

If the ensemble model (Voting or Stacking) does perform worse than one or more of its base models, then what should be the course of action? (4 points)

2) Regression with Ensembles (45 points)

For this question, you will use the miami_housing.csv file. You can find the description for the variables here.

The SALE_PRC variable is the regression response and the rest of the variables, except PARCELNO, are the predictors.

a)

Read the dataset. Create the training and test sets with a 60%-40% split and random_state = 1. (1 point)
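
A minimal sketch of part (a), assuming the file sits in the working directory and uses the column names given above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data; SALE_PRC is the response, PARCELNO is dropped as an identifier.
data = pd.read_csv("miami_housing.csv")
X = data.drop(columns=["SALE_PRC", "PARCELNO"])
y = data["SALE_PRC"]

# 60%-40% train-test split with the required random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)
```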

b)

Recreate and train all of the following tuned regression models (5x0.5 = 2.5 points), setting all random_state values to 1:

  • Bagged Trees from Assignment 3
  • Random Forest from Assignment 3
  • AdaBoost from Assignment 4
  • Gradient Boosting from Assignment 4
  • XGBoost from Assignment 4

Note that there will not be any cross-validation, since the models are already tuned, i.e. created with the best hyperparameters found in the previous assignments.

Print the test MAE of all five tuned models. (5x0.5 = 2.5 points)
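
A sketch of part (b) is below. The hyperparameter values shown are placeholders, not the tuned values; substitute the best hyperparameters you found in Assignments 3 and 4.

```python
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# n_estimators=100 is a placeholder -- use your tuned hyperparameters here.
models = {
    "Bagged trees": BaggingRegressor(n_estimators=100, random_state=1),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=1),
    "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=1),
    "Gradient boosting": GradientBoostingRegressor(n_estimators=100, random_state=1),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=1),
}

# Fit each tuned model on the training set and report its test MAE.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test MAE = {mean_absolute_error(y_test, model.predict(X_test)):,.0f}")
```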

c)

Train a Voting Ensemble Regressor with all five models in part b. Note that all the models are separately tuned, which means the Voting Ensemble is tuned (in a greedy way). Print the test MAE of the Voting Ensemble. (5 points) Is it better than all of the base models? (1 point)
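
A sketch of part (c), reusing the `models` dictionary from the part (b) sketch; VotingRegressor refits clones of the base models and averages their predictions:

```python
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import mean_absolute_error

voting = VotingRegressor(estimators=list(models.items()))
voting.fit(X_train, y_train)
print("Voting ensemble test MAE =",
      mean_absolute_error(y_test, voting.predict(X_test)))
```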

d)

Retrain the Voting Ensemble Regressor with the two models that return the lowest test MAEs in part b. Print the test MAE and compare it with the results in parts b and c. (3 points)

e) Stacking ensemble with Linear regression

Develop a linear regression metamodel based on the models in Q2(b). Report the MAE of the metamodel on test data. Which model has the highest weight in the ensemble?

Note:

  1. You may use the StackingRegressor() function. However, as the next set of questions asks you to develop different metamodels based on the models in 2(b), using StackingRegressor() will be inefficient, as it will involve fitting each of the individual models every time it is called.

  2. A faster way is to use the cross_val_predict() function to compute the 5-fold cross-validated predictions from each of the models in 2(b), treat these predictions from the 5 models as 5 different predictors, and fit the metamodel on them. Once computed, these cross-validated predictions can be reused with different metamodels without fitting the individual models repeatedly, as StackingRegressor() would. A sketch of this approach is shown after these notes.

  3. If you are using the faster way (in (2)), use a 5-fold KFold object with random_state=1 and shuffle=True for the base model predictions.

(5 points)
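
A sketch of the faster approach described in the note, assuming the fitted `models` dictionary from the part (b) sketch. The 5-fold out-of-fold predictions become the predictors of a LinearRegression metamodel, and the base models' test-set predictions feed the metamodel at prediction time:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict

kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Out-of-fold predictions on the training set: one column per base model
oof_preds = pd.DataFrame({
    name: cross_val_predict(model, X_train, y_train, cv=kf)
    for name, model in models.items()
})

# Test-set predictions of the base models (already fitted on the full training set)
test_preds = pd.DataFrame({
    name: model.predict(X_test) for name, model in models.items()
})

meta = LinearRegression()
meta.fit(oof_preds, y_train)
print("Stacking (linear regression) test MAE =",
      mean_absolute_error(y_test, meta.predict(test_preds)))
```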

f)

Print the weights of the base models of the Stacking Ensemble in 2(e). (2 points)
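
With the part (e) sketch above, the base-model weights are simply the linear metamodel's coefficients:

```python
import pandas as pd

# One coefficient per base model = that model's weight in the stack
print(pd.Series(meta.coef_, index=oof_preds.columns))
```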

g)

Train and tune a Stacking Ensemble Regressor with all five models in 2(b) and Lasso as the metamodel. Use a 5-fold KFold object with random_state=1 and shuffle=True both for the base model predictions and for the cross-validation of the ensemble model. Try out [0.001, 0.01, 0.1, 1, 10, 100] for the Lasso hyperparameter. (3 points)

Print the best CV and test MAE. (1 point)

Since you are optimizing for MAE, will you use LassoCV or grid search, and why? (2 points)
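
One possible sketch of part (g), reusing `oof_preds` and `test_preds` from the part (e) sketch and scoring the search on MAE:

```python
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold

lasso_grid = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1, 10, 100]},
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)
lasso_grid.fit(oof_preds, y_train)

print("Best CV MAE =", -lasso_grid.best_score_)
print("Test MAE =", mean_absolute_error(y_test, lasso_grid.predict(test_preds)))
```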

h)

Print the weights of the base models of the Stacking Ensemble in part g. (1 point)

i)

Train and tune a Stacking Ensemble Regressor with all five models in 2(b) and a Decision Tree (random_state=1) as the metamodel. Use a 5-fold KFold object with random_state=1 and shuffle=True both for the base model predictions and for the cross-validation of the ensemble model. Try out max_depth values from 2 to 10 (inclusive) for the tree hyperparameter. (5 points)

Print the best CV and test MAE. (1 point)
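
A sketch of part (i) along the same lines, again reusing `oof_preds` and `test_preds` from the part (e) sketch:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

tree_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid={"max_depth": list(range(2, 11))},  # 2 through 10 inclusive
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)
tree_grid.fit(oof_preds, y_train)

print("Best CV MAE =", -tree_grid.best_score_)
print("Test MAE =", mean_absolute_error(y_test, tree_grid.predict(test_preds)))
```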

j)

Print the importances of the base models of the Stacking Ensemble in part i. (2 points)

k)

Compare the weights and importances you found in 2(f), 2(h), and 2(j). Considering the base model performances, do they make sense? (2 points)

l)

Using the three Stacking Ensemble models you created in parts 2(e), 2(g), and 2(i), create an “ensemble of ensembles” that uses the three Stacking Ensemble models as the base models of a Voting Ensemble Regressor. (5 points) Print the test MAE. (1 point)
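
A sketch of part (l), assuming the three metamodels (`meta`, `lasso_grid`, `tree_grid`) from the earlier sketches. Because all three operate on the same base-model predictions, an unweighted Voting Ensemble of them amounts to averaging their predictions, which is what VotingRegressor does by default:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Average the three stacking ensembles' test predictions (equal weights)
stacked_preds = np.column_stack([
    meta.predict(test_preds),
    lasso_grid.predict(test_preds),
    tree_grid.predict(test_preds),
])
print("Ensemble-of-ensembles test MAE =",
      mean_absolute_error(y_test, stacked_preds.mean(axis=1)))
```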

3) Classification with Ensembles (41 points)

For this question, you will use the train.csv and test.csv files. Each observation is a marketing call from a banking institution. The y variable indicates whether the client subscribed to a term deposit (1) or not (0), and it is the classification response.

The predictors are age, day, month, and education. (As mentioned last quarter, duration cannot be used as a predictor; no credit will be given to models that use it.)

a)

Preprocess the data:

  • Read the files.
  • Create the predictor and response variables.
  • Convert the response to 1s and 0s.
  • One-hot-encode the categorical predictors (Do not use drop_first.)

(1 point)
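
A sketch of the preprocessing, assuming the response is coded as "yes"/"no" in the raw files:

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

predictors = ["age", "day", "month", "education"]

# One-hot-encode the categorical predictors; numeric columns pass through unchanged.
X_train = pd.get_dummies(train[predictors])
X_test = pd.get_dummies(test[predictors])
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)  # align dummy columns

# Convert the response to 1s and 0s (assumes "yes"/"no" coding)
y_train = (train["y"] == "yes").astype(int)
y_test = (test["y"] == "yes").astype(int)
```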

b)

Train a hard Voting Ensemble Classifier with all the following tuned models and their tuned thresholds:

  • Random Forest from Assignment 3
  • LightGBM from Assignment 4
  • CatBoost from Assignment 4

Print the test accuracy and test recall of the ensemble. (8 points)
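
A sketch of part (b). VotingClassifier's hard voting uses each model's default 0.5 threshold, so a manual majority vote is one way to respect the tuned thresholds. The model names and threshold values below are placeholders for the tuned models and thresholds from Assignments 3 and 4:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Placeholders: rf_tuned, lgbm_tuned, cat_tuned are the fitted tuned models,
# and the thresholds are the ones you tuned previously.
clf_models = {
    "Random forest": (rf_tuned, 0.30),
    "LightGBM": (lgbm_tuned, 0.25),
    "CatBoost": (cat_tuned, 0.28),
}

# Each model casts a 0/1 vote using its own threshold; the majority wins.
votes = np.column_stack([
    (model.predict_proba(X_test)[:, 1] >= threshold).astype(int)
    for model, threshold in clf_models.values()
])
y_pred = (votes.sum(axis=1) >= 2).astype(int)

print("Test accuracy:", accuracy_score(y_test, y_pred))
print("Test recall:", recall_score(y_test, y_pred))
```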

c)

Using the base models in part b, train a soft Voting Classifier. Note that you should not use the tuned thresholds for the base models, but tune the threshold by cross-validating the prediction probabilities of the ensemble. (10 points)

Print the best cross-validation accuracy with a recall above 60%, along with the threshold that returns that combination. (2 points)
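
A sketch of part (c), reusing the `clf_models` dictionary from the part (b) sketch; the 5-fold split for the ensemble probabilities is an assumption, since the question does not fix the number of folds:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import cross_val_predict

soft_voter = VotingClassifier(
    estimators=[(name, model) for name, (model, _) in clf_models.items()],
    voting="soft",
)
soft_voter.fit(X_train, y_train)

# Cross-validated probabilities of the ensemble on the training set
cv_probs = cross_val_predict(soft_voter, X_train, y_train,
                             cv=5, method="predict_proba")[:, 1]

# Pick the threshold maximizing accuracy subject to recall above 60%
best_acc, best_threshold = 0.0, None
for threshold in np.arange(0.05, 0.95, 0.01):
    pred = (cv_probs >= threshold).astype(int)
    if recall_score(y_train, pred) > 0.60 and accuracy_score(y_train, pred) > best_acc:
        best_acc, best_threshold = accuracy_score(y_train, pred), threshold

print("Best CV accuracy with recall above 60%:", best_acc,
      "at threshold", best_threshold)
```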

d)

Using the trained Soft Voting Classifier in part c and the tuned threshold, print the test accuracy and test recall. (2 points)

e)

Using the base models in 3(b), train a Stacking Classifier using logistic regression as the metamodel. Tune both the hyperparameter and the threshold to reach a cross-validation accuracy of 70% and a cross-validation recall of 60%. Print the optimal value of the regularization parameter and threshold.

(13 points)

f)

Using the trained Stacking Classifier in 3(e) and the tuned threshold, print the test accuracy and test recall. (2 points)

g)

Return the weights of the base models in the Stacking Classifier. (2 points) Which base model seems to be the most important for the ensemble? (1 point)