Assignment 4

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solutions are written neatly enough for the graders to understand and follow.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Monday, 4th March 2024 at 11:59 pm.

  5. There is a bonus question worth 11 points.

  6. Five points are allocated to properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering with Quarto in the comments section of Canvas, and submit the .ipynb file.
    • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
  7. The maximum possible score in the assignment is 99 + 11 + 5 = 115 out of 100.

1) Modeling the Radii of Exoplanets (40 points)

For this question, we are interested in predicting the radius of exoplanets (planets outside the Solar System) in kilometers. To achieve this goal, we will use NASA’s Composite Planetary Systems dataset and different regression models. (See https://exoplanetarchive.ipac.caltech.edu for more context.)

Read all three CompositePlanetarySystems datasets; you should have one training and two test datasets. Each row is an exoplanet. The pl_rade column represents the radius of each exoplanet as a proportion of Earth's radius, which is approximately 6,378 km.

a)

Develop a linear regression model (no non-linear terms) to predict pl_rade using all the variables in the data except pl_name, disc_facility and disc_locale. You can use statsmodels or sklearn. (2 points)
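
A minimal sketch of one possible setup, assuming hypothetical filenames (adjust to the actual CompositePlanetarySystems files):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical filename - replace with the actual training file
train = pd.read_csv("CompositePlanetarySystems_train.csv")

# Drop the excluded columns and separate the response
X_train = train.drop(columns=["pl_name", "disc_facility", "disc_locale", "pl_rade"])
y_train = train["pl_rade"]

lin_model = LinearRegression().fit(X_train, y_train)
```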

b)

Find the RMSE of the model using both test sets separately. (You need to print two RMSE values.) Note that the library you used should not make a difference here! (2 points)

Print the training RMSE as well for reference. (1 point)
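
One way to compute the RMSEs, assuming X_test1/y_test1 and X_test2/y_test2 were built from the two test files in the same way as the training data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE = square root of the mean squared error, for each dataset
for name, X, y in [("train", X_train, y_train),
                   ("test1", X_test1, y_test1),
                   ("test2", X_test2, y_test2)]:
    rmse = np.sqrt(mean_squared_error(y, lin_model.predict(X)))
    print(f"{name} RMSE: {rmse:.4f}")
```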

c)

Compare the training and test RMSEs. (1 point) What is the issue with this model? (1 point)

d)

Train a Ridge regression model to predict pl_rade using the same variables as above. Optimize the hyperparameter using the RidgeCV object with LOOCV and neg_root_mean_squared_error scoring. What is the optimal hyperparameter value? (5 points) A sketch follows the notes below.

Note:

  • Keep in mind that scaling is always necessary before Ridge/Lasso regression.
  • Use the following array of possible hyperparameter values: alphas = np.logspace(2,0.5,200)
  • You have to use the RidgeCV object for this question.
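
A minimal sketch of the RidgeCV setup, reusing the variable names from the earlier sketches (leaving cv at its default of None gives leave-one-out CV):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then transform every set
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test1_s = scaler.transform(X_test1)
X_test2_s = scaler.transform(X_test2)

# cv=None (the default) makes RidgeCV use leave-one-out CV
ridge_cv = RidgeCV(alphas=np.logspace(2, 0.5, 200),
                   scoring="neg_root_mean_squared_error").fit(X_train_s, y_train)
print(ridge_cv.alpha_)  # optimal hyperparameter value
```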

e)

Using the optimized and trained model, print the RMSEs for the training set and both test sets. (4 points)

f)

How did the training and test performance change? Explain why Ridge regression changed the training and test results. (3 points)

g)

Find the predictor whose coefficient is shrunk by far the most by Ridge regularization. (3 points)

Hint: .coef_ and .columns attributes should be helpful.
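
One possible approach, refitting plain linear regression on the scaled predictors so that its coefficients are directly comparable with the Ridge coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Both models fit on the same scaled predictors -> comparable coefficients
lin_scaled = LinearRegression().fit(X_train_s, y_train)
shrinkage = np.abs(lin_scaled.coef_) - np.abs(ridge_cv.coef_)
print(X_train.columns[np.argmax(shrinkage)])  # most-shrunk predictor
```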

h)

Why did the coefficient of the predictor identified in the previous part shrink the most? Justify your answer for credit. (2 points)

Hint: Correlation vector/matrix

i)

Visualize how the coefficients change as the hyperparameter value changes (a sketch follows the list below):

  • Create a line plot of coefficient values vs. the hyperparameter value.
  • Color code each predictor’s coefficient values.
  • Use log scale where necessary.
  • Use an alphas vector of np.logspace(7,0,200) for better visualization

(5 points)
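
A sketch of one way to build this plot, refitting a Ridge model for each hyperparameter value:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

alphas = np.logspace(7, 0, 200)
coefs = [Ridge(alpha=a).fit(X_train_s, y_train).coef_ for a in alphas]

plt.plot(alphas, coefs)   # one color-coded line per predictor
plt.xscale("log")         # the alphas span several orders of magnitude
plt.xlabel("alpha")
plt.ylabel("coefficient value")
plt.legend(X_train.columns, fontsize=6)
plt.show()
```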

j)

Recreate some of the previous steps with Lasso regression; a sketch follows the list below.

  • Using LassoCV only, find the optimal hyperparameter value. (2 points)
    • You need a different hyperparameter array - use: np.logspace(0,-2.5,200)
    • Use 10-fold CV.
    • Lasso object does not have a scoring input.
  • Using the optimized and trained Lasso model, print the RMSEs for the training set and both test sets. (2 points)
  • Visualize how the coefficients change with the change in the hyperparameter value. (2 points)
    • Use the hyperparameter array as np.logspace(7,-2.5,200) for better visualization.
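
A minimal sketch of the LassoCV piece (LassoCV picks the alpha that minimizes the mean squared error across folds, since it has no scoring input):

```python
import numpy as np
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=np.logspace(0, -2.5, 200), cv=10).fit(X_train_s, y_train)
print(lasso_cv.alpha_)  # optimal hyperparameter value
```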

k)

Using the two figures created in parts i and j, explain how the Ridge and the Lasso models behave differently as the hyperparameter value changes. (2 points) What does that difference mean for the usage of the Lasso model? (1 point)

l)

Find the predictors that are eliminated by Lasso regularization. (2 points)
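
A short sketch, assuming lasso_cv is the fitted model from part j:

```python
# Predictors whose coefficients were driven exactly to zero by Lasso
print(X_train.columns[lasso_cv.coef_ == 0])
```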

2) Improving House Price Prediction with Higher-order Terms and Cross-validation (29 points)

In this question, we are interested in improving the prediction performance for house prices using five predictors.

a)

Read the house feature and price files and create the training and test datasets. The response is log-price and the five predictors are the rest of the variables, except house_id. (2 points)

b)

In class, we saw how an entirely linear model is not enough to capture the complexity in the relationship between the response and the predictors - in other words, it is underfitting. We want to analyze how the training and test performance change as the level of model complexity increases.

Using the PolynomialFeatures object, create higher-order versions of the predictors (both transformations and interactions) in the training and test data. (3 points) Using all predictors (linear and transformed), train a Ridge model with alpha=0.000001 (2 points) and store the training and test RMSEs. (2 points) Repeat this process from order 1 to order 6. (2 points)

Finally, plot the training and test RMSE values on the same figure against the order. (1 point) Make sure the two lineplots have different colors and a legend is included. (1 point) A sketch of the full loop follows the notes below.

Note:

  • This question needs a loop.
  • The PolynomialFeatures object keeps the lower-order terms (k-1 down to 1) while creating new predictors of order k, so there is no need to concatenate.
  • Don’t forget to exclude the bias term created by default with PolynomialFeatures.
  • Don’t forget to scale (correctly) (2 points)
  • Minimal regularization is necessary for this question, as opposed to pure Linear Regression, otherwise the test RMSE values go to infinity very quickly for higher orders.
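
A sketch of the full loop, assuming X_train/y_train and X_test/y_test are the question-2 data from part a:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

orders = range(1, 7)
train_rmses, test_rmses = [], []
for k in orders:
    # Higher-order terms and interactions, without the bias column
    poly = PolynomialFeatures(degree=k, include_bias=False)
    X_tr = poly.fit_transform(X_train)
    X_te = poly.transform(X_test)

    # Scale with statistics computed on the training data only
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    model = Ridge(alpha=0.000001).fit(X_tr, y_train)
    train_rmses.append(np.sqrt(mean_squared_error(y_train, model.predict(X_tr))))
    test_rmses.append(np.sqrt(mean_squared_error(y_test, model.predict(X_te))))

plt.plot(orders, train_rmses, color="blue", label="training RMSE")
plt.plot(orders, test_rmses, color="red", label="test RMSE")
plt.xlabel("order")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```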

c)

Which order has the best test RMSE? (1 point) What is the best test RMSE? (1 point) At which order does the overfitting start? (1 point)

d)

Repeat part b, only this time use RidgeCV to find the best amount of regularization for each order by cross-validation. Use alphas = np.logspace(2,0.5,200) and LOOCV. Use neg_root_mean_squared_error for scoring. Create the same plot as part b. (4 points) Describe the obvious difference between the plot in this part and the plot in part b. (2 points)

e)

What is the best test RMSE found by using higher-orders and regularization? (1 point) Which order achieved this test RMSE? (1 point) Why did this order with regularization perform better than any lower order with or (almost) without regularization? (3 points)

3) Systematic Elimination of Interaction Terms (30 points)

In this question, we are interested in predicting whether a client subscribed to a term deposit after a phone call, using the age and education of the client and the day and month the call took place.

Note that this is the same problem as in the previous assignment; however, using sklearn, we aim to make the predictive analysis with interactions more systematic.

a)

Read train.csv, test1.csv, and test2.csv. Prepare the training and two test datasets according to the description above. (2 points)

b)

For all datasets (a sketch follows this list):

  • One-hot-encode the categorical predictors. (2 points)
  • Get the interactions of all the predictors. (Numeric and one-hot-encoded) (3 points)
    • Note that there is a very quick way of doing this with PolynomialFeatures
    • Don’t forget to exclude the bias.
  • Scale the predictors (correctly). (2 points)
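
A sketch of the preprocessing, assuming X_train, X_test1, and X_test2 hold the four predictors (aligning the dummy columns is a simplification that assumes the test sets contain no unseen categories):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# One-hot-encode the categorical predictors and align the column order
X_train = pd.get_dummies(X_train)
X_test1 = pd.get_dummies(X_test1)[X_train.columns]
X_test2 = pd.get_dummies(X_test2)[X_train.columns]

# All pairwise interactions (plus the original terms), without the bias
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_p = poly.fit_transform(X_train)
X_test1_p = poly.transform(X_test1)
X_test2_p = poly.transform(X_test2)

# Scale with statistics computed on the training data only
scaler = StandardScaler().fit(X_train_p)
X_train_s = scaler.transform(X_train_p)
X_test1_s = scaler.transform(X_test1_p)
X_test2_s = scaler.transform(X_test2_p)
```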

c)

Train a Logistic Regression model with Lasso penalty. (2 points) The idea is to discard interactions that are not useful. Note that instead of the manual, trial-and-error way of adding interactions in statsmodels, here we include all the possible interactions and then discard the useless ones. A sketch follows the list below.

  • Use [0.0001,0.001, 0.01, 0.1, 1, 10, 100, 1000] as the possible C values. (1 point)
  • Use 10-fold cross-validation to optimize the C value. (1 point)
  • Lasso is very useful, but it needs special algorithms, since it includes non-differentiable absolute values. Use saga as the solver. (1 point)
  • For the same reason as above, the default number of iterations the algorithm takes is usually not enough for Lasso. Use max_iter = 1000. (The default is 100.) (1 point)
  • This will take 10-20 minutes to run.
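
A minimal sketch using LogisticRegressionCV (one way to do the 10-fold search over C), reusing the scaled matrices from part b:

```python
from sklearn.linear_model import LogisticRegressionCV

log_cv = LogisticRegressionCV(
    Cs=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
    cv=10,          # 10-fold cross-validation over the C values
    penalty="l1",   # Lasso penalty
    solver="saga",  # handles the non-differentiable L1 term
    max_iter=1000,
).fit(X_train_s, y_train)
```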

d)

How many models in total are run by this cross-validation process? (2 points)

e)

What is the optimum C value? (1 point) What is the lambda (in the Lasso cost) value it corresponds to? (1 point)

f)

What is the percentage of terms (linear or interaction) that are discarded by Lasso? (Hint: .coef_) (2 points)
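
A short sketch (for a binary response, .coef_ has shape (1, n_features)):

```python
import numpy as np

# Percentage of coefficients driven exactly to zero by the L1 penalty
print(100 * np.mean(log_cv.coef_[0] == 0))
```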

g)

Find the five terms that have the highest effect on the log-odds of a subscription. Assume that we are quantifying the effect of a term with the absolute value of its coefficient. (Hint: .get_feature_names_out()) (4 points)
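
A possible sketch, assuming poly is the PolynomialFeatures object fitted in part b:

```python
import pandas as pd

feature_names = poly.get_feature_names_out(X_train.columns)
coefs = pd.Series(log_cv.coef_[0], index=feature_names)
print(coefs.abs().sort_values(ascending=False).head(5))
```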

h)

Come up with real-life explanations on why the terms identified in the previous part are important. (This is an open-ended question, just make sure your answer makes sense.) (2 points)

i)

Lastly, find a threshold that gets all three datasets (training and both test) above 75% accuracy and 50% recall. Note that you only need to worry about the threshold now; Lasso took care of finding good interactions. (3 points)
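
One possible search, scanning a grid of thresholds until all three datasets clear both cutoffs (variable names carried over from the earlier sketches):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

data = {"train": (X_train_s, y_train),
        "test1": (X_test1_s, y_test1),
        "test2": (X_test2_s, y_test2)}

for t in np.arange(0.05, 0.95, 0.01):
    if all(accuracy_score(y, log_cv.predict_proba(X)[:, 1] >= t) > 0.75 and
           recall_score(y, log_cv.predict_proba(X)[:, 1] >= t) > 0.50
           for X, y in data.values()):
        print("threshold:", round(t, 2))
        break
```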

4) Bonus: ElasticNet (11 points)

The entire goal of this part is to get you familiar with an alternate model: ElasticNet. It is implemented by adding both Lasso and Ridge penalties to the RSS (or subtracting them from the log-likelihood). How much Lasso and how much Ridge penalty to apply is controlled by two hyperparameters.

a)

For regression, sklearn has its own ElasticNet object and a CV version of it.

Do your own research and implement a 5-fold cross-validation for the options of 25% Lasso-75% Ridge, 50% Lasso-50% Ridge, and 75% Lasso-25% Ridge, and the alpha values of alphas = np.logspace(10,0.1,200); a sketch follows at the end of this part.

  • Use the dataset given in the first question (with the same columns dropped).
  • You still need to scale.
  • Return the best Lasso-Ridge ratio and the alpha value pair that corresponds to the best test performance.
  • Note that even if you use the CV object, you have to use loops.

(8 points)
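
A minimal sketch of the loop over the three penalty mixes with ElasticNetCV, assuming the scaled question-1 matrices from before (here test1 RMSE stands in for "test performance"; adjust as needed):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error

alphas = np.logspace(10, 0.1, 200)
results = {}
for ratio in [0.25, 0.50, 0.75]:  # l1_ratio = fraction of Lasso in the penalty
    enet = ElasticNetCV(l1_ratio=ratio, alphas=alphas, cv=5).fit(X_train_s, y_train)
    rmse = np.sqrt(mean_squared_error(y_test1, enet.predict(X_test1_s)))
    results[(ratio, enet.alpha_)] = rmse

print(min(results, key=results.get))  # (best ratio, best alpha) pair
```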

b)

How many models were run in the cross-validation process of two hyperparameters? (1 point)

c)

Briefly mention how you would implement ElasticNet for Logistic Regression. Again, you need to do your own research. (2 points)