Assignment 2 (Section 21)

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Monday, 22nd April 2024 at 11:59 pm.

  5. Five points are for properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (2 points). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering with Quarto in the comments section of Canvas, and submit the .ipynb file.
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)

1) Tuning a KNN Classifier with Sklearn Tools (40 points)

In this question, you will use classification_data.csv. Each row is a loan and each column represents some financial information as follows:

  • hi_int_prncp_pd: Indicates if a high percentage of the repayments went to interest rather than principal. This is the classification response.

  • out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors

  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

  • int_rate: Interest Rate on the loan

  • term: The number of payments on the loan. Values are in months and can be either 36 or 60.

  • mort_acc: The number of mortgage accounts

  • application_type_Individual: 1 if the loan is an individual application or a joint application with two co-borrowers

  • tot_cur_bal: Total current balance of all accounts

  • pub_rec: Number of derogatory public records

As indicated above, hi_int_prncp_pd is the response and all the remaining columns are predictors. You will tune and train a K-Nearest Neighbors (KNN) classifier throughout this question.

1a)

Read the dataset. Create the predictor and the response variables.

Create the training and the test data with the following specifications:

  • The split should be 75%-25%.
  • Ensure that the class ratio is preserved in the training and the test datasets, i.e. the data is stratified.
  • Use random_state=45.

Print the class ratios of the entire dataset, the training set and the test set to check if the ratio is kept the same.

(1 point)
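A minimal sketch of the stratified split. The synthetic data below is only a stand-in; in the assignment you would instead read classification_data.csv with pd.read_csv and separate hi_int_prncp_pd as the response:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data; in the assignment, replace this with:
#   data = pd.read_csv("classification_data.csv")
#   X = data.drop(columns="hi_int_prncp_pd"); y = data["hi_int_prncp_pd"]
X, y = make_classification(n_samples=400, n_features=8, weights=[0.7, 0.3],
                           random_state=0)
y = pd.Series(y)

# 75%-25% stratified split preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=45
)

# The class ratios of the full data, training set, and test set should match
print(y.value_counts(normalize=True).round(2))
print(y_train.value_counts(normalize=True).round(2))
print(y_test.value_counts(normalize=True).round(2))
```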

1b)

Scale the datasets. The data is ready for modeling at this point.

Before creating and tuning a model, you need to create a sklearn cross-validation object to ensure the most accurate representation of the data among all the folds.

Use the following specifications for your cross-validation settings:

  • Make sure the data is stratified in all the folds (use StratifiedKFold()).
  • Use 5 folds.
  • Shuffle the data for more randomness.
  • Use random_state=14.

(1 point)

Note that you need to use these settings for the rest of this question (Q1) for consistency.
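A minimal sketch of the scaling and cross-validation setup. The random arrays below are hypothetical placeholders for the X_train / X_test produced in part 1(a); note the scaler is fit on the training data only:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the split data from part 1(a)
X_train = np.random.default_rng(0).normal(size=(300, 8))
X_test = np.random.default_rng(1).normal(size=(100, 8))

# Learn scaling statistics on the training data, apply to both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 5 stratified, shuffled folds, as specified
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)
```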

Cross-validate a KNN classifier with the following specifications:

  • Use every odd K value between 1 and 50 (including 1).
  • Fix the weights at “uniform”, which is the default.
  • Use the cv object you created above in part 1(b).
  • Use accuracy as the metric.

(4 points)

Print the best average cross-validation accuracy and the K value that corresponds to it. (2 points)
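One way to sweep the odd K values is a loop over cross_val_score. The synthetic data here is a stand-in for the scaled training data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

K_values = range(1, 50, 2)  # every odd K from 1 to 49
mean_acc = [cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y,
                            cv=cv, scoring="accuracy").mean()
            for K in K_values]

best_idx = int(np.argmax(mean_acc))
print("Best K:", list(K_values)[best_idx], "CV accuracy:", mean_acc[best_idx])
```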

1c)

Using the optimal K value you found in part 1(b), find the threshold that maximizes the cross-validation accuracy with the following specifications:

  • Use all the possible threshold values with a stepsize of 0.01.
  • Use the cross-validation settings you created in part 1(b).
  • Use accuracy as the metric, which is the default.

(4 points)

Print the best cross-validation accuracy (1 point) and the threshold value that corresponds to it. (1 point)
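One possible approach is to get out-of-fold predicted probabilities with cross_val_predict and then sweep the thresholds; note this scores the pooled out-of-fold predictions rather than averaging per-fold accuracies, which is one of several reasonable conventions. The data and best_K value below are hypothetical stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)
best_K = 11  # placeholder for the optimal K found in part 1(b)

# Out-of-fold predicted probabilities for the positive class
probs = cross_val_predict(KNeighborsClassifier(n_neighbors=best_K), X, y,
                          cv=cv, method="predict_proba")[:, 1]

# All thresholds from 0 to 1 with a stepsize of 0.01
thresholds = np.linspace(0, 1, 101)
accs = [((probs >= t).astype(int) == y).mean() for t in thresholds]
best = int(np.argmax(accs))
print("Best threshold:", thresholds[best], "CV accuracy:", accs[best])
```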

1d)

Is the method we used in parts 1(b) and 1(c) guaranteed to find the best K & threshold combination, i.e. tune the classifier to its best values? (1 point) Why or why not? (1 point)

1e)

Use the tuned classifier and threshold to find the test accuracy. (2 points)

How does it compare to the cross-validation accuracy, i.e. is the model generalizing well? (1 point)

1f)

Now, you need to tune K and the threshold at the same time. Use the following specifications:

  • Use every odd K value between 1 and 50 (including 1).
  • Fix the weights at “uniform”.
  • Use all the possible threshold values with a stepsize of 0.01.
  • Use accuracy as the metric.

(5 points)

Print the best cross-validation accuracy, and the K and threshold values that correspond to it. (1 point)
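The joint search can be sketched as a nested sweep: one pass of out-of-fold probabilities per K, then all thresholds for that K. Synthetic stand-in data again:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)
thresholds = np.linspace(0, 1, 101)

best = (0.0, None, None)  # (accuracy, K, threshold)
for K in range(1, 50, 2):
    probs = cross_val_predict(KNeighborsClassifier(n_neighbors=K), X, y,
                              cv=cv, method="predict_proba")[:, 1]
    for t in thresholds:
        acc = ((probs >= t).astype(int) == y).mean()
        if acc > best[0]:
            best = (acc, K, t)

print("Best CV accuracy:", best[0], "at K =", best[1], "threshold =", best[2])
```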

1g)

How does the best cross-validation accuracy in part 1(f) compare to parts 1(b) and 1(c)? (1 point) Did the K and threshold value change? (1 point) Explain why or why not. (2 points)

1h)

Use the tuned classifier and threshold from part 1(f) to find the test accuracy. (1 point)

1i)

Compare the methods you used in parts 1(b) & 1(c) with the method you used in part 1(f) in terms of computational power. How many K & threshold pairs did you try in each? (2 points) Combining your answer with the answer in part 1(g), explain the main trade-off while tuning a model. (2 points)

1j)

Cross-validate a KNN classifier with the following specifications:

  • Use every odd K value between 1 and 50 (including 1).
  • Fix the weights at “uniform”.
  • Use accuracy, precision, and recall as three metrics at the same time.

Find the K value that maximizes recall while having a precision above 75%. (3 points) Print the average cross-validation results of that K value. (1 point)

Which metric (among precision, recall, and accuracy) seems to be the least sensitive to the value of K? Why? (3 points)
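Multiple metrics can be collected in one pass with cross_validate by passing a tuple of scorers. A sketch on stand-in data, filtering for precision above 75% and then maximizing recall:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

rows = []
for K in range(1, 50, 2):
    scores = cross_validate(KNeighborsClassifier(n_neighbors=K), X, y, cv=cv,
                            scoring=("accuracy", "precision", "recall"))
    rows.append({"K": K,
                 "accuracy": scores["test_accuracy"].mean(),
                 "precision": scores["test_precision"].mean(),
                 "recall": scores["test_recall"].mean()})

results = pd.DataFrame(rows)
# Among K values with average precision above 75%, pick the one maximizing recall
eligible = results[results["precision"] > 0.75]
if not eligible.empty:
    print(eligible.loc[eligible["recall"].idxmax()])
```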

2) Tuning a KNN Regressor with Sklearn Tools (55 points)

In this question, you will use bank_loan_train_data.csv to tune the model hyperparameters and train the model. Each row is a loan and each column represents some financial information as follows:

  • money_made_inv: Indicates the amount of money made by the bank on the loan. This is the regression response.

  • out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors

  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

  • int_rate: Interest Rate on the loan

  • term: The number of payments on the loan. Values are in months and can be either 36 or 60

  • mort_acc: The number of mortgage accounts

  • application_type_Individual: 1 if the loan is an individual application or a joint application with two co-borrowers

  • tot_cur_bal: Total current balance of all accounts

  • pub_rec: Number of derogatory public records

As indicated above, money_made_inv is the response and all the remaining columns are predictors. You will tune and train a K-Nearest Neighbors (KNN) regressor throughout this question.

2a)

Find the optimal hyperparameter values and the corresponding optimal cross-validated RMSE. The hyperparameters that you must consider are

  1. Number of nearest neighbors,

  2. Weight of the neighbor, and

  3. the power p of the Minkowski distance.

For the weights hyperparameter, in addition to uniform and distance, consider 3 custom weights as well. The custom weights to consider are weight inversely proportional to distance squared, weight inversely proportional to distance cube, and weight inversely proportional to distance raised to the power of 4. Mathematically, these weights can be written as:

$weight \propto 1$,

$weight \propto \frac{1}{distance}$,

$weight \propto \frac{1}{distance^2}$,

$weight \propto \frac{1}{distance^3}$,

$weight \propto \frac{1}{distance^4}$

Show all three search approaches - grid search, random search, and Bayes search. As this is a simple problem, all three approaches should yield the same result.

For Bayes search, show the implementation of real-time monitoring of cross-validation error.

None of the cross-validation approaches should take more than a minute as this is a simple problem.

Hint:

Create three different user-defined functions. Each function should take one input, named distance, and return 1/(1e-10 + distance**n), where n is 2, 3, and 4, respectively. Note that the 1e-10 is to avoid division by zero when the distance is zero.

Name your functions dist_power_n, where n is 2, 3, and 4, respectively. You can use these function names as the weights input to a KNN model.

(15 points)
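A sketch of the grid-search variant, using the custom weight functions from the hint on synthetic stand-in data. RandomizedSearchCV takes the same param_grid (as param_distributions), and if scikit-optimize is installed, skopt's BayesSearchCV additionally accepts a callback argument to fit() for the real-time monitoring requirement; the grid sizes below are illustrative, not prescribed:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Custom weight functions from the hint; the 1e-10 guards against division by zero
def dist_power_2(distance):
    return 1 / (1e-10 + distance**2)

def dist_power_3(distance):
    return 1 / (1e-10 + distance**3)

def dist_power_4(distance):
    return 1 / (1e-10 + distance**4)

# Synthetic stand-in for the (scaled) training data
X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=0)

param_grid = {
    "n_neighbors": range(1, 16, 2),
    "weights": ["uniform", "distance", dist_power_2, dist_power_3, dist_power_4],
    "p": [1, 2, 3],  # power of the Minkowski distance
}
grid = GridSearchCV(KNeighborsRegressor(), param_grid,
                    scoring="neg_root_mean_squared_error",
                    cv=KFold(n_splits=5, shuffle=True, random_state=14))
grid.fit(X, y)
print("Best params:", grid.best_params_)
print("Best CV RMSE:", -grid.best_score_)
```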

2b)

Based on the optimal model in 2(a), find the RMSE on test data (bank_loan_test_data.csv). It must be less than $1400.

Note: You will achieve the test RMSE if you tuned the hyperparameters well in 2(a). If you did not, redo 2(a). You are not allowed to use test data for tuning the hyperparameter values.

(2 points)

2c)

KNN performance may deteriorate significantly if irrelevant predictors are included. We’ll add variable selection to the cross-validation procedure, along with tuning the hyperparameters for the selected variables.

Use a variable selection method to consider the best ‘r’ predictors, optimize the hyperparameters specified in 2(a), and compute the cross-validation error for those ‘r’ predictors. Note that ‘r’ will vary from 1 to 7, thus you will need to do 7 cross-validations - one for each ‘r’.

Report the optimal value of ‘r’, the ‘r’ predictors, the optimal hyperparameter values, and the optimal cross-validated RMSE.

You are free to use any search method.

Hint: You may use Lasso to consider the best ‘r’ predictors as that is the only variable selection you have learned so far.

(20 points)
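Following the hint, one way to pick the best ‘r’ predictors is to rank them by the magnitude of their Lasso coefficients on standardized data, then cross-validate a KNN model on the top r columns for each r. A sketch on stand-in data (the alpha value is illustrative, and the KNN tuning step from 2(a) is elided):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Stand-in for the training data with its 8 predictors
X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=0)
Xs = StandardScaler().fit_transform(X)

# Rank predictors by the magnitude of their standardized Lasso coefficients
lasso = Lasso(alpha=1.0).fit(Xs, y)
ranking = np.argsort(-np.abs(lasso.coef_))

for r in range(1, 8):
    top_r = ranking[:r]   # column indices of the best r predictors
    X_r = Xs[:, top_r]
    # ...tune the KNN hyperparameters of 2(a) on X_r and record the CV RMSE...
    print(f"r = {r}: using columns {sorted(top_r.tolist())}")
```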

2d)

Find the RMSE on test data based on the optimal model in 2(c). Your test RMSE must be less than $800.

Note: You will achieve the test RMSE if you tuned the hyperparameters well in 2(c). If you did not, redo 2(c). You are not allowed to use test data for tuning the hyperparameter values.

(2 points)

2e)

How did you decide the range of hyperparameter values to consider in this question? Discuss for p and n_neighbors.

(4 points)

2f)

Is it possible to further improve the results if we also optimize the metric hyperparameter along with the hyperparameters specified in 2(a)? Why or why not?

(4 points)

2g)

What is the benefit of using the RepeatedKFold() function over the KFold() function of the model_selection module of the sklearn library? Explain in terms of bias-variance of test error. Did you observe any benefit of using RepeatedKFold() over KFold() in Q2? Why or why not?

(4 + 4 points)
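For reference, the two objects differ only in how many fold-level scores they produce, a sketch:

```python
from sklearn.model_selection import KFold, RepeatedKFold

kf = KFold(n_splits=5, shuffle=True, random_state=14)
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=14)

# KFold scores the model on 5 folds once; RepeatedKFold repeats the entire
# 5-fold procedure 10 times with different shuffles, so the CV error is
# averaged over 50 fold scores, reducing the variance of the estimate.
print(kf.get_n_splits(), rkf.get_n_splits())
```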