Assignment 3 (Section 20)

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Tuesday, 20th February 2024 at 11:59 pm.

  5. There is a bonus question worth 12 points.

  6. Five points are for properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.
    • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
  7. The maximum possible score in the assignment is 103 + 12 + 5 = 120 out of 100.

Introduction (0 points)

Read train.csv, test1.csv, and test2.csv. All datasets are about direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, in which bank clients were asked to subscribe to a term deposit. Each observation is a phone call and each column is a variable about the client or the phone call. The columns are described as follows:

  1. age: Age of the client

  2. education: Education level of the client

  3. day: Day of the month the call is made

  4. month: Month of the call

  5. y: did the client subscribe to a term deposit? (This is the classification response.)

  6. duration: Call duration, in seconds. This variable highly affects the output (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is usually known. Therefore, this variable should only be used for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Source: UCI Data Archive. Please use the given datasets for the assignment, not the raw data from the source. It is just for reference.)

1) Investigating the Effect of Call Duration on Subscription Probability (44 points)

1a)

First of all, you need a numeric response value for your statsmodels functions. (Numeric, not Boolean!) Convert the response column, y, of the training data (train.csv) into 0s for ‘no’ and 1s for ‘yes’. (2 points)
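One way to do this kind of conversion is sketched below on a tiny stand-in data frame. The frame and its values are made up for illustration; only the column name y and the labels ‘no’ and ‘yes’ come from the description above.

```python
import pandas as pd

# Toy stand-in for train.csv (made-up rows, for illustration only).
train = pd.DataFrame({"duration": [120, 640, 35], "y": ["no", "yes", "no"]})

# Map the string labels to integers (0/1), not to booleans.
train["y"] = train["y"].map({"no": 0, "yes": 1}).astype(int)

print(train["y"].tolist())
print(train["y"].dtype)
```

Mapping through an explicit dictionary makes an unexpected label show up as a missing value (and fail the integer cast) instead of being silently coded.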

1b)

Using the training data and statsmodels, train a logistic regression model to predict if the client subscribed to a term deposit using the call duration. Print its summary. (2 points)

You need to use this model to answer all the remaining parts of this question.

1c)

Is the effect of the call duration on the probability of the client subscription statistically significant? Justify your answer. (2 points)

1d)

What is the probability of the client subscribing after a 5-minute marketing call? (Note that the duration variable is given in seconds.) (3 points)
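As a reminder of the mechanics, a logistic regression with one predictor turns a duration into a probability through the sigmoid of the linear predictor. The sketch below uses hypothetical coefficients b0 and b1 (they are not fitted to train.csv) and a hypothetical helper name.

```python
import math

def subscription_probability(b0, b1, duration_seconds):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1 * duration)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * duration_seconds)))

# Hypothetical intercept and slope, NOT the fitted values from train.csv.
b0, b1 = -3.0, 0.004

# The model works in seconds, so convert minutes before plugging in.
p_5min = subscription_probability(b0, b1, 5 * 60)
print(p_5min)
```

With a fitted statsmodels model, the same number would normally come from its prediction method; the manual formula is shown only to make the unit conversion explicit.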

1e)

How many minutes are necessary to have at least 95% probability of the client subscribing? Print it in the following format: “… minutes or higher” (5 points: 4 points for the calculation, 1 point for formatting)
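Going from a target probability back to a duration means inverting the sigmoid: solve log(p / (1 - p)) = b0 + b1 * duration for the duration. A minimal sketch, again with hypothetical coefficients that are not fitted to train.csv:

```python
import math

# Hypothetical intercept and slope, NOT the fitted values from train.csv.
b0, b1 = -3.0, 0.004
target_p = 0.95

# Invert the logistic function: log-odds = b0 + b1 * seconds.
log_odds = math.log(target_p / (1.0 - target_p))
seconds_needed = (log_odds - b0) / b1
minutes_needed = seconds_needed / 60.0

print(f"{minutes_needed:.2f} minutes or higher")
```

Because the slope here is positive, any longer call gives a higher predicted probability, which is why the result reads as a lower bound.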

1f)

What is the percentage increase in the odds of a client subscribing when the call duration increases by a minute? (4 points)

1g)

How many minutes need to be added to a call to double the odds of a client subscribing? (3 points)
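Both odds questions follow from the fact that, in a logistic regression, adding d units to the predictor multiplies the odds by exp(b1 * d). The sketch below illustrates this with a hypothetical per-second slope b1 (not the fitted value).

```python
import math

# Hypothetical slope per second of call duration, NOT the fitted value.
b1 = 0.004

# One extra minute = 60 extra seconds, so the odds are multiplied by exp(60 * b1).
odds_ratio_per_minute = math.exp(60 * b1)
pct_increase_per_minute = (odds_ratio_per_minute - 1.0) * 100.0

# Doubling the odds: exp(b1 * seconds) = 2, so seconds = ln(2) / b1.
minutes_to_double = math.log(2.0) / b1 / 60.0

print(f"{pct_increase_per_minute:.2f}% increase in odds per extra minute")
print(f"{minutes_to_double:.2f} extra minutes to double the odds")
```

Note that neither quantity depends on the intercept or on the starting duration; only the slope matters.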

1h)

After exploring the model coefficients, it is time to see the limitations of this rather simplistic model.

What is the maximum call duration (in minutes) after which the client refused to subscribe? (Note that you need the dataset itself for this question, not the model.) (2 points) What is the subscription probability that the model predicts for that client? (2 points)

1i)

Use a scatterplot to visualize the data. You need to plot the classes against the duration values. (2 points) Add some small random noise (also called jitter) on the class values to visualize as many observations as possible. (1 point) On top of that, add the curve that the model fits to this data. (2 points) You should see a sigmoid fit without its bottom end. Does it look like the duration of the call is enough by itself to predict the client subscription? (1 point) Why or why not? (1 point)

1j)

Compute the accuracy and recall using a threshold of 0.5 for test1.csv and test2.csv. You should print 4 numbers. (4 points) You should see very different values for the different metrics. Explain why this is happening and which metric is a better evaluation of the prediction performance. (3 points)

Hint: Checking the value counts of the response variable might be a good idea.

1k)

Repeat the previous question with a threshold of 0.3. Did the accuracy change much? How about recall? Explain why the results changed (or not) in terms of the confusion matrix elements. (5 points: 1 point for the calculation, 2 points for the recall explanation, 2 points for the accuracy explanation)
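The mechanics of thresholding can be sketched on made-up numbers. The labels and probabilities below are invented for illustration and are not taken from the test files; the helper name is hypothetical. The point is only to show how accuracy and recall are built from the confusion-matrix counts at a given threshold.

```python
import numpy as np

def accuracy_and_recall(y_true, y_prob, threshold):
    """Threshold predicted probabilities, then compute accuracy and recall."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    accuracy = np.mean(y_pred == y_true)
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return accuracy, recall

# Made-up, imbalanced toy data: 8 negatives and 2 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.35, 0.40, 0.60])

for threshold in (0.5, 0.3):
    acc, rec = accuracy_and_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: accuracy={acc:.2f}, recall={rec:.2f}")
```

In this toy case, lowering the threshold converts a false negative into a true positive and a true negative into a false positive, so recall moves while accuracy stays put. Whether the real data behaves the same way is for you to check.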

2) Exploring Variable Interactions (10 points)

2a)

Using the training data and statsmodels, train a logistic regression model to predict if the client subscribed to a term deposit using the education level and the age. Assume that the effect of age on the log-odds depends on the education level of the client. Print the summary. (2 points)

You need to use this model to answer all the remaining parts of this question.

2b)

People with which level of education have the highest percentage increase in odds of a client subscribing with a unit increase in age? Justify your answer. (4 points)
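With an age-by-education interaction, the slope of age on the log-odds for a given education level is the age coefficient plus that level's interaction coefficient (the reference level has no interaction term). The sketch below assumes three education levels with "primary" as the reference, hypothetical coefficient values, and coefficient names written in the style that formula-based models commonly produce; all of these are assumptions, so check them against your own summary.

```python
import math

# Hypothetical coefficients (NOT fitted values); names and levels are assumptions.
coefs = {
    "age": -0.010,
    "age:education[T.secondary]": 0.005,
    "age:education[T.tertiary]": 0.020,
}
levels = ["primary", "secondary", "tertiary"]  # "primary" assumed to be the reference

pct_increase = {}
for level in levels:
    # Age slope for this level = base age slope + interaction term (0 for the reference).
    slope = coefs["age"] + coefs.get(f"age:education[T.{level}]", 0.0)
    pct_increase[level] = (math.exp(slope) - 1.0) * 100.0

best_level = max(pct_increase, key=pct_increase.get)
print(pct_increase)
print(best_level)
```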

2c)

What is the maximum age of a client with tertiary education to have 15% subscription probability or lower? (4 points)

3) Model Development and Evaluation (29 points)

3a)

Using the training data and statsmodels, train a logistic regression model to predict if the client subscribed to a term deposit using age, education, day, and month. The model must have:

  • A minimum of 75.0% accuracy for all three datasets. (train.csv, test1.csv, and test2.csv) (3 points)
  • A minimum of 50.0% recall for all three datasets. (6 points)

Print the model summary. (1 point) For all three datasets, print the accuracy scores, recall scores and confusion matrices. (3 points)

Notes:

  1. You cannot use duration as a predictor. The reason is explained in the description of the dataset in the Introduction. (No credit for the entire question for models that use duration.)
  2. Explore some interactions and transformations. You do not need to go too high. (Still ok if you do and pass the given cutoffs.)
  3. You are free to choose the decision threshold as you wish. However, you must use the same threshold for all three datasets. (No credit for the entire question for using different thresholds.)
  4. No rounding. (For example, a recall of 49.9% is not considered correct.)

You need to use this model (and threshold, unless stated otherwise) to answer all the remaining parts of this question.

3b)

What is the probability that the model will predict a higher probability for a client who will subscribe compared to a client who will not? Justify your answer. (3 points)
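The quantity described here can be computed directly by comparing every (subscriber, non-subscriber) pair, which is what the area under the ROC curve measures. The sketch below does this by brute force on made-up labels and scores; the function name is hypothetical, and ties are counted as half by assumption.

```python
import numpy as np

def pairwise_ranking_probability(y_true, y_prob):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)  # ties count as half
    return wins / diff.size

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.80]

print(pairwise_ranking_probability(y_true, y_prob))
```

The pairwise matrix grows with the product of the class sizes, so on the full datasets a library routine that computes the area under the ROC curve is the more practical choice.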

3c)

Assume that you want to project all your prediction results for test1.csv and test2.csv to real-life profits. Assume that:

  • Only the clients who are predicted to subscribe are called.
  • A client who is called and subscribes returns a profit of $100.
  • A client who is called and does not subscribe returns a loss of $10.

What is the net profit of the results? (3 points) Note that you need to use the confusion matrices printed in part a.
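Under the assumptions listed above, only the "predicted to subscribe" column of a confusion matrix contributes to the profit: true positives earn $100 each and false positives cost $10 each. A minimal sketch with a made-up confusion matrix, assuming the common layout where rows are actual classes and columns are predicted classes:

```python
def net_profit(tp, fp, gain_per_subscriber=100, loss_per_refusal=10):
    """Profit when only clients predicted to subscribe are called."""
    return gain_per_subscriber * tp - loss_per_refusal * fp

# Made-up confusion matrix, laid out as [[TN, FP], [FN, TP]] (an assumption).
confusion = [[700, 200],
             [40, 60]]
tn, fp = confusion[0]
fn, tp = confusion[1]

profit = net_profit(tp, fp)
print(profit)
```

Clients who are not called (true negatives and false negatives) contribute nothing under these assumptions, even though each false negative is a missed subscriber.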

3d)

Using the same assumptions in part c, find the threshold that would maximize the net profit. (5 points) Use the training data for this.

This is probably the most challenging part of this assignment. Here are some suggestions:

  • You do not need to calculate results for every possible threshold. You should already have some metrics calculated for a large array of thresholds in part b.
  • Use those metrics and the proportion of class 1 observations in the dataset to find the net profit for all thresholds.
  • Find the index of the highest profit to find the threshold, again using the threshold array from part b.
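The suggestions above can be sketched as follows. This is an illustration on synthetic labels and scores, not on the assignment data, and it builds the TPR and FPR arrays by hand over an assumed grid of thresholds rather than reusing arrays from part b. With N observations and a class 1 proportion p, the number of true positives is N * p * TPR and the number of false positives is N * (1 - p) * FPR.

```python
import numpy as np

# Synthetic data for illustration only (NOT the assignment data).
rng = np.random.default_rng(0)
n = 2000
y_true = (rng.random(n) < 0.12).astype(int)
# Synthetic scores: class 1 observations tend to score higher.
y_prob = np.clip(0.15 + 0.20 * y_true + 0.15 * rng.standard_normal(n), 0.0, 1.0)

# Assumed threshold grid; each row of `pred` is the prediction at one threshold.
thresholds = np.linspace(0.0, 1.0, 101)
pred = y_prob[None, :] >= thresholds[:, None]

tpr = pred[:, y_true == 1].mean(axis=1)  # true positive rate per threshold
fpr = pred[:, y_true == 0].mean(axis=1)  # false positive rate per threshold

p1 = y_true.mean()  # proportion of class 1 observations
profit = n * (100 * p1 * tpr - 10 * (1 - p1) * fpr)

best_index = int(np.argmax(profit))
best_threshold = thresholds[best_index]
print(best_threshold, profit[best_index])
```

If the rate arrays come from a library ROC routine instead, the same profit formula applies as long as the thresholds array is the one returned alongside them.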

3e)

Using the new threshold you found in the previous part, calculate the net profit using test1.csv and test2.csv. (4 points)

3f)

Just an intuitive question: In a real-life setting like this, would you prefer the threshold you found in part a, which optimizes some mathematical metrics, or the threshold you found in part d, which maximizes your profit? (1 point)

4) Sklearn (20 points)

Using train.csv and only sklearn, pandas, and numpy, train a Logistic Regression model. You need the following steps:

  • The response is still y. (1 point)
  • Predictors are education, month, day and age. (1 point)
  • Numerical predictors need to be transformed to all their second-order polynomial versions. (3 points)
  • Categorical predictors need to be one-hot-encoded. (2 points) They should not interact with the numerical predictors. (2 points)
  • Afterwards, all the predictors need to be standard-scaled. (3 points)

Print the accuracy and recall for both training and test data using a threshold of 0.11. (5 points) Use test1.csv as the test dataset. Remember that the test dataset needs to go through the exact same transformation pipeline as the training dataset. (3 points)
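The overall shape of such a pipeline can be sketched as below. This is one possible arrangement on synthetic data, not the reference solution: the toy data generator, the category values, and the choice of PolynomialFeatures with degree 2 (which also includes the age-by-day product) are assumptions, and LogisticRegression is left at its default settings, which apply regularization.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

def make_toy_data(n, seed):
    """Synthetic stand-in for the CSV files (made-up values and categories)."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame({
        "age": rng.integers(18, 80, n),
        "day": rng.integers(1, 29, n),
        "education": rng.choice(["primary", "secondary", "tertiary"], n),
        "month": rng.choice(["jan", "may", "oct"], n),
    })
    y = (rng.random(n) < 0.15).astype(int)
    return X, y

X_train, y_train = make_toy_data(400, seed=0)
X_test, y_test = make_toy_data(150, seed=1)

numeric = ["age", "day"]
categorical = ["education", "month"]

# Polynomial terms are built from the numeric columns only, so the one-hot
# columns never interact with them. sparse_threshold=0 forces a dense output
# so that the scaler can center every column.
preprocess = ColumnTransformer(
    [
        ("poly", PolynomialFeatures(degree=2, include_bias=False), numeric),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,
)

model = Pipeline([
    ("preprocess", preprocess),
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

threshold = 0.11
for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    y_pred = (model.predict_proba(X)[:, 1] >= threshold).astype(int)
    print(name, accuracy_score(y, y_pred), recall_score(y, y_pred))
```

Fitting the whole pipeline on the training data and only calling predict_proba on the test data is what keeps the test set on the exact same transformations: the polynomial terms, categories, means, and standard deviations are all learned from the training data alone.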

5) Bonus: Data Visualization with Precision, Recall and FPR (12 points)

5a)

Plot the ROC curve for the model you trained in Question 3. Mark the point that corresponds to the decision threshold you found in 3d. (You can use a red solid point or whatever shape/color you want, as long as it is clearly marked.) Make sure you have the axis labels and the x=y line for comparison. (3 points)

5b)

Convert the previous plot to a scatter plot of TPR against FPR and color-code each point based on the corresponding profit. (5 points)

5c)

Plot the Precision-Recall curve for the model you trained in Question 3. Mark the decision threshold you found in 3d as a dashed vertical line. Make sure you have the axis labels. (4 points)