Appendix D — Best Kaggle submission (Spring 2023)

Author

Jack O’ Keefe

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from patsy import dmatrix
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.impute import KNNImputer
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_graphviz 
from six import StringIO
from IPython.display import Image  
import pydotplus
import time as tm
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.ensemble import BaggingRegressor,BaggingClassifier
import warnings
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, \
recall_score, accuracy_score, precision_score, confusion_matrix
from sklearn.ensemble import BaggingRegressor,BaggingClassifier,RandomForestRegressor,RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import IterativeImputer

D.1 How this model works

This model was created in order to introduce bias to the base models of the stacked model. Each of these models individually will not offer a great solution, but a stacked model with these as the base will allow the overall model to work off the strengths of the individual models.

Four feature sets were used: The top predictors from a random forest model, with the number of features being decided by a forward step-wise 5-fold cv process (not in this code), where the most significant model was added first and the RMSE on the train data was calculated from there.

The top predictors from a MARS model: These were the significant predictors in a MARS model where X predicts log_y (no cv and the code is not present as I could only get it to work on Google Collab).

Top 40 predictors from kbest: I wanted to add another predictor subset, but was sttuggling to think of other ways. K-best is a function from sklearn, and provided different features from the MARS and RF best predictors so I decided to use it.

The entire predictor set: This was used to train a catboost regressor on the entire dataset, which is relatively quick and provided higher accuracy for the meta model to learn from.

D.2 How to use the pipeline

rf_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('rf', rf_model)
])

The above code is used in the actual model, and is used as an example here. Pipeline is also a feature from sklearn that allows multiple functions to be executed in order. This can be used to streamline many processes.

In this case, there are two functions. The first is the ‘column_transformer,’ which looking back, I probably should have created a unique name for every transformation. This works by using ColumnTransformer, which takes the subset of columns specified, in this case X_top_kbest.columns. The ‘passthrough’ indicates that these are the columns to be worked on, with the remainder being dropped. ‘rf’ is the name in the pipeline I gave the rf_model. I gave unique names to every model depending on what subset they were working on. e.g. ‘rf1’ corresponds with the MARS predictors.

D.3 What the model does overall

As mentioned, the purpose of the model is to introduce bias to the base models, which will be overcome by the meta model. A good way to introduct bias is by using different predictors, and different models, which is what I did. I suspect that using other, more different models could work well, but I was not succesful in my attempts. This is a stacking regressor, which is different from a voting one. The difference is that there is a ‘meta model’ that corrects the error for the base models, by learning from the RMSE (or another specified loss function) on the base models. The meta model regressor I used was catboost, as it is the most accurate (I tried many other ways, and I thought that maybe ridge regression would work well, but it did not).

The specifics of the model are as follows: (Note, I have lightboost set up to be used in the model, but removing it actually made the model better. I suspect this is because catboost is a very similar model, but provides a lower RMSE)

The same, untuned models are run on each predictor subset. The models are random forest, catboost, and adaboost. These are all ran on the MARS subset, the rf subset, and the kbest subset. The catboost model is the only model ran on the entire dataset. 15 fold cross validation is used, which takes around 20-30 minutes to run. This model is not too slow because most of the models are run only on 20-40 predictors. It seems that the higher the k-fold, the better the models, but this is not true as the model gets worse after 15-fold cv. I suspect this is because the model will be testing on very little data for each split, which will not help RMSE, after a point.

The exact same model was also ran on a 13-fold cv. The purpose of this was to once again introduce bias. This introduces bias because this will create two different models, each with innacuracies of their own. By themselves, these models are still extremely useful.

These models were then combined by simply averaging the output, giving a low RMSE.

D.4 These models are not precisely reproducable

Because these models are untuned, they will give slightly different output every time they are run. To combat this, the models could be tuned for reproducibility, but that could take a very long time. On my first try I got my best RMSE, so it does not take very long. In my case, I saved the model by using pickle, a library to save models. The exact model I used can be uploaded at the very bottom of this under the ‘upload’ section.

D.5 Train Data Cleaning

train = pd.read_csv('train.csv')

y = train.y
x = train.drop("y", axis=1)
x = x.drop("id", axis=1)

x = x.apply(pd.to_numeric, errors='coerce')

y = y.apply(pd.to_numeric, errors='coerce')

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)

imputer = KNNImputer(n_neighbors=7)

X = pd.DataFrame(imputer.fit_transform(X_scaled), columns=x.columns)

log_y = np.log(y)

D.6 Get top predictors from untuned random forest

model = RandomForestRegressor(n_estimators = 100)
model.fit(X,log_y)

top_50_rf = model.feature_importances_.argsort()[-50:][::-1]

top_50_rf = pd.DataFrame(top_50_rf)

top_50_rf.columns = ['predictor']

D.7 Computed the best MARS features on google collab and am importing it as a csv

MARS_features = pd.read_csv('features.csv')

filtered_features = top_50_rf[~top_50_rf['predictor'].isin(MARS_features)]
new_df = filtered_features[['predictor']]

D.8 Above code should only take predictors from the random forest subset that are not present in MARS, I am not sure if it worked, however, so there might be some overlap

rf_top_30 = new_df[-30:][::-1]

top_predictors = rf_top_30['predictor']
X_top_rf = X.iloc[:,top_predictors]

top_predictors_MARS = MARS_features['predictor']
X_top_MARS = X[top_predictors_MARS]

D.9 Get further subset of features by selecting kbest

from sklearn.feature_selection import SelectKBest, f_regression

# Assuming X is your feature set and log_y is your target variable
selector = SelectKBest(f_regression, k=40)
X_new = selector.fit_transform(X, log_y)

# Get the indices sorted by most important to least important
indices = np.argsort(selector.scores_)[::-1]

# Get the names of the top 40 features
X_top_kbest = []
for i in range(40):
    X_top_kbest.append(X.columns[indices[i]])
    
X_top_kbest = X[X_top_kbest]

D.10 Test Cleaning

test = pd.read_csv('test.csv')

test_x = test.drop("id", axis=1)
test_x = test_x.apply(pd.to_numeric, errors='coerce')

scaler = StandardScaler()
test_X_scaled = pd.DataFrame(scaler.fit_transform(test_x), columns=test_x.columns)

imputer_KNN = KNNImputer(n_neighbors=7)
test_x = pd.DataFrame(imputer_KNN.fit_transform(test_X_scaled), columns=test_x.columns)

D.11 Create Base Models

from catboost import CatBoostRegressor
cat = CatBoostRegressor(verbose=False, random_seed=403)

rf_model = RandomForestRegressor()

from lightgbm import LGBMRegressor
light_model = LGBMRegressor(verbose=-1)

from sklearn.ensemble import BaggingRegressor,BaggingClassifier,AdaBoostRegressor,AdaBoostClassifier
ada_model = AdaBoostRegressor()

D.12 First Model

X_train, X_test, y_train_log, y_test_log = train_test_split(X, log_y, test_size = 0.2, random_state = 8)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np

#kbest
cat_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('cat', cat)
])

rf_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('rf', rf_model)
])

ada_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('ada', ada_model)
])

light_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('light', light_model)
])

cat_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('cat1', cat)
])

rf_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('rf1', rf_model)
])

ada_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('ada1', ada_model)
])

light_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('light1', light_model)
])

cat_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('cat2', cat)
])

rf_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('rf2', rf_model)
])

ada_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('ada2', ada_model)
])

light_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('light2', light_model)
])

cat_pipe3 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform3', 'passthrough', X.columns)], remainder='drop')),
    ('cat3', cat)
])

en_new = StackingRegressor(estimators = [('cat', cat_pipe),('rf', rf_pipe),('ada', ada_pipe), 
                                     ('cat1', cat_pipe1),('rf1', rf_pipe1),('ada1', ada_pipe1),
                                     ('cat2', cat_pipe2),('rf2', rf_pipe2),('ada2', ada_pipe2),
                                     ('cat3', cat_pipe3)],
                     final_estimator=CatBoostRegressor(),                                          
                    cv = KFold(n_splits = 15, shuffle = True, random_state=1))

en_new.fit(X_train, y_train_log)

D.13 Add intercept because model underestimates

new_intercept = np.mean(np.exp(y_test_log) - np.exp(en_new.predict(X_test)))

en_new.fit(X, log_y)

s = pd.DataFrame({'id':test.iloc[:, 0], "y":np.exp(en_new.predict(test_x)) + new_intercept})

D.14 Second Model

X_train, X_test, y_train_log, y_test_log = train_test_split(X, log_y, test_size = 0.2, random_state = 8)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np

#kbest
cat_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('cat', cat)
])

rf_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('rf', rf_model)
])

ada_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('ada', ada_model)
])

light_pipe = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform', 'passthrough', X_top_kbest.columns)], remainder='drop')),
    ('light', light_model)
])

cat_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('cat1', cat)
])

rf_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('rf1', rf_model)
])

ada_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('ada1', ada_model)
])

light_pipe1 = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform1', 'passthrough', X_top_MARS.columns)], remainder='drop')),
    ('light1', light_model)
])

cat_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('cat2', cat)
])

rf_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('rf_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('rf2', rf_model)
])

ada_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('ada_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('ada2', ada_model)
])

light_pipe2 = Pipeline([
    ('column_transformer', ColumnTransformer([('light_transform2', 'passthrough', X_top_rf.columns)], remainder='drop')),
    ('light2', light_model)
])

cat_pipe3 = Pipeline([
    ('column_transformer', ColumnTransformer([('cat_transform3', 'passthrough', X.columns)], remainder='drop')),
    ('cat3', cat)
])

winner = StackingRegressor(estimators = [('cat', cat_pipe),('rf', rf_pipe),('ada', ada_pipe), 
                                     ('cat1', cat_pipe1),('rf1', rf_pipe1),('ada1', ada_pipe1),
                                     ('cat2', cat_pipe2),('rf2', rf_pipe2),('ada2', ada_pipe2),
                                     ('cat3', cat_pipe3)],
                     final_estimator=CatBoostRegressor(),                                          
                    cv = KFold(n_splits = 13, shuffle = True, random_state=1))

winner.fit(X_train, y_train_log)

winner_intercept = np.mean(np.exp(y_test_log) - np.exp(winner.predict(X_test)))

winner.fit(X, log_y)

game = pd.DataFrame({'id':test.iloc[:, 0], "y":np.exp(winner.predict(test_x)) + winner_intercept})

D.15 Create en ensemble by combining the models

x = game.copy()
x.y = s.y*.5 + game.y*.5

x.to_csv('xgame1.csv', index=False)