10  Alternative solutions by students

This chapter consists of solutions suggested by students of the STAT303-1 Fall 2022 and Fall 2023 classes that are more efficient than those presented in the original version of the book.

10.1 Missing value imputation based on correlated variables in the data

The code below refers to Section 7.1.5.3 of the book. Students have proposed several ways to avoid the explicit for loop, mostly through vectorized pandas and NumPy operations, thereby saving execution time.

10.1.1 Original code

start_time = tm.time()
gdp_imputed_data = gdp_missing_values_data.copy()
for cont in avg_gdpPerCapita.index:
    gdp_imputed_data.loc[(gdp_imputed_data.continent==cont) & (gdp_imputed_data.gdpPerCapita.isnull()),
                     'gdpPerCapita']=avg_gdpPerCapita[cont]
plot_actual_vs_predicted()
print("Time taken to execute code: ",np.round(tm.time()-start_time,3), "seconds")
RMSE= 25473.20645170116
Time taken to execute code:  0.061 seconds

Below are some more efficient ways to impute the missing values; each avoids the explicit for loop.
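To see where the time savings come from, here is a self-contained sketch that times a loop-based group-mean imputation against a vectorized one. The data is made up, and the names rng, df, group, value and group_means are illustrative only; they are not part of the book's GDP dataset.

import numpy as np
import pandas as pd
import time as tm

# Made-up data: 100,000 rows, 6 groups, roughly 10% missing values
rng = np.random.default_rng(0)
df = pd.DataFrame({'group': rng.choice(list('ABCDEF'), size=100_000),
                   'value': rng.normal(50, 10, size=100_000)})
df.loc[rng.random(100_000) < 0.1, 'value'] = np.nan
group_means = df.groupby('group')['value'].mean()

# Loop-based imputation: one boolean mask and .loc assignment per group
start_time = tm.time()
looped = df.copy()
for g in group_means.index:
    looped.loc[(looped.group == g) & (looped.value.isnull()), 'value'] = group_means[g]
print("Loop:       ", np.round(tm.time() - start_time, 3), "seconds")

# Vectorized imputation: a single fillna with a mapped lookup of the group means
start_time = tm.time()
vectorized = df.copy()
vectorized['value'] = vectorized['value'].fillna(vectorized['group'].map(group_means))
print("Vectorized: ", np.round(tm.time() - start_time, 3), "seconds")

The vectorized version builds all replacement values in a single pass with map() and fillna(), whereas the loop constructs a boolean mask for every group, so the gap widens as the number of groups grows.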

10.1.2 Alternative code 1:

By Ryan Yi

start_time = tm.time()
gdp_imputed_data = gdp_missing_values_data.copy()
gdp_imputed_data.gdpPerCapita = \
gdp_imputed_data['gdpPerCapita'].fillna(gdp_imputed_data['continent'].map(avg_gdpPerCapita))
plot_actual_vs_predicted()
print("Time taken to execute code: ",np.round(tm.time()-start_time,3), "seconds")
RMSE= 25473.20645170116
Time taken to execute code:  0.023 seconds

10.1.3 Alternative code 2:

By Victoria Shi

start_time = tm.time()
gdp_imputed_data = gdp_missing_values_data.copy()
gdp_imputed_data["gdpPerCapita"] = \
gdp_missing_values_data[['continent','gdpPerCapita']].groupby('continent').transform(lambda x: x.fillna(x.mean()))
plot_actual_vs_predicted()
print("Time taken to execute code: ",np.round(tm.time()-start_time,3), "seconds")
RMSE= 25473.20645170116
Time taken to execute code:  0.026 seconds
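A closely related variant, sketched below (not from a student submission), uses the built-in 'mean' aggregation in transform instead of a lambda, which avoids calling a Python function once per group and fills only the gdpPerCapita column:

gdp_imputed_data = gdp_missing_values_data.copy()

# Group means via the built-in 'mean' transform; fillna keeps the observed values untouched
gdp_imputed_data['gdpPerCapita'] = gdp_imputed_data['gdpPerCapita'].fillna(
    gdp_imputed_data.groupby('continent')['gdpPerCapita'].transform('mean'))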

10.1.4 Alternative code 3:

By Elijah Nacar

start_time = tm.time()
gdp_imputed_data = gdp_missing_values_data.copy()
gdp_imputed_data.gdpPerCapita = gdp_imputed_data.apply(
    lambda x: avg_gdpPerCapita[x['continent']] if pd.isnull(x['gdpPerCapita']) else x['gdpPerCapita'], axis=1)
plot_actual_vs_predicted()
print("Time taken to execute code: ",np.round(tm.time()-start_time,3), "seconds")
RMSE= 25473.20645170116
Time taken to execute code:  0.023 seconds

10.1.5 Alternative code 4:

By Erica Zhou

start_time = tm.time()
gdp_imputed_data = gdp_missing_values_data.copy()

# Impute only the rows where gdpPerCapita is missing; non-missing values fall through to the default
missing = gdp_imputed_data['gdpPerCapita'].isnull()

cond_list = [missing & (gdp_imputed_data['continent'] == 'Africa'), missing & (gdp_imputed_data['continent'] == 'Asia'),
             missing & (gdp_imputed_data['continent'] == 'Europe'), missing & (gdp_imputed_data['continent'] == 'North America'),
             missing & (gdp_imputed_data['continent'] == 'Oceania'), missing & (gdp_imputed_data['continent'] == 'South America')]

choice_list = list(avg_gdpPerCapita)

gdp_imputed_data.gdpPerCapita = np.select(cond_list, choice_list, gdp_imputed_data.gdpPerCapita)
plot_actual_vs_predicted()
print("Time taken to execute code: ",np.round(tm.time()-start_time,3), "seconds")
RMSE= 25473.20645170116
Time taken to execute code:  0.022 seconds
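One caveat with this approach: choice_list = list(avg_gdpPerCapita) relies on the index of avg_gdpPerCapita being in the same (alphabetical) order as the hand-written conditions. A slightly safer variant, sketched below under the assumption that avg_gdpPerCapita is a Series indexed by continent name, builds both lists from the Series itself so the pairing cannot drift:

# Build the condition list directly from the index of avg_gdpPerCapita,
# so each continent stays paired with its own average by construction
missing = gdp_imputed_data['gdpPerCapita'].isnull()
cond_list = [missing & (gdp_imputed_data['continent'] == cont) for cont in avg_gdpPerCapita.index]
choice_list = list(avg_gdpPerCapita)
gdp_imputed_data.gdpPerCapita = np.select(cond_list, choice_list, gdp_imputed_data.gdpPerCapita)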

10.2 Binning with equal sized bins

The code below refers to Section 7.2.2 of the book. A student has proposed a way to avoid the explicit for loop in the code by using groupby(), which makes the code more concise and slightly faster.

10.2.1 Original code (in the book)

#Bootstrapping to find 95% confidence intervals of Graduation Rate of US universities based on average expenditure per student
start_time = tm.time()
for expend_bin in college.Expend_bin.unique():
    data_sub = college.loc[college.Expend_bin==expend_bin,:]
    samples = np.random.choice(data_sub['Grad.Rate'], size=(10000,data_sub.shape[0]))
    print("95% Confidence interval of Grad.Rate for "+expend_bin+" universities = ["+
          str(np.round(np.percentile(samples.mean(axis=1),2.5),2))+","+
          str(np.round(np.percentile(samples.mean(axis=1),97.5),2))+"]")
print("Time taken to execute code: ",np.round(tm.time()-start_time,3), "seconds")
95% Confidence interval of Grad.Rate for Low expend universities = [55.31,59.35]
95% Confidence interval of Grad.Rate for High expend universities = [71.03,74.9]
95% Confidence interval of Grad.Rate for Med expend universities = [64.17,67.95]
Time taken to execute code:  0.151 seconds

10.2.2 Alternative code:

By Victoria Shi

start_time = tm.time()
def confidence_interval(df):
    samples = np.random.choice(df['Grad.Rate'], size=(10000, df.shape[0]))
    print("95% Confidence interval of Grad.Rate for "+df["Expend_bin"].iloc[0]+" universities = ["+
          str(np.round(np.percentile(samples.mean(axis=1),2.5),2))+","+
          str(np.round(np.percentile(samples.mean(axis=1),97.5),2))+"]")
college.groupby('Expend_bin').apply(confidence_interval)
print("Time taken to execute code: ",np.round(tm.time()-start_time,3), "seconds")
95% Confidence interval of Grad.Rate for Low expend universities = [55.35,59.35]
95% Confidence interval of Grad.Rate for Med expend universities = [64.16,67.95]
95% Confidence interval of Grad.Rate for High expend universities = [71.05,74.96]
Time taken to execute code:  0.139 seconds
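If the intervals are needed for further computation rather than just printing, the same groupby() idea can return the bounds instead. The sketch below is a variant, not from the book; the column names lower and upper are illustrative, and the bootstrap means are computed only once per bin:

def confidence_interval(df):
    # 10,000 bootstrap resamples of Grad.Rate for this bin, one resample per row
    samples = np.random.choice(df['Grad.Rate'], size=(10000, df.shape[0]))
    boot_means = samples.mean(axis=1)
    return pd.Series({'lower': np.round(np.percentile(boot_means, 2.5), 2),
                      'upper': np.round(np.percentile(boot_means, 97.5), 2)})

# One row per expenditure bin, with the 2.5th and 97.5th percentile bounds as columns
ci_table = college.groupby('Expend_bin').apply(confidence_interval)
print(ci_table)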