This work is submitted as the final project for the Machine Learning course on Cognitive AI. The goal of this project is to demonstrate the usage of various classification algorithms and to generate a table of metrics for all the classification models used. A detailed description of the dataset, alongside the entire pipeline, can be found further below.
First, I show a snapshot of the code necessary to derive the output, without any comments, print statements, or other visualizations. Later, you can see the entire code explained in more detail.
Please use the On/Off toggle to show or hide the code as you like.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
import itertools
import numpy as np
import pylab as pl
import pandas as pd
import scipy.optimize as opt
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
train_file_loc = 'C:\\Users\\Shaf\\Downloads\\loan_train.csv'
test_file_loc = 'C:\\Users\\Shaf\\Downloads\\loan_test.csv'
def best_classification_model(train_file, test_file):
    # Load the training and test sets from the paths passed in
    loans_df = pd.read_csv(train_file)
    test_df = pd.read_csv(test_file)
    k = 2
    # --- Preprocess the training set ---
    loans_df['due_date'] = pd.to_datetime(loans_df['due_date'])
    loans_df['effective_date'] = pd.to_datetime(loans_df['effective_date'])
    loans_df['dayofweek'] = loans_df['effective_date'].dt.dayofweek
    loans_df['weekend'] = loans_df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
    loans_df['Gender'].replace(to_replace=['male', 'female'], value=[0, 1], inplace=True)
    Feature_train = loans_df[['Principal', 'terms', 'age', 'Gender', 'weekend']]
    Feature_train = pd.concat([Feature_train, pd.get_dummies(loans_df['education'])], axis=1)
    Feature_train.drop(['Master or Above'], axis=1, inplace=True)
    X_train = Feature_train
    y_train = loans_df['loan_status'].values
    X_train = preprocessing.StandardScaler().fit(X_train).transform(X_train)
    # --- Preprocess the test set with the same steps ---
    test_df['due_date'] = pd.to_datetime(test_df['due_date'])
    test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
    test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
    test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
    test_df['Gender'].replace(to_replace=['male', 'female'], value=[0, 1], inplace=True)
    Feature_test = test_df[['Principal', 'terms', 'age', 'Gender', 'weekend']]
    Feature_test = pd.concat([Feature_test, pd.get_dummies(test_df['education'])], axis=1)
    Feature_test.drop(['Master or Above'], axis=1, inplace=True)
    X_test = Feature_test
    y_test = test_df['loan_status'].values
    X_test = preprocessing.StandardScaler().fit(X_test).transform(X_test)
    # --- Fit the four classifiers ---
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    loanTree = DecisionTreeClassifier(criterion="entropy", max_depth=4).fit(X_train, y_train)
    clf = svm.SVC(kernel='rbf').fit(X_train, y_train)
    LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)
    models = [neigh, loanTree, clf, LR]
    modelnames = ['KNN', 'Decision Tree', 'SVM', 'Logistic Regression']
    data = []
    for model in models:
        yhat = model.predict(X_test)
        Jaccard = jaccard_similarity_score(y_test, yhat)
        F1 = f1_score(y_test, yhat, average='weighted')
        if model is LR:
            # Log loss needs class probabilities, which only LR provides here
            yhat_prob = LR.predict_proba(X_test)
            LogLoss = log_loss(y_test, yhat_prob)
            data.append([Jaccard, F1, LogLoss])
        else:
            data.append([Jaccard, F1, np.nan])
    final_predictions_df_new = pd.DataFrame(data, index=modelnames,
                                            columns=['Jaccard', 'F1_score', 'LogLoss'])
    # idxmax returns the row label (model name) with the highest score
    max_Jaccard = final_predictions_df_new.Jaccard.idxmax()
    max_F1 = final_predictions_df_new.F1_score.idxmax()
    print(final_predictions_df_new,
          f'Model with the best Jaccard score is {max_Jaccard} and '
          f'the model with the highest F1-score is {max_F1}.', sep='\n')

best_classification_model(train_file_loc, test_file_loc)
# Let's import the basic dependencies required to visualize the data and to preprocess it.
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline
This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loans were already paid off or defaulted. It includes the following fields:
| Field | Description |
|---|---|
| Loan_status | Whether a loan is paid off or in collection |
| Principal | Basic principal loan amount at origination |
| Terms | Origination terms, which can be a weekly (7-day), biweekly, or monthly payoff schedule |
| Effective_date | When the loan was originated and took effect |
| Due_date | Since it's a one-time payoff schedule, each loan has a single due date |
| Age | Age of applicant |
| Education | Education of applicant |
| Gender | The gender of applicant |
Let's load the dataset.
# Adjust the path below to your local copy of the dataset
loans_df = pd.read_csv('C:\\Users\\Shaf\\Downloads\\loan_train.csv')
# Let's inspect the first 5 rows of the data.
loans_df.head()
# Let's inspect the shape of the data
loans_df.shape
Preprocessing of the data was done by the course instructor. I will keep it as is for now, and may do more processing later to improve the prediction accuracies as needed.
# Let's convert the due_date and effective_date columns to datetime format to make them easier to work with later.
loans_df['due_date'] = pd.to_datetime(loans_df['due_date']) # convert due_date to datetime and overwrite
loans_df['effective_date'] = pd.to_datetime(loans_df['effective_date']) # convert effective_date to datetime and overwrite
# Let's take a look at the first 5 rows of the modified dataset
loans_df.head()
Our target variable, 'loan_status', which we intend to predict, has two classes.
Let's look at the counts of each class to see whether the classes are equally distributed.
# counts of each unique value in the column loan_status
loans_df['loan_status'].value_counts()
260 people have paid off the loan on time while 86 have gone into collection. Unfortunately, the data is not distributed equally between the two classes.
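To make the imbalance concrete, here is a minimal sketch on synthetic labels mirroring the 260/86 split seen above (the `labels` series is hypothetical stand-in data, not the real dataset). It also shows that passing `stratify` to `train_test_split` preserves the class ratio in each split, which matters for an imbalanced dataset like this one:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels mirroring the 260 PAIDOFF / 86 COLLECTION split above
labels = pd.Series(['PAIDOFF'] * 260 + ['COLLECTION'] * 86)
print(labels.value_counts(normalize=True))  # roughly 0.75 vs 0.25

# A stratified split preserves the class ratio in both train and test sets
X_dummy = labels.index.to_frame()  # placeholder features
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, labels, test_size=0.2, stratify=labels, random_state=0)
print(y_te.value_counts(normalize=True))  # ratio preserved (~0.75/0.25)
```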
Let's inspect the data using some visualizations to understand the distributions better.
# Let's import seaborn and visualize the distribution of the principal loan amounts by gender
import seaborn as sns
bins = np.linspace(loans_df.Principal.min(), loans_df.Principal.max(), 10)
g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
# Let's visualize the age of the borrower by gender
bins = np.linspace(loans_df.age.min(), loans_df.age.max(), 10)
g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
# Let's visualize the day of the week when an individual gets a loan
# Let's first create the day of the week using effective_day column
loans_df['dayofweek'] = loans_df['effective_date'].dt.dayofweek
# Now let's create bins for each of the days (0-6, total 7, 1 for each day of the week)
bins = np.linspace(loans_df.dayofweek.min(), loans_df.dayofweek.max(), 10)
g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
We can use the PairGrid class in seaborn to visualize the entire dataset instead. We can choose to look at bar plots, histograms, scatter plots, etc.
# Let's look at a pair grid plot to see the distributions of all variables in our dataframe.
g = sns.PairGrid(loans_df)
g = g.map(plt.bar)
g = sns.PairGrid(loans_df)
g = g.map(plt.scatter)
g = sns.PairGrid(loans_df)
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = sns.PairGrid(loans_df, hue="loan_status")
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = g.add_legend()
We see that people who get the loan at the end of the week tend not to pay it off, so let's use feature binarization to create a 'weekend' flag, with a threshold at day 4 (days greater than 3 become 1):
loans_df['weekend'] = loans_df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
loans_df.head()
Let's look at gender:
loans_df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
86% of females pay off their loans, while only 73% of males do.
Let's convert male to 0 and female to 1:
loans_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
loans_df.head()
loans_df.groupby(['education'])['loan_status'].value_counts(normalize=True)
loans_df[['Principal','terms','age','Gender','education']].head()
Feature = loans_df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(loans_df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()
Let's define the feature set, X:
X = Feature
X[0:5]
What are our labels?
y = loans_df['loan_status'].values
y[0:5]
Data standardization gives the data zero mean and unit variance. (Technically, this should be done after the train/test split, so that test-set statistics do not leak into the scaler.)
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
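Since the note above points out that scaling should technically follow the train/test split, here is a minimal sketch of the leakage-free version. The data (`X_demo`, `y_demo`) is synthetic, standing in for the feature matrix; the key point is that the scaler is fitted on the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix standing in for our Feature dataframe
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # no test statistics leak into the fit
```

The training data ends up with zero mean and unit variance, while the test data is transformed with the training statistics, mimicking how unseen data would be handled in production.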
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
Now we use the training set to build the models, and then use the test set to report their accuracy, using the following algorithms:
from sklearn.neighbors import KNeighborsClassifier
# since we have two groups of interest (two classes), let's set our k to 2.
k = 2
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
yhat = neigh.predict(X_test)
yhat[0:5]
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
# For good measure, let's inspect what happens when we vary k.
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1, Ks):
    # Train the model with k = n and predict on the test set
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])
mean_acc
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
from sklearn.tree import DecisionTreeClassifier
loanTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
loanTree # it shows the default parameters
loanTree.fit(X_train,y_train)
predLoanTree = loanTree.predict(X_test)
print (predLoanTree [0:5])
print (y_test [0:5])
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test, predLoanTree))
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn versions
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline
dot_data = StringIO()
filename = "loantree.png"  # output image for the rendered tree
featureNames = Feature.columns[0:8]
targetNames = loans_df['loan_status'].unique().tolist()
out=tree.export_graphviz(loanTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_train), filled=True, special_characters=True,rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
yhat = clf.predict(X_test)
yhat [0:5]
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=['PAIDOFF','COLLECTION'])
np.set_printoptions(precision=2)
print (classification_report(y_test, yhat))
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['PAIDOFF','COLLECTION'],normalize= False, title='Confusion matrix')
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
import pylab as pl
import scipy.optimize as opt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
yhat = LR.predict(X_test)
yhat
yhat_prob = LR.predict_proba(X_test)
yhat_prob
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
First, download and load the test set:
# !wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv
test_df = pd.read_csv('C:\\Users\\Shaf\\Downloads\\loan_test.csv')
test_df.head()
test_df.columns
# Let's convert the due_date and effective_date columns to datetime format to make them easier to work with later.
test_df['due_date'] = pd.to_datetime(test_df['due_date']) # convert due_date to datetime and overwrite
test_df['effective_date'] = pd.to_datetime(test_df['effective_date']) # convert effective_date to datetime and overwrite
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
test_df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
test_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
test_df.groupby(['education'])['loan_status'].value_counts(normalize=True)
Feature = test_df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(test_df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
X = Feature
y = test_df['loan_status'].values
X = preprocessing.StandardScaler().fit(X).transform(X)
# Let's take a look at the first 5 rows of the modified test set
test_df.head()
Feature.columns
yhat_prob = LR.predict_proba(X)
# yhat_prob
models = [neigh, loanTree, clf, LR]
modelnames = ['KNN', 'Decision Tree', 'SVM', 'Logistic Regression']
data = []
for model in models:
    yhat = model.predict(X)
    Jaccard = jaccard_similarity_score(y, yhat)
    F1 = f1_score(y, yhat, average='weighted')
    if model is LR:
        # Log loss needs predicted probabilities, computed above for LR only
        LogLoss = log_loss(y, yhat_prob)
        data.append([Jaccard, F1, LogLoss])
    else:
        data.append([Jaccard, F1, np.nan])
final_predictions_df_new = pd.DataFrame(data, index = modelnames, columns=['Jaccard', 'F1-score','LogLoss'])
final_predictions_df_new
You should be able to report the accuracy of the built models using different evaluation metrics:
| Algorithm | Jaccard | F1-score | LogLoss |
|---|---|---|---|
| KNN | ? | ? | NA |
| Decision Tree | ? | ? | NA |
| SVM | ? | ? | NA |
| LogisticRegression | ? | ? | ? |
There are several ways this pipeline could be improved. Specifically, I would:

- First look at the intercorrelations of the features, to check whether all included variables provide unique predictive value, and drop those with high shared variance.
- Test multiple values of k and tree depths to find the settings that optimize learning.
- Implement a cross-validation approach instead of a single train/test split, given the small sample size of our data.
- Check the features for normal distributions and consider scalers other than the standard scaler.
- Ensure that the data has equal class sizes, to minimize bias in the predictions.
- Create dummy variables for longer terms and higher principals, based on the visualizations.
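As a taste of the cross-validation idea, here is a minimal sketch on synthetic data. The dataset from `make_classification` is a hypothetical stand-in for the loan features (matching its 346 samples, 8 features, and roughly 75/25 class split); stratified folds keep the class ratio in every fold, which suits a small, imbalanced dataset better than a single split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Hypothetical data standing in for the loan feature set
X_demo, y_demo = make_classification(n_samples=346, n_features=8,
                                     weights=[0.75, 0.25], random_state=4)

# 5-fold stratified CV: every fold preserves the ~75/25 class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)
scores = cross_val_score(LogisticRegression(C=0.01, solver='liblinear'),
                         X_demo, y_demo, cv=cv)
print(scores.mean(), scores.std())  # average accuracy and its spread
```

Reporting the mean and spread over five folds gives a more stable accuracy estimate than any single 80/20 split on 346 rows.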
I will leave that work for a later time, hopefully, on a larger dataset.