Predicting Loan Status

This work is submitted as the final project for the ML course on Cognitive Class. The goal of this project is to demonstrate the use of several classification algorithms and to generate a table of metrics for all of the classification models used. A detailed description of the dataset, along with the entire pipeline, can be found further below.

First, I show a snapshot of the code necessary to derive the output, without any comments, print statements, or other visualizations. Later, you can see the entire code explained in more detail.

Please use the toggle On/Off to see or hide code as you like.

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:
In [2]:
import itertools
import numpy as np
import pylab as pl
import pandas as pd
import scipy.optimize as opt
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

train_file_loc = 'C:\\Users\\Shaf\\Downloads\\loan_train.csv'
test_file_loc = 'C:\\Users\\Shaf\\Downloads\\loan_test.csv'
In [3]:
def best_classification_model(train_file, test_file):
    loans_df = pd.read_csv(train_file)
    test_df = pd.read_csv(test_file)
    k = 2
    loans_df['due_date'] = pd.to_datetime(loans_df['due_date'])
    loans_df['effective_date'] = pd.to_datetime(loans_df['effective_date'])
    loans_df['dayofweek'] = loans_df['effective_date'].dt.dayofweek
    loans_df['weekend'] = loans_df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
    loans_df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
    loans_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
    loans_df.groupby(['education'])['loan_status'].value_counts(normalize=True)
    Feature_train = loans_df[['Principal','terms','age','Gender','weekend']]
    Feature_train = pd.concat([Feature_train,pd.get_dummies(loans_df['education'])], axis=1)
    Feature_train.drop(['Master or Above'], axis = 1,inplace=True)
    X_train = Feature_train
    y_train = loans_df['loan_status'].values
    X_train = preprocessing.StandardScaler().fit(X_train).transform(X_train)
    test_df['due_date'] = pd.to_datetime(test_df['due_date'])
    test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
    test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
    test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
    test_df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
    test_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
    test_df.groupby(['education'])['loan_status'].value_counts(normalize=True)
    Feature_test = test_df[['Principal','terms','age','Gender','weekend']]
    Feature_test = pd.concat([Feature_test,pd.get_dummies(test_df['education'])], axis=1)
    Feature_test.drop(['Master or Above'], axis = 1,inplace=True)
    X_test = Feature_test
    y_test = test_df['loan_status'].values
    X_test = preprocessing.StandardScaler().fit(X_test).transform(X_test)

    neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
    loanTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4).fit(X_train,y_train)
    clf = svm.SVC(kernel='rbf').fit(X_train, y_train)
    LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)

    models = [neigh, loanTree, clf, LR]
    modelnames = ['KNN', 'Decision Tree', 'SVM', 'Logistic Regression']
    data = []
    for model in models:
        if model != LR:
            yhat = model.predict(X_test)
            Jaccard = jaccard_similarity_score(y_test,yhat)
            F1 = f1_score(y_test, yhat, average='weighted')
            data.append([Jaccard, F1, "NaN"])
        else:
            yhat = model.predict(X_test)
            yhat_prob = LR.predict_proba(X_test)
            Jaccard = jaccard_similarity_score(y_test,yhat)
            F1 = f1_score(y_test, yhat, average='weighted')
            LogLoss = log_loss(y_test, yhat_prob)
            data.append([Jaccard, F1, LogLoss])
    final_predictions_df_new = pd.DataFrame(data, index = modelnames, columns=['Jaccard', 'F1_score','LogLoss'])
    max_Jaccard = final_predictions_df_new.Jaccard.idxmax()
    max_F1 = final_predictions_df_new.F1_score.idxmax()
    print(final_predictions_df_new, f'Model with the best Jaccard Score is {max_Jaccard} and Model with highest F1-score is {max_F1}.', sep='\n')
In [4]:
best_classification_model (train_file_loc, test_file_loc)
                      Jaccard  F1_score   LogLoss
KNN                  0.574074  0.600278       NaN
Decision Tree        0.777778  0.728395       NaN
SVM                  0.722222  0.621266       NaN
Logistic Regression  0.740741  0.630418  0.556608
Model with the best Jaccard Score is Decision Tree and Model with highest F1-score is Decision Tree.

Please find the detailed code with comments, exploratory data analysis, feature engineering, data visualizations, detailed model metrics, and debugging below.

In [5]:
# Let's import the basic dependencies required to visualize the data and to preprocess it.
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

About the Dataset

This dataset is about past loans. The Loan_train.csv dataset includes details of 346 customers whose loans were already paid off or defaulted. It includes the following fields:

Field           Description
Loan_status     Whether a loan is paid off or in collection
Principal       Basic principal loan amount at origination
Terms           Origination terms, which can be a weekly (7 days), biweekly, or monthly payoff schedule
Effective_date  When the loan was originated and took effect
Due_date        Since it is a one-time payoff schedule, each loan has a single due date
Age             Age of the applicant
Education       Education of the applicant
Gender          Gender of the applicant

Let's load the dataset.

Load Data From CSV File

In [7]:
# If you do not have access to the dataset, it can be downloaded via the wget link in the Model Evaluation section below
loans_df = pd.read_csv('C:\\Users\\Shaf\\Downloads\\loan_train.csv')
In [8]:
# Let's inspect the first 5 rows of the data.
loans_df.head()
Out[8]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender
0 0 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male
1 2 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female
2 3 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male
3 4 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female
4 6 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male
In [9]:
# Let's inspect the shape of the data
loans_df.shape
Out[9]:
(346, 10)

Preprocessing of the data was done by the course instructor. I will keep it as is for now; I might do some more processing later to improve the prediction accuracy as needed.

Convert to date time object

In [10]:
# Let's convert two columns - due_date and effective_date - to datetime format to make them easier to work with later.
loans_df['due_date'] = pd.to_datetime(loans_df['due_date']) # convert due_date to datetime and overwrite
loans_df['effective_date'] = pd.to_datetime(loans_df['effective_date']) # convert effective_date to datetime and overwrite

# Let's take a look at the first 5 rows of the modified dataset
loans_df.head()
Out[10]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender
0 0 0 PAIDOFF 1000 30 2016-09-08 2016-10-07 45 High School or Below male
1 2 2 PAIDOFF 1000 30 2016-09-08 2016-10-07 33 Bechalor female
2 3 3 PAIDOFF 1000 15 2016-09-08 2016-09-22 27 college male
3 4 4 PAIDOFF 1000 30 2016-09-09 2016-10-08 28 college female
4 6 6 PAIDOFF 1000 30 2016-09-09 2016-10-08 29 college male

Data visualization and pre-processing.

Our target variable, 'loan_status', which we intend to predict, has two classes:

  1. PAIDOFF
  2. COLLECTION

Let’s look at the counts of each class to see if the classes are equally distributed.

In [11]:
# counts of each unique value in the column loan_status
loans_df['loan_status'].value_counts()
Out[11]:
PAIDOFF       260
COLLECTION     86
Name: loan_status, dtype: int64

260 people have paid off their loan on time, while 86 have gone into collection. Unfortunately, the data isn't distributed equally between the two classes.
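Since the classes are imbalanced, one common mitigation is class weighting, which many scikit-learn classifiers support via class_weight='balanced'. Below is a minimal sketch on synthetic stand-in data (the sizes and imbalance mimic loan_status, but these are not the actual loan features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with roughly the same 75/25 imbalance as loan_status
X_demo, y_demo = make_classification(n_samples=346, n_features=8,
                                     weights=[0.75, 0.25], random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency,
# so the minority class contributes as much to the loss as the majority class
clf_balanced = LogisticRegression(class_weight='balanced',
                                  solver='liblinear').fit(X_demo, y_demo)
acc = clf_balanced.score(X_demo, y_demo)
```

Resampling (over/undersampling) is another option, but class weighting requires no change to the data itself.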

Let's inspect the data using some visualizations to understand the distributions better.

In [12]:
# Let's import seaborn and visualize the distribution of the principal loan amounts by gender
import seaborn as sns

bins = np.linspace(loans_df.Principal.min(), loans_df.Principal.max(), 10)
g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()
In [13]:
# Let's visualize the age of the borrower by gender
bins = np.linspace(loans_df.age.min(), loans_df.age.max(), 10)
g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

Pre-processing: Feature selection/extraction

Let's look at the day of the week on which people get their loans.

In [14]:
# Let's visualize the day of the week when an individual gets a loan

# Let's first create the day of the week using effective_day column
loans_df['dayofweek'] = loans_df['effective_date'].dt.dayofweek

# Now let's create bins spanning the day-of-week range (0-6); np.linspace with 10 points gives 9 bins
bins = np.linspace(loans_df.dayofweek.min(), loans_df.dayofweek.max(), 10)


g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()

We can use the PairGrid method in seaborn to visualize the entire dataset instead. We can choose to look at bar plots, histograms, scatter plots, etc.

In [15]:
# Let's look at a pair grid plot to see the distributions of all variables in our dataframe.
g = sns.PairGrid(loans_df)
g = g.map(plt.bar)
In [16]:
g = sns.PairGrid(loans_df)
g = g.map(plt.scatter)
We can combine both types of charts. For instance, every chart on the diagonal can be a histogram and everything off the diagonal can be a scatter plot.
In [17]:
g = sns.PairGrid(loans_df)
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)

We can also use loan_status to highlight the two groups of interest.

In [18]:
g = sns.PairGrid(loans_df, hue="loan_status")
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = g.add_legend()

We see that people who get a loan at the end of the week often don't pay it off, so let's use feature binarization to set a threshold: loans with an effective day-of-week greater than 3 get a 'weekend' flag of 1.

In [19]:
loans_df['weekend'] = loans_df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
loans_df.head()
Out[19]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender dayofweek weekend
0 0 0 PAIDOFF 1000 30 2016-09-08 2016-10-07 45 High School or Below male 3 0
1 2 2 PAIDOFF 1000 30 2016-09-08 2016-10-07 33 Bechalor female 3 0
2 3 3 PAIDOFF 1000 15 2016-09-08 2016-09-22 27 college male 3 0
3 4 4 PAIDOFF 1000 30 2016-09-09 2016-10-08 28 college female 4 1
4 6 6 PAIDOFF 1000 30 2016-09-09 2016-10-08 29 college male 4 1

Convert Categorical features to numerical values

Lets look at gender:

In [20]:
loans_df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
Out[20]:
Gender  loan_status
female  PAIDOFF        0.865385
        COLLECTION     0.134615
male    PAIDOFF        0.731293
        COLLECTION     0.268707
Name: loan_status, dtype: float64

86% of females pay off their loans, while only 73% of males pay off theirs.

Let's convert male to 0 and female to 1:

In [21]:
loans_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
loans_df.head()
Out[21]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender dayofweek weekend
0 0 0 PAIDOFF 1000 30 2016-09-08 2016-10-07 45 High School or Below 0 3 0
1 2 2 PAIDOFF 1000 30 2016-09-08 2016-10-07 33 Bechalor 1 3 0
2 3 3 PAIDOFF 1000 15 2016-09-08 2016-09-22 27 college 0 3 0
3 4 4 PAIDOFF 1000 30 2016-09-09 2016-10-08 28 college 1 4 1
4 6 6 PAIDOFF 1000 30 2016-09-09 2016-10-08 29 college 0 4 1

One Hot Encoding

How about education?

In [22]:
loans_df.groupby(['education'])['loan_status'].value_counts(normalize=True)
Out[22]:
education             loan_status
Bechalor              PAIDOFF        0.750000
                      COLLECTION     0.250000
High School or Below  PAIDOFF        0.741722
                      COLLECTION     0.258278
Master or Above       COLLECTION     0.500000
                      PAIDOFF        0.500000
college               PAIDOFF        0.765101
                      COLLECTION     0.234899
Name: loan_status, dtype: float64

Features before One Hot Encoding

In [23]:
loans_df[['Principal','terms','age','Gender','education']].head()
Out[23]:
Principal terms age Gender education
0 1000 30 45 0 High School or Below
1 1000 30 33 1 Bechalor
2 1000 15 27 0 college
3 1000 30 28 1 college
4 1000 30 29 0 college

Let's use the one hot encoding technique to convert categorical variables to binary variables and append them to the feature DataFrame.

In [24]:
Feature = loans_df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(loans_df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()
Out[24]:
Principal terms age Gender weekend Bechalor High School or Below college
0 1000 30 45 0 0 0 1 0
1 1000 30 33 1 0 1 0 0
2 1000 15 27 0 0 0 0 1
3 1000 30 28 1 1 0 0 1
4 1000 30 29 0 1 0 0 1

Feature selection

Let's define our feature set, X:

In [25]:
X = Feature
X[0:5]
Out[25]:
Principal terms age Gender weekend Bechalor High School or Below college
0 1000 30 45 0 0 0 1 0
1 1000 30 33 1 0 1 0 0
2 1000 15 27 0 0 0 0 1
3 1000 30 28 1 1 0 0 1
4 1000 30 29 0 1 0 0 1

What are our labels?

In [26]:
y = loans_df['loan_status'].values
y[0:5]
Out[26]:
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
      dtype=object)

Normalize Data

Data standardization gives the data zero mean and unit variance (technically this should be done after the train/test split).

In [27]:
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
Out[27]:
array([[ 0.51578458,  0.92071769,  2.33152555, -0.42056004, -1.20577805,
        -0.38170062,  1.13639374, -0.86968108],
       [ 0.51578458,  0.92071769,  0.34170148,  2.37778177, -1.20577805,
         2.61985426, -0.87997669, -0.86968108],
       [ 0.51578458, -0.95911111, -0.65321055, -0.42056004, -1.20577805,
        -0.38170062, -0.87997669,  1.14984679],
       [ 0.51578458,  0.92071769, -0.48739188,  2.37778177,  0.82934003,
        -0.38170062, -0.87997669,  1.14984679],
       [ 0.51578458,  0.92071769, -0.3215732 , -0.42056004,  0.82934003,
        -0.38170062, -0.87997669,  1.14984679]])
In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
(276, 8) (276,)
(70, 8) (70,)
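As the caveat above notes, standardization should technically be fit on the training rows only and then applied to the test rows with the same statistics. A minimal sketch of that order of operations, using random stand-in data rather than the actual loan features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(346, 8))  # stand-in feature matrix

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=4)

scaler = StandardScaler().fit(X_tr)   # learn mean/std from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the training statistics on the test rows
```

Fitting the scaler before splitting (as done above) leaks a small amount of test-set information into the training data.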

Classification Modeling

Now it is your turn: use the training set to build an accurate model, then use the test set to report the accuracy of the model. You should use the following algorithms:

  • K Nearest Neighbor(KNN)
  • Decision Tree
  • Support Vector Machine
  • Logistic Regression

Notice:

  • You can go above and change the pre-processing, feature selection, feature-extraction, and so on, to make a better model.
  • You should use the scikit-learn, SciPy, or NumPy libraries for developing the classification algorithms.
  • You should include the code of the algorithm in the following cells.

K Nearest Neighbor(KNN)

Notice: You should find the best k to build the model with the best accuracy.
Warning: You should not use loan_test.csv for finding the best k; however, you can split loan_train.csv into train and test sets to find the best k.

In [29]:
from sklearn.neighbors import KNeighborsClassifier
# since we have two groups of interest (two classes), let's set our k to 2.
k = 2
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
Out[29]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=2, p=2,
                     weights='uniform')
In [30]:
yhat = neigh.predict(X_test)
yhat[0:5]
Out[30]:
array(['PAIDOFF', 'PAIDOFF', 'COLLECTION', 'PAIDOFF', 'PAIDOFF'],
      dtype=object)
In [31]:
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
Train set Accuracy:  0.8152173913043478
Test set Accuracy:  0.6714285714285714
In [35]:
# For good measure, let's inspect what happens for a range of k values.
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

    
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc

plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
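Note that the loop above scores each k against X_test, which the warning earlier advises against. A hedged alternative is to pick k by cross-validation on the training data alone; a sketch on synthetic stand-in data of the same size as our training split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in training data the same size as our 276-row training split
X_demo, y_demo = make_classification(n_samples=276, n_features=8, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X_demo, y_demo, cv=5).mean()
             for k in range(1, 10)}
best_k = max(cv_scores, key=cv_scores.get)  # k with the highest mean CV accuracy
```

This keeps the held-out test set untouched until the final evaluation.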

Decision Trees

In [36]:
from sklearn.tree import DecisionTreeClassifier
loanTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
loanTree # it shows the default parameters
Out[36]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [37]:
loanTree.fit(X_train,y_train)
Out[37]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [38]:
predLoanTree = loanTree.predict(X_test)
print (predLoanTree [0:5])
print (y_test [0:5])
['PAIDOFF' 'PAIDOFF' 'COLLECTION' 'PAIDOFF' 'PAIDOFF']
['PAIDOFF' 'PAIDOFF' 'PAIDOFF' 'PAIDOFF' 'PAIDOFF']
In [39]:
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test, predLoanTree))
DecisionTrees's Accuracy:  0.7
In [40]:
from sklearn.externals.six import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline


dot_data = StringIO()
filename = "loantree.png"
featureNames = Feature.columns[0:8]
targetNames = loans_df['loan_status'].unique().tolist()
out=tree.export_graphviz(loanTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_train), filled=True,  special_characters=True,rotate=False)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
Out[40]:
<matplotlib.image.AxesImage at 0x1fd35d9d508>

Support Vector Machine

In [41]:
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 
Out[41]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [42]:
yhat = clf.predict(X_test)
yhat [0:5]
Out[42]:
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
      dtype=object)
In [43]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
    
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=['PAIDOFF','COLLECTION'])
np.set_printoptions(precision=2)

print (classification_report(y_test, yhat))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['PAIDOFF','COLLECTION'],normalize= False,  title='Confusion matrix')
              precision    recall  f1-score   support

  COLLECTION       0.20      0.07      0.10        15
     PAIDOFF       0.78      0.93      0.85        55

    accuracy                           0.74        70
   macro avg       0.49      0.50      0.48        70
weighted avg       0.66      0.74      0.69        70

Confusion matrix, without normalization
[[51  4]
 [14  1]]
In [44]:
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted') 
Out[44]:
0.6892857142857144
In [45]:
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
Out[45]:
0.7428571428571429
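A version note: in newer scikit-learn releases, jaccard_similarity_score was removed in favor of jaccard_score, which for string labels requires naming the positive class explicitly. A small sketch, assuming a newer scikit-learn is installed:

```python
from sklearn.metrics import jaccard_score

y_true_demo = ['PAIDOFF', 'PAIDOFF', 'COLLECTION', 'PAIDOFF']
y_pred_demo = ['PAIDOFF', 'COLLECTION', 'COLLECTION', 'PAIDOFF']

# Jaccard = TP / (TP + FP + FN) for the named positive class
score = jaccard_score(y_true_demo, y_pred_demo, pos_label='PAIDOFF')  # 2 / 3
```

Unlike the old jaccard_similarity_score, this is a per-class score rather than plain accuracy.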

Logistic Regression

In [46]:
import pylab as pl
import scipy.optimize as opt
In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
Out[47]:
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
In [48]:
yhat = LR.predict(X_test)
yhat
Out[48]:
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'COLLECTION', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
      dtype=object)
In [49]:
yhat_prob = LR.predict_proba(X_test)
yhat_prob
Out[49]:
array([[0.36, 0.64],
       [0.27, 0.73],
       [0.47, 0.53],
       [0.31, 0.69],
       [0.36, 0.64],
       [0.45, 0.55],
       [0.5 , 0.5 ],
       [0.36, 0.64],
       [0.45, 0.55],
       [0.4 , 0.6 ],
       [0.46, 0.54],
       [0.48, 0.52],
       [0.43, 0.57],
       [0.43, 0.57],
       [0.36, 0.64],
       [0.5 , 0.5 ],
       [0.45, 0.55],
       [0.46, 0.54],
       [0.3 , 0.7 ],
       [0.37, 0.63],
       [0.38, 0.62],
       [0.32, 0.68],
       [0.3 , 0.7 ],
       [0.27, 0.73],
       [0.46, 0.54],
       [0.45, 0.55],
       [0.31, 0.69],
       [0.3 , 0.7 ],
       [0.34, 0.66],
       [0.29, 0.71],
       [0.36, 0.64],
       [0.31, 0.69],
       [0.32, 0.68],
       [0.46, 0.54],
       [0.36, 0.64],
       [0.5 , 0.5 ],
       [0.34, 0.66],
       [0.32, 0.68],
       [0.43, 0.57],
       [0.47, 0.53],
       [0.47, 0.53],
       [0.47, 0.53],
       [0.43, 0.57],
       [0.3 , 0.7 ],
       [0.33, 0.67],
       [0.44, 0.56],
       [0.3 , 0.7 ],
       [0.36, 0.64],
       [0.36, 0.64],
       [0.47, 0.53],
       [0.3 , 0.7 ],
       [0.5 , 0.5 ],
       [0.32, 0.68],
       [0.45, 0.55],
       [0.36, 0.64],
       [0.44, 0.56],
       [0.35, 0.65],
       [0.47, 0.53],
       [0.37, 0.63],
       [0.4 , 0.6 ],
       [0.48, 0.52],
       [0.46, 0.54],
       [0.34, 0.66],
       [0.28, 0.72],
       [0.34, 0.66],
       [0.33, 0.67],
       [0.28, 0.72],
       [0.34, 0.66],
       [0.32, 0.68],
       [0.45, 0.55]])
In [50]:
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
Out[50]:
0.8
In [51]:
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)
Out[51]:
0.5317819704092389

Model Evaluation using Test set

In [52]:
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

First, download and load the test set:

In [49]:
# !wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv

Load Test set for evaluation

In [53]:
test_df = pd.read_csv('C:\\Users\\Shaf\\Downloads\\loan_test.csv')
test_df.head()
test_df.columns

# Let's convert the data-time format of two columns - due_date and effective_date - to make it easier to work with later.
test_df['due_date'] = pd.to_datetime(test_df['due_date']) # convert datetime of due_data and overwrite
test_df['effective_date'] = pd.to_datetime(test_df['effective_date']) # convert datetime of effective_date and overwrite
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
test_df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
test_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
test_df.groupby(['education'])['loan_status'].value_counts(normalize=True)
Feature = test_df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(test_df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
X = Feature
y = test_df['loan_status'].values
X = preprocessing.StandardScaler().fit(X).transform(X)

# Let's take a look at the first 5 columns of the modified dataset
test_df.head()
Out[53]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender dayofweek weekend
0 1 1 PAIDOFF 1000 30 2016-09-08 2016-10-07 50 Bechalor 1 3 0
1 5 5 PAIDOFF 300 7 2016-09-09 2016-09-15 35 Master or Above 0 4 1
2 21 21 PAIDOFF 1000 30 2016-09-10 2016-10-09 43 High School or Below 1 5 1
3 24 24 PAIDOFF 1000 30 2016-09-10 2016-10-09 26 college 0 5 1
4 35 35 PAIDOFF 800 15 2016-09-11 2016-09-25 29 Bechalor 0 6 1
In [54]:
Feature.columns
Out[54]:
Index(['Principal', 'terms', 'age', 'Gender', 'weekend', 'Bechalor',
       'High School or Below', 'college'],
      dtype='object')
In [55]:
yhat_prob = LR.predict_proba(X)
# yhat_prob
In [56]:
models = [neigh, loanTree, clf, LR]
modelnames = ['KNN', 'Decision Tree', 'SVM', 'Logistic Regression']
data = []
for model in models:
    if model != LR:
        yhat = model.predict(X)
        Jaccard = jaccard_similarity_score(y,yhat)
        F1 = f1_score(y, yhat, average='weighted')
        data.append([Jaccard, F1, "NaN"])
    else:
        yhat = model.predict(X)
        Jaccard = jaccard_similarity_score(y,yhat)
        F1 = f1_score(y, yhat, average='weighted')
        LogLoss = log_loss(y, yhat_prob)
        data.append([Jaccard, F1, LogLoss])
final_predictions_df_new = pd.DataFrame(data, index = modelnames, columns=['Jaccard', 'F1-score','LogLoss'])
In [57]:
final_predictions_df_new
Out[57]:
Jaccard F1-score LogLoss
KNN 0.648148 0.633396 NaN
Decision Tree 0.722222 0.718793 NaN
SVM 0.703704 0.637860 NaN
Logistic Regression 0.740741 0.630418 0.571456

Report

You should be able to report the accuracy of the built model using different evaluation metrics:

Algorithm            Jaccard   F1-score   LogLoss
KNN                  ?         ?          NA
Decision Tree        ?         ?          NA
SVM                  ?         ?          NA
LogisticRegression   ?         ?          ?

We were able to output the results in the expected format. However, the work doesn't end here: ideally, I would improve the models by modifying the preprocessing steps.

Specifically, I would first look at the intercorrelations of the features to check whether all included variables provide unique predictive value; those with high shared variance would be dropped. I would test multiple values of k and tree depths to find the optimal settings. I would implement a cross-validation approach instead of a single train/test split, given the small size of our data. I would use approaches other than a standard scaler to normalize the data after checking for normal distributions. I would ensure the data has equal class sizes to minimize bias in the predictions, and I would create dummy variables for longer terms and higher principals based on the visualizations.
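As a sketch of two of these ideas, the tree depth (and similarly k) can be tuned with cross-validation via GridSearchCV; again on synthetic stand-in data with the same shape and imbalance as the loan set, not the actual features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=346, n_features=8,
                                     weights=[0.75, 0.25], random_state=0)

# 5-fold cross-validated search over candidate tree depths
search = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=0),
                      param_grid={'max_depth': list(range(2, 8))}, cv=5)
search.fit(X_demo, y_demo)
best_depth = search.best_params_['max_depth']
```

The same pattern extends to KNeighborsClassifier with a grid over n_neighbors.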

I will leave that work for a later time, hopefully, on a larger dataset.

In [ ]: