This work is submitted as the final project for the Machine Learning course on Cognitive AI. The goal of this project is to demonstrate the usage of various classification algorithms and to generate a table of metrics for all the classification models used. A detailed description of the dataset, alongside the entire pipeline, can be found further below.
First, I show a snapshot of the code necessary to derive the output, without any comments, print statements, or other visualizations. Later, you can see the entire code explained in more detail.
Please use the On/Off toggle to show or hide the code as you like.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
import itertools
import numpy as np
import pylab as pl
import pandas as pd
import scipy.optimize as opt
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
train_file_loc = 'C:\\Users\\Shaf\\Downloads\\loan_train.csv'
test_file_loc = 'C:\\Users\\Shaf\\Downloads\\loan_test.csv'
def best_classification_model(train_file, test_file):
    # Load the training and test sets from the paths passed in
    loans_df = pd.read_csv(train_file)
    test_df = pd.read_csv(test_file)
    k = 2
    # --- Preprocess the training set ---
    loans_df['due_date'] = pd.to_datetime(loans_df['due_date'])
    loans_df['effective_date'] = pd.to_datetime(loans_df['effective_date'])
    loans_df['dayofweek'] = loans_df['effective_date'].dt.dayofweek
    loans_df['weekend'] = loans_df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
    loans_df['Gender'].replace(to_replace=['male', 'female'], value=[0, 1], inplace=True)
    Feature_train = loans_df[['Principal', 'terms', 'age', 'Gender', 'weekend']]
    Feature_train = pd.concat([Feature_train, pd.get_dummies(loans_df['education'])], axis=1)
    Feature_train.drop(['Master or Above'], axis=1, inplace=True)
    X_train = Feature_train
    y_train = loans_df['loan_status'].values
    X_train = preprocessing.StandardScaler().fit(X_train).transform(X_train)
    # --- Preprocess the test set with the same steps ---
    test_df['due_date'] = pd.to_datetime(test_df['due_date'])
    test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
    test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
    test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
    test_df['Gender'].replace(to_replace=['male', 'female'], value=[0, 1], inplace=True)
    Feature_test = test_df[['Principal', 'terms', 'age', 'Gender', 'weekend']]
    Feature_test = pd.concat([Feature_test, pd.get_dummies(test_df['education'])], axis=1)
    Feature_test.drop(['Master or Above'], axis=1, inplace=True)
    X_test = Feature_test
    y_test = test_df['loan_status'].values
    X_test = preprocessing.StandardScaler().fit(X_test).transform(X_test)
    # --- Fit the four classifiers ---
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    loanTree = DecisionTreeClassifier(criterion="entropy", max_depth=4).fit(X_train, y_train)
    clf = svm.SVC(kernel='rbf').fit(X_train, y_train)
    LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)
    models = [neigh, loanTree, clf, LR]
    modelnames = ['KNN', 'Decision Tree', 'SVM', 'Logistic Regression']
    data = []
    for model in models:
        yhat = model.predict(X_test)
        Jaccard = jaccard_similarity_score(y_test, yhat)
        F1 = f1_score(y_test, yhat, average='weighted')
        if model is LR:
            # Log loss needs class probabilities, which only LR provides here
            yhat_prob = LR.predict_proba(X_test)
            LogLoss = log_loss(y_test, yhat_prob)
            data.append([Jaccard, F1, LogLoss])
        else:
            data.append([Jaccard, F1, np.nan])
    final_predictions_df_new = pd.DataFrame(data, index=modelnames,
                                            columns=['Jaccard', 'F1_score', 'LogLoss'])
    # idxmax returns the row label (model name) with the highest score
    max_Jaccard = final_predictions_df_new.Jaccard.idxmax()
    max_F1 = final_predictions_df_new.F1_score.idxmax()
    print(final_predictions_df_new,
          f'Model with the best Jaccard score is {max_Jaccard} and '
          f'the model with the highest F1-score is {max_F1}.', sep='\n')

best_classification_model(train_file_loc, test_file_loc)
# Let's import the basic dependencies required to visualize the data and to preprocess it.
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline
This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loans were already paid off or defaulted. It includes the following fields:
| Field | Description |
|---|---|
| Loan_status | Whether a loan is paid off or in collection |
| Principal | Basic principal loan amount at origination |
| Terms | Origination terms, which can be a weekly (7-day), biweekly, or monthly payoff schedule |
| Effective_date | When the loan was originated and took effect |
| Due_date | Since it's a one-time payoff schedule, each loan has a single due date |
| Age | Age of applicant |
| Education | Education of applicant |
| Gender | The gender of applicant |
Let's load the dataset.
# Adjust the path below to your local copy of the dataset
loans_df = pd.read_csv('C:\\Users\\Shaf\\Downloads\\loan_train.csv')
# Let's inspect the first 5 rows of the data.
loans_df.head()
# Let's inspect the shape of the data
loans_df.shape
Preprocessing of the data was done by the course instructor. I will keep it as is for now, and may do more processing later to improve the prediction accuracies as needed.
# Let's convert the due_date and effective_date columns to datetime format to make them easier to work with later.
loans_df['due_date'] = pd.to_datetime(loans_df['due_date']) # convert due_date to datetime and overwrite
loans_df['effective_date'] = pd.to_datetime(loans_df['effective_date']) # convert effective_date to datetime and overwrite
# Let's take a look at the first 5 rows of the modified dataset
loans_df.head()
Our target variable, 'loan_status', which we intend to predict, has two classes.
Let's look at the counts of each class to see whether the classes are equally distributed.
# counts of each unique value in the column loan_status
loans_df['loan_status'].value_counts()
260 people have paid off the loan on time while 86 have gone into collection. Unfortunately, the data is not distributed equally between the two classes.
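To make the imbalance concrete, here is a minimal sketch on synthetic labels mirroring the 260/86 split seen above (the `labels` series is hypothetical stand-in data, not the real dataset). It also shows that passing `stratify` to `train_test_split` preserves the class ratio in each split, which matters for an imbalanced dataset like this one:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels mirroring the 260 PAIDOFF / 86 COLLECTION split above
labels = pd.Series(['PAIDOFF'] * 260 + ['COLLECTION'] * 86)
print(labels.value_counts(normalize=True))  # roughly 0.75 vs 0.25

# A stratified split preserves the class ratio in both train and test sets
X_dummy = labels.index.to_frame()  # placeholder features
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, labels, test_size=0.2, stratify=labels, random_state=0)
print(y_te.value_counts(normalize=True))  # ratio preserved (~0.75/0.25)
```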
Let's inspect the data using some visualizations to understand the distributions better.
# Let's import seaborn and visualize the distribution of the principal loan amounts by gender
import seaborn as sns
bins = np.linspace(loans_df.Principal.min(), loans_df.Principal.max(), 10)
g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
# Let's visualize the age of the borrower by gender
bins = np.linspace(loans_df.age.min(), loans_df.age.max(), 10)
g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
# Let's visualize the day of the week when an individual gets a loan
# Let's first create the day of the week using effective_day column
loans_df['dayofweek'] = loans_df['effective_date'].dt.dayofweek
# Now let's create bins for each of the days (0-6, total 7, 1 for each day of the week)
bins = np.linspace(loans_df.dayofweek.min(), loans_df.dayofweek.max(), 10)
g = sns.FacetGrid(loans_df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
We can use the PairGrid class in seaborn to visualize the entire dataset instead. We can choose to look at bar plots, histograms, scatter plots, etc.
# Let's look at a pair grid plot to see the distributions of all variables in our dataframe.
g = sns.PairGrid(loans_df)
g = g.map(plt.bar)
g = sns.PairGrid(loans_df)
g = g.map(plt.scatter)
g = sns.PairGrid(loans_df)
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = sns.PairGrid(loans_df, hue="loan_status")
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = g.add_legend()
We see that people who get the loan at the end of the week tend not to pay it off, so let's use feature binarization to create a 'weekend' flag, with a threshold at day 4 (days greater than 3 become 1):
loans_df['weekend'] = loans_df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
loans_df.head()
Let's look at gender:
loans_df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
86% of females pay off their loans, while only 73% of males do.
Let's convert male to 0 and female to 1:
loans_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
loans_df.head()
loans_df.groupby(['education'])['loan_status'].value_counts(normalize=True)
loans_df[['Principal','terms','age','Gender','education']].head()
Feature = loans_df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(loans_df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()
Let's define the feature set, X:
X = Feature
X[0:5]
What are our labels?
y = loans_df['loan_status'].values
y[0:5]
Data standardization gives the data zero mean and unit variance. (Technically, this should be done after the train/test split, so that test-set statistics do not leak into the scaler.)
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
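Since the note above points out that scaling should technically follow the train/test split, here is a minimal sketch of the leakage-free version. The data (`X_demo`, `y_demo`) is synthetic, standing in for the feature matrix; the key point is that the scaler is fitted on the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix standing in for our Feature dataframe
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # no test statistics leak into the fit
```

The training data ends up with zero mean and unit variance, while the test data is transformed with the training statistics, mimicking how unseen data would be handled in production.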
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
Now we use the training set to build the models, and then use the test set to report their accuracy, using the following algorithms:
from sklearn.neighbors import KNeighborsClassifier
# since we have two groups of interest (two classes), let's set our k to 2.
k = 2
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
yhat = neigh.predict(X_test)
yhat[0:5]
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
# For good measure, let's inspect what happens when we vary k.
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1, Ks):
    # Train the model with k = n and predict on the test set
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])
mean_acc
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
from sklearn.tree import DecisionTreeClassifier
loanTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
loanTree # it shows the default parameters
loanTree.fit(X_train,y_train)
predLoanTree = loanTree.predict(X_test)
print (predLoanTree [0:5])
print (y_test [0:5])
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test, predLoanTree))
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn versions
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline
dot_data = StringIO()
filename = "loantree.png"  # output image for the rendered tree
featureNames = Feature.columns[0:8]
targetNames = loans_df['loan_status'].unique().tolist()
out=tree.export_graphviz(loanTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_train), filled=True, special_characters=True,rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
yhat = clf.predict(X_test)
yhat [0:5]
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=['PAIDOFF','COLLECTION'])
np.set_printoptions(precision=2)
print (classification_report(y_test, yhat))
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['PAIDOFF','COLLECTION'],normalize= False, title='Confusion matrix')
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
import pylab as pl
import scipy.optimize as opt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
yhat = LR.predict(X_test)
yhat
yhat_prob = LR.predict_proba(X_test)
yhat_prob
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
First, download and load the test set:
# !wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv
test_df = pd.read_csv('C:\\Users\\Shaf\\Downloads\\loan_test.csv')
test_df.head()
test_df.columns
# Let's convert the due_date and effective_date columns to datetime format to make them easier to work with later.
test_df['due_date'] = pd.to_datetime(test_df['due_date']) # convert due_date to datetime and overwrite
test_df['effective_date'] = pd.to_datetime(test_df['effective_date']) # convert effective_date to datetime and overwrite
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
test_df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
test_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
test_df.groupby(['education'])['loan_status'].value_counts(normalize=True)
Feature = test_df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(test_df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
X = Feature
y = test_df['loan_status'].values
X = preprocessing.StandardScaler().fit(X).transform(X)
# Let's take a look at the first 5 rows of the modified test set
test_df.head()
Feature.columns
yhat_prob = LR.predict_proba(X)
# yhat_prob
models = [neigh, loanTree, clf, LR]
modelnames = ['KNN', 'Decision Tree', 'SVM', 'Logistic Regression']
data = []
for model in models:
    yhat = model.predict(X)
    Jaccard = jaccard_similarity_score(y, yhat)
    F1 = f1_score(y, yhat, average='weighted')
    if model is LR:
        # Log loss needs predicted probabilities, computed above for LR only
        LogLoss = log_loss(y, yhat_prob)
        data.append([Jaccard, F1, LogLoss])
    else:
        data.append([Jaccard, F1, np.nan])
final_predictions_df_new = pd.DataFrame(data, index = modelnames, columns=['Jaccard', 'F1-score','LogLoss'])
final_predictions_df_new
You should be able to report the accuracy of the built models using different evaluation metrics:
| Algorithm | Jaccard | F1-score | LogLoss |
|---|---|---|---|
| KNN | ? | ? | NA |
| Decision Tree | ? | ? | NA |
| SVM | ? | ? | NA |
| LogisticRegression | ? | ? | ? |
There are several ways this pipeline could be improved. Specifically, I would:

- First look at the intercorrelations of the features, to check whether all included variables provide unique predictive value, and drop those with high shared variance.
- Test multiple values of k and tree depths to find the settings that optimize learning.
- Implement a cross-validation approach instead of a single train/test split, given the small sample size of our data.
- Check the features for normal distributions and consider scalers other than the standard scaler.
- Ensure that the data has equal class sizes, to minimize bias in the predictions.
- Create dummy variables for longer terms and higher principals, based on the visualizations.
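As a taste of the cross-validation idea, here is a minimal sketch on synthetic data. The dataset from `make_classification` is a hypothetical stand-in for the loan features (matching its 346 samples, 8 features, and roughly 75/25 class split); stratified folds keep the class ratio in every fold, which suits a small, imbalanced dataset better than a single split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Hypothetical data standing in for the loan feature set
X_demo, y_demo = make_classification(n_samples=346, n_features=8,
                                     weights=[0.75, 0.25], random_state=4)

# 5-fold stratified CV: every fold preserves the ~75/25 class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)
scores = cross_val_score(LogisticRegression(C=0.01, solver='liblinear'),
                         X_demo, y_demo, cv=cv)
print(scores.mean(), scores.std())  # average accuracy and its spread
```

Reporting the mean and spread over five folds gives a more stable accuracy estimate than any single 80/20 split on 346 rows.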
I will leave that work for a later time, hopefully, on a larger dataset.