Use a Random Forest model, scikit-learn, Python, and the Kyphosis dataset to predict whether kyphosis will return after corrective surgery.

About:

This project / case study is part of phase 1 of my 100 Days of Machine Learning Code challenge.

It is a homework solution for a section of the Machine Learning Classification Bootcamp in Python course.

Problem Statement:

Predict whether kyphosis will return to a patient after corrective spinal surgery.

Technology used:

Python, Jupyter Notebook

Model(s):

Decision Tree Classifier and Random Forest Classifier (scikit-learn)

Dataset(s):

Kyphosis dataset: 81 patients who underwent corrective spinal surgery

Libraries:

pandas, seaborn, matplotlib, scikit-learn

Resources:

Contact:

If for any reason you would like to contact me, please do so at the following:

Import Data and libraries

In [1]:
# Import Libraries
import pandas as pd
# import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Import data
kyphosis_df = pd.read_csv('../datasets/kyphosis/kyphosis.csv')
In [3]:
kyphosis_df.head()
Out[3]:
Kyphosis Age Number Start
0 absent 71 3 5
1 absent 158 3 14
2 present 128 4 5
3 absent 2 5 1
4 absent 1 4 15
In [4]:
kyphosis_df.tail()
Out[4]:
Kyphosis Age Number Start
76 present 157 3 13
77 absent 26 7 13
78 absent 120 2 13
79 present 42 7 6
80 absent 36 4 13
In [5]:
kyphosis_df.shape
Out[5]:
(81, 4)

Explore Dataset

In [6]:
# Age is in months; Number = vertebrae involved; Start = topmost vertebra operated on
kyphosis_df.describe()
Out[6]:
Age Number Start
count 81.000000 81.000000 81.000000
mean 83.654321 4.049383 11.493827
std 58.104251 1.619423 4.883962
min 1.000000 2.000000 1.000000
25% 26.000000 3.000000 9.000000
50% 87.000000 4.000000 13.000000
75% 130.000000 5.000000 16.000000
max 206.000000 10.000000 18.000000
In [7]:
# Class balance: kyphosis absent vs. present after surgery
sns.countplot(x='Kyphosis', data=kyphosis_df)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a0caa9438>
In [8]:
from sklearn.preprocessing import LabelEncoder
In [9]:
# Encode the Kyphosis column as numbers ('absent' -> 0, 'present' -> 1)
label_encoder_y = LabelEncoder()
kyphosis_df['Kyphosis'] = label_encoder_y.fit_transform(kyphosis_df['Kyphosis'])
In [10]:
kyphosis_df.head()
Out[10]:
Kyphosis Age Number Start
0 0 71 3 5
1 0 158 3 14
2 1 128 4 5
3 0 2 5 1
4 0 1 4 15
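The same encoding could also be done with a plain pandas mapping, which makes the label-to-number assignment explicit instead of relying on LabelEncoder's alphabetical ordering. A minimal equivalent sketch (not run as part of this notebook):

# Explicit mapping: 'absent' -> 0, 'present' -> 1
kyphosis_df['Kyphosis'] = kyphosis_df['Kyphosis'].map({'absent': 0, 'present': 1})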
In [11]:
# Patients where kyphosis was present after surgery (encoded as 1)
kyphosis_true = kyphosis_df[kyphosis_df['Kyphosis'] == 1]
In [12]:
kyphosis_true.head()
Out[12]:
Kyphosis Age Number Start
2 1 128 4 5
9 1 59 6 12
10 1 82 5 14
21 1 105 6 5
22 1 96 3 12
In [13]:
# Patients where kyphosis was absent after surgery (encoded as 0)
kyphosis_false = kyphosis_df[kyphosis_df['Kyphosis'] == 0]
In [14]:
kyphosis_false.head()
Out[14]:
Kyphosis Age Number Start
0 0 71 3 5
1 0 158 3 14
3 0 2 5 1
4 0 1 4 15
5 0 1 2 16
In [15]:
print('Disease present after operation percentage is',
      (len(kyphosis_true)/len(kyphosis_df))*100, '%')
Disease present after operation percentage is 20.98765432098765 %
In [16]:
print('Disease not present after operation percentage is',
      (len(kyphosis_false)/len(kyphosis_df))*100, '%')
Disease not present after operation percentage is 79.01234567901234 %
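The same class balance can be read off in one line with value_counts; a sketch of the equivalent computation (not executed here):

# Percentage of each class in the target column
print(kyphosis_df['Kyphosis'].value_counts(normalize=True) * 100)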
In [17]:
# Correlation matrix of the numeric columns
sns.heatmap(kyphosis_df.corr(), annot=True)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a151f24e0>
In [18]:
sns.pairplot(kyphosis_df, hue='Kyphosis',
             vars=['Age', 'Number', 'Start'])
Out[18]:
<seaborn.axisgrid.PairGrid at 0x1a152e69b0>

Data Prep

In [19]:
kyphosis_df.head()
Out[19]:
Kyphosis Age Number Start
0 0 71 3 5
1 0 158 3 14
2 1 128 4 5
3 0 2 5 1
4 0 1 4 15
In [20]:
# Features: everything except the target column
X = kyphosis_df.drop(['Kyphosis'], axis=1)
In [21]:
X.head()
Out[21]:
Age Number Start
0 71 3 5
1 158 3 14
2 128 4 5
3 2 5 1
4 1 4 15
In [22]:
# Target: the encoded Kyphosis column
y = kyphosis_df['Kyphosis']
In [23]:
y.head()
Out[23]:
0    0
1    0
2    1
3    0
4    0
Name: Kyphosis, dtype: int64
In [24]:
# Create a stratified train/test split (preserves the class ratio)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=2,
                                                    stratify=y)
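Because stratify=y is passed, the roughly 79/21 absent/present ratio should be preserved in both partitions. A quick sanity check, sketched rather than run here:

# Class proportions should match between train and test thanks to stratify=y
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))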

Train Model

In [25]:
# Decision tree
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
Out[25]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
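Note that scikit-learn randomly permutes the features at each split even with splitter='best', so repeated fits can give slightly different trees and importances. A sketch of a reproducible fit (random_state=42 is an arbitrary choice, not used in the run above):

# Fix the seed so the tree (and its feature importances) are reproducible
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)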
In [26]:
# Find feature importance
feature_importance = pd.DataFrame(decision_tree.feature_importances_,
                                  index=X_train.columns,
                                  columns=['importance'])
In [27]:
feature_importance
Out[27]:
importance
Age 0.476436
Number 0.254160
Start 0.269404
In [28]:
# Sorted feature importance
feature_importance = pd.DataFrame(
    decision_tree.feature_importances_,
    index=X_train.columns,
    columns=['importance']).sort_values('importance', ascending=False)
In [29]:
feature_importance
Out[29]:
importance
Age 0.476436
Start 0.269404
Number 0.254160
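A bar chart often reads more easily than the raw table; a minimal sketch using the feature_importance frame built above (not executed in this notebook):

# Plot the sorted importances as a horizontal bar chart
feature_importance['importance'].plot(kind='barh')
plt.xlabel('importance')
plt.title('Decision tree feature importance')
plt.show()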
In [30]:
from sklearn.metrics import confusion_matrix, classification_report
In [31]:
y_predict = decision_tree.predict(X_test)
In [32]:
cm = confusion_matrix(y_test, y_predict)
In [33]:
sns.heatmap(cm, annot=True)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a15d50898>
In [34]:
print(classification_report(y_test, y_predict))
             precision    recall  f1-score   support

          0       0.85      0.85      0.85        13
          1       0.50      0.50      0.50         4

avg / total       0.76      0.76      0.76        17
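The heatmap above labels the axes with the raw 0/1 codes; naming the classes makes false positives and false negatives easier to spot. A hedged sketch (0 = absent, 1 = present under the alphabetical encoding):

# Confusion matrix with human-readable class labels
labels = ['absent', 'present']
sns.heatmap(cm, annot=True, xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()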

In [35]:
from sklearn.ensemble import RandomForestClassifier
In [56]:
randomforest_classifier = RandomForestClassifier(n_estimators=500,
                                                 criterion='entropy')
In [62]:
randomforest_classifier.fit(X_train, y_train)
Out[62]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [63]:
y_predict_forest = randomforest_classifier.predict(X_test)
In [64]:
cm = confusion_matrix(y_test, y_predict_forest)
In [65]:
sns.heatmap(cm, annot=True)
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a162d40f0>
In [66]:
print(classification_report(y_test, y_predict_forest))
             precision    recall  f1-score   support

          0       1.00      0.92      0.96        13
          1       0.80      1.00      0.89         4

avg / total       0.95      0.94      0.94        17
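With only 81 rows, a single 17-sample test set gives a fairly noisy estimate, so these scores should be read with caution. Cross-validation would average the estimate over several splits; a minimal sketch using the X and y defined above (scores will vary between runs because no random_state is fixed):

from sklearn.model_selection import cross_val_score

# 5-fold stratified cross-validated accuracy for the random forest
scores = cross_val_score(
    RandomForestClassifier(n_estimators=500, criterion='entropy'),
    X, y, cv=5)
print('Mean CV accuracy:', scores.mean())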