This is a simple example of building a decision tree model on a very small dataset of only six rows. With so few rows we can easily visualize the tree and understand how it was built. Let's dive right in to the Python code. To keep things simple, I have defined the data directly in the code rather than importing it from an external file.
import pandas as pd
import matplotlib.pyplot as plt
# from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# This function displays the splits of the tree
from sklearn.tree import plot_tree
# To manually create a DataFrame, start with a dictionary of equal-length lists.
# The data has been slightly modified from the Udemy course
data = {"J": [1, 1, 0, 1, 0, 1],
        "K": [1, 1, 0, 0, 1, 1],
        "L": [1, 0, 1, 0, 1, 0],
        "Class": ['A', 'A', 'B', 'B', 'A', 'B']}
df = pd.DataFrame(data)
df
- We are trying to predict the class. Notice that K is the best feature for predicting it.
- When K = 1 it is class A three out of four times, which makes K the best predictor.
- When L = 1 it is class A two out of three times, which is a weaker predictor.
- When J = 1 it is class A half of the time, making J a poor predictor.
- So, in order of predictive power, they are K, L, and J.
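The intuition above can be checked numerically. The sketch below (an illustration, not part of the original notebook) computes the weighted Gini impurity of splitting on each feature, which is the criterion `DecisionTreeClassifier` uses by default; a lower value means a better split, and K comes out best:

```python
import pandas as pd

# Recreate the six-row dataset from above
data = {"J": [1, 1, 0, 1, 0, 1],
        "K": [1, 1, 0, 0, 1, 1],
        "L": [1, 0, 1, 0, 1, 0],
        "Class": ['A', 'A', 'B', 'B', 'A', 'B']}
df = pd.DataFrame(data)

def weighted_gini(df, feature, target='Class'):
    """Weighted Gini impurity after splitting df on a binary feature."""
    total = len(df)
    impurity = 0.0
    for value, subset in df.groupby(feature):
        p = subset[target].value_counts(normalize=True)
        gini = 1.0 - (p ** 2).sum()          # Gini of this branch
        impurity += (len(subset) / total) * gini  # weight by branch size
    return impurity

for feature in ['J', 'K', 'L']:
    print(feature, round(weighted_gini(df, feature), 3))
```

Splitting on K gives 0.25, on L about 0.444, and on J 0.5, matching the K, L, J ordering argued above.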
# Define the y (target) variable
y = df['Class']

# Define the X (predictor) variables
X = df.copy()
X = X.drop('Class', axis=1)
X
y
Instantiate the Model
# Instantiate the model
decision_tree = DecisionTreeClassifier(random_state=42)

# We will use ALL of the data to train the model.
# Fit the model to the training data
decision_tree.fit(X, y)
print(X.columns)
print('\n')
X.info()
# Sort the class names so they match the order the classifier uses
# (decision_tree.classes_); a plain set() does not guarantee the order.
cn = sorted(df['Class'].unique())
cn
# Plot the tree
plt.figure(figsize=(6, 4))
plot_tree(decision_tree, max_depth=4, fontsize=9,
          feature_names=['J', 'K', 'L'], class_names=cn, filled=True)
plt.show()
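If the plotted figure is hard to read, scikit-learn's `export_text` prints the same splits as plain text. The sketch below refits the same model on the same six rows so it runs on its own:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Same six-row dataset as above
data = {"J": [1, 1, 0, 1, 0, 1],
        "K": [1, 1, 0, 0, 1, 1],
        "L": [1, 0, 1, 0, 1, 0],
        "Class": ['A', 'A', 'B', 'B', 'A', 'B']}
df = pd.DataFrame(data)
X = df.drop('Class', axis=1)
y = df['Class']

tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Text rendering of the fitted tree; the root split is on K
print(export_text(tree, feature_names=['J', 'K', 'L']))
```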
Test Dataset One – Perfect K
# To manually create a DataFrame, start with a dictionary of equal-length lists.
# The data has been slightly modified from the Udemy course
data_test = {"J": [1, 0, 0],
             "K": [1, 1, 0],
             "L": [1, 0, 1],
             "Class": ['A', 'A', 'B']}
df_test = pd.DataFrame(data_test)
df_test
# Define the y (target) variable
y_test = df_test['Class']

# Define the X (predictor) variables
X_test = df_test.copy()
X_test = X_test.drop('Class', axis=1)
X_test
from sklearn.metrics import classification_report, confusion_matrix

predictions = decision_tree.predict(X_test)
print(confusion_matrix(y_test, predictions))
print('\n')
print(classification_report(y_test, predictions))
Test Dataset Two – Imperfect K
# To manually create a DataFrame, start with a dictionary of equal-length lists.
# The data has been slightly modified from the Udemy course
data_test2 = {"J": [1, 0, 0],
              "K": [1, 0, 0],
              "L": [1, 0, 1],
              "Class": ['A', 'A', 'B']}
df_test2 = pd.DataFrame(data_test2)
df_test2
# Define the y (target) variable
y_test2 = df_test2['Class']

# Define the X (predictor) variables
X_test2 = df_test2.copy()
X_test2 = X_test2.drop('Class', axis=1)
X_test2
predictions2 = decision_tree.predict(X_test2)
print(confusion_matrix(y_test2, predictions2))
print('\n')
print(classification_report(y_test2, predictions2))
[[1 1]
 [0 1]]

              precision    recall  f1-score   support

           A       1.00      0.50      0.67         2
           B       0.50      1.00      0.67         1

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3
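To read the matrix: with the classes in sorted order (A, B), rows are the true labels and columns are the predictions, so one A was predicted correctly, one A was mistaken for B, and the single B was predicted correctly. As a quick sanity check (a sketch assuming NumPy is available), the 0.67 accuracy in the report is just the diagonal of the matrix over the total:

```python
import numpy as np

# Confusion matrix from the imperfect-K test:
# rows = true class (A, B), columns = predicted class
cm = np.array([[1, 1],
               [0, 1]])

# Accuracy = correct predictions (diagonal) / total predictions
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 2))
```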