- Decision Tree Modelling Introduction
- Decision Tree Python Algorithm
- Decision Tree Workflow
- Decision Tree 4 Rows
- Decision Tree – Only Six Rows
- Decision Trees and Random Forests
Tree-based machine learning is a type of supervised machine learning that performs classification and regression tasks. It uses a decision tree as a predictive model to go from observations about an item, represented in the branches, to conclusions about the item’s target value, represented in the leaves.
A Simple Example
We have a series of posts that walk through a simple example of building a decision tree in Python. The series starts with a post called Simple Decision Tree in Python. That example uses a small dataset of fruits.
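To make this concrete, here is a minimal sketch of what a decision tree classifier looks like in scikit-learn. The fruit values below are made up for illustration; they are not the dataset from the series.

```python
# A minimal sketch: fit a decision tree on a tiny, made-up fruit table.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical fruit data: [weight_in_grams, color_score]
X = [
    [150, 0.9],   # apple
    [170, 0.8],   # apple
    [140, 0.3],   # lemon
    [130, 0.2],   # lemon
    [120, 0.7],   # plum
    [110, 0.6],   # plum
]
y = ["apple", "apple", "lemon", "lemon", "plum", "plum"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Print the learned if-then-else rules and classify a new fruit.
print(export_text(clf, feature_names=["weight", "color_score"]))
print(clf.predict([[160, 0.85]]))   # likely 'apple' for this toy data
```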
Decision trees are like asking a set of if-then-else questions that branch toward a classification. Suppose I am going out and wondering whether to take an umbrella. If it’s not raining, I decide not to take one. If it is raining, I ask myself how far I need to go: if it’s only a short distance, I leave the umbrella and get just a bit wet; otherwise, I take it.
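That reasoning can be written directly as if-then-else code; this is exactly the kind of branching a decision tree learns automatically. The 0.5 km cutoff below is just an assumed value for “a short distance”.

```python
# The umbrella decision above, written as plain if-then-else rules.
def take_umbrella(raining: bool, distance_km: float) -> bool:
    if not raining:
        return False          # not raining: no umbrella
    if distance_km < 0.5:     # assumed threshold for "a short distance"
        return False          # short walk in the rain: get a bit wet
    return True               # raining and a longer walk: take the umbrella

print(take_umbrella(raining=True, distance_km=2.0))   # True
print(take_umbrella(raining=True, distance_km=0.2))   # False
```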
Decision trees are like flow charts. They use branching paths to predict the outcome of an event, estimate the probability of certain outcomes, or reach a decision. They can be used for classification problems, where a specific class or outcome is predicted, and for regression problems, where a continuous variable is predicted, such as the price of a house.
Of course, decision trees only look like trees when they are flipped upside down. They start with the root at the top and grow downward, so the “leaves” are at the bottom. Decision trees are made of nodes, which are groups of samples (data). There are different types of nodes, with the first (top) node being the root node. The first split always comes off the root node, dividing the samples into two new nodes based on the values they contain for a particular feature. If the answer to the split question is “yes”, the samples typically go to the left.
The bottom-level nodes that do not split are called leaf nodes. All the nodes above the leaf nodes are called decision nodes, because they all make a decision that sorts the data either to the left or to the right.
Decision trees require no assumptions about the distribution of the underlying data, and unlike some other models, they can handle collinearity easily. Preparing data to train a decision tree is also a much less complex process, requiring little preprocessing, if any at all. However, decision trees can be susceptible to overfitting: the model might get extremely good at predicting the training data, but as soon as new data is introduced, it may not perform nearly as well.
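One common way to rein in overfitting is to limit how far the tree can grow. The sketch below uses scikit-learn’s max_depth and min_samples_leaf parameters on a synthetic dataset to compare an unconstrained tree with a constrained one; the exact scores depend on the data.

```python
# A sketch: limiting tree growth to reduce overfitting on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                random_state=0).fit(X_train, y_train)

# The unconstrained tree typically scores near 1.0 on the training data but
# drops on the test set; the constrained tree generalizes more evenly.
for name, model in [("unconstrained", deep), ("constrained", pruned)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```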
Decision trees are also not sensitive to outliers since the partitioning happens based on the proportion of samples within the split ranges and not on absolute values.
The root node is the first node in the tree: the top node, with no predecessors. The nodes where a decision is made are decision nodes; they always point to leaf nodes or to other decision nodes within the tree. Leaf nodes are where a final prediction is made, and the process ends there. When the tree is drawn, “yes” conventionally goes to the left and “no” to the right.
These two new nodes off the root node are referred to as child nodes of the root. A child node is any node that results from a split. The node that the child splits from is known as the parent node.
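If you have a fitted scikit-learn tree, these node relationships can be inspected through its tree_ attribute. The sketch below fits a small tree on the built-in iris dataset just to have something to inspect; any fitted tree would work. Note that in scikit-learn the left child holds the samples for which the split condition (feature value less than or equal to the threshold) is true, which matches the yes-goes-left convention described above.

```python
# Walk the nodes of a fitted tree: decision nodes split, leaf nodes do not.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

tree = clf.tree_
for node_id in range(tree.node_count):
    left, right = tree.children_left[node_id], tree.children_right[node_id]
    if left == -1:                     # no children: this is a leaf node
        print(f"node {node_id}: leaf")
    else:                              # a decision node (node 0 is the root)
        print(f"node {node_id}: splits on feature {tree.feature[node_id]} "
              f"at {tree.threshold[node_id]:.2f} -> children {left}, {right}")
```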
Categorical or Continuous?
A predictor variable is an X variable: a variable that represents a characteristic of the items. In the fruit example, color is a predictor variable. If the predictor variable is categorical, the decision tree algorithm will consider splitting based on category. If the predictor variable is continuous, splits can be made anywhere along the range of numbers present in the data. Often the potential split points are determined by sorting the values of the feature and taking the mean of each consecutive pair of values. However, there can be any number of split points, and fewer can be considered to save computational resources and time.
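Here is a small sketch of that midpoint idea for a continuous feature. The weights below are made-up values, not data from the series.

```python
# Candidate split points for a continuous feature: sort the unique values and
# take the mean of each consecutive pair.
feature_values = [110, 170, 150, 130, 120, 140]   # e.g. fruit weights

sorted_vals = sorted(set(feature_values))
candidate_splits = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]

print(candidate_splits)   # [115.0, 125.0, 135.0, 145.0, 160.0]
```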
Decision Splits and Impurity
With a decision tree, data is split each time it hits a decision node. The term impurity refers to the amount of class mixture within a node: nodes with low impurity contain many more samples of one class than of any other. A perfect split would have no impurity in the resulting child nodes; it would partition the data so that each child contains only a single class. The worst possible split would have high impurity in the resulting child nodes; each child would have equal numbers of every class.
An algorithm is used to calculate the “purity” of the child nodes that would result from each split point of each variable. The feature and split point that generate the purest child nodes are selected to partition the data.
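A rough sketch of that selection process for a single feature is below. It uses Gini impurity, which is defined in the next section, as the purity measure, and weights each child’s impurity by its share of the samples. The values are toy data for illustration.

```python
# Score every candidate split point of one feature by the weighted impurity
# of the two child nodes it would create, then keep the purest split.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    sorted_vals = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]
    scored = []
    for t in candidates:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        scored.append((weighted, t))
    return min(scored)   # (lowest weighted impurity, split point)

weights = [110, 120, 130, 140, 150, 170]          # toy feature values
fruit = ["plum", "plum", "lemon", "lemon", "apple", "apple"]
print(best_split(weights, fruit))   # ~0.33 at 125.0 (145.0 ties on this data)
```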
Gini Impurity
The decision tree algorithm determines the split that will result in the lowest impurity among the child nodes by performing a calculation. There are several possible metrics to use to determine the purity of a node and to decide how to split, including Gini impurity, entropy, information gain, and log loss. The most straightforward is Gini impurity, and it’s also the default for the decision tree classifier in scikit-learn. The algorithm looks for low Gini impurity scores when it is growing the tree.
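For a node with class proportions p_i, Gini impurity is 1 minus the sum of the squared proportions: 0.0 for a pure node and 0.5 for a two-class node split 50/50. A small sketch:

```python
# Gini impurity for a single node: 1 - sum(p_i^2) over the class proportions.
# Lower is purer; 0.0 means the node holds only one class.
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["apple"] * 4))                 # 0.0   (pure node)
print(gini_impurity(["apple", "lemon"] * 2))        # 0.5   (worst 2-class mix)
print(gini_impurity(["apple", "apple", "lemon"]))   # ~0.444

# In scikit-learn this is the default split criterion:
# DecisionTreeClassifier(criterion="gini") is equivalent to the default.
```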