Scikit-learn is a Python module for machine learning. It is distributed under the BSD 3-clause license and built on top of NumPy and SciPy. The package provides a consistent Python interface and robust tools for machine learning and statistical modeling, such as classification, regression, and clustering.

In this article, we will learn all about Sklearn Decision Trees.


Sklearn Decision Trees

Before getting into the details of implementing a decision tree, let us understand classifiers and decision trees.

Classifiers

A classifier algorithm maps input data to a target class using decision rules, letting us understand which qualities are associated with a given class. In this supervised machine learning technique, the final labels are already known, and we are interested in how they can be predicted. For example, based on variables such as sepal width, petal length, sepal length, and petal width, we can use a Decision Tree Classifier to estimate which sort of iris flower we have.
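To make the idea concrete, here is a minimal hand-written decision rule (the thresholds are hypothetical, chosen only for illustration; a trained decision tree learns such thresholds from the data):

# a hand-written decision rule mapping petal measurements to a species
# (hypothetical thresholds, for illustration only)
def classify_iris(petal_length_cm, petal_width_cm):
    if petal_length_cm <= 2.5:
        return "setosa"
    if petal_width_cm <= 1.7:
        return "versicolor"
    return "virginica"

print(classify_iris(1.4, 0.2))  # setosa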

Decision Tree

A decision tree is a model of decisions and their possible outcomes, which can include utilities, costs, and consequences, laid out in a flowchart-like tree structure.

The decision-tree algorithm is classified as a supervised learning algorithm. It can be used with both continuous and categorical output variables.

Branches/edges represent a node's outcome, and each node contains one of the following:

  • Conditions (decision nodes)
  • Results (end nodes)

Now that we understand what classifiers and decision trees are, let us look at SkLearn Decision Tree Regression.

Decision Tree Regression

Decision tree regression examines an object's features and trains a tree-shaped model to predict meaningful continuous output for future data. The output is continuous rather than discrete, i.e., it is not restricted to a known set of discrete values.

Example of a discrete output - A cricket-match prediction model that determines whether a particular team wins or not.

Example of continuous output - A sales forecasting model that predicts the profit margins that a company would gain over a financial year based on past values.

In this case, a decision tree regression model is used to predict continuous values.
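As a brief sketch of the regression case (using a small synthetic dataset for illustration; this is separate from the iris classification example below):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic data: a noisy sine curve (illustrative only)
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

reg = DecisionTreeRegressor(max_depth=3, random_state=42)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # a continuous value, not a class label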

Now that we have discussed sklearn decision trees, let us walk through a step-by-step implementation.


Step-By-Step Implementation of Sklearn Decision Trees

Before getting into the coding part of implementing decision trees, we need to collect the data in a proper format to build a decision tree. We will be using the iris dataset from sklearn's built-in datasets, which is relatively straightforward and demonstrates how to construct a decision tree classifier.

An advantage of scikit-learn's DecisionTreeClassifier is that the target variable can be either numerical or categorical. Given the iris dataset, we will preserve the categorical nature of the flowers for clarity.

Let us now see how we can implement decision trees.

Importing the Dataset

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# load the iris dataset
data = load_iris()

# convert to a dataframe
df = pd.DataFrame(data.data, columns=data.feature_names)

# create the species column
df['Species'] = data.target

# replace the numeric targets with the actual species names
target = np.unique(data.target)
target_names = np.unique(data.target_names)
targets = dict(zip(target, target_names))
df['Species'] = df['Species'].replace(targets)
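As an optional sanity check, we can inspect the first few rows and the class counts to confirm the dataframe looks as expected:

print(df.head())
print(df['Species'].value_counts())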

The next step is to split out our testing and training datasets. The goal is to guarantee that the model is not trained on all of the available data, enabling us to observe how it performs on data that hasn't been seen before. If we use all of the data as training data, we risk overfitting the model, meaning it will perform poorly on unseen data.

Extracting Datasets

x = df.drop(columns="Species")
y = df["Species"]
feature_names = x.columns
labels = y.unique()

# split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, test_x, y_train, test_lab = train_test_split(x, y,
                                                      test_size=0.4,
                                                      random_state=42)

Now that we have the data in the right format, we will build the decision tree in order to anticipate how the different flowers will be classified. The first step is to import the DecisionTreeClassifier class from sklearn.tree.

Importing Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

As part of the next step, we apply the classifier to the training data. The classifier is initialized as clf with max_depth=3 and random_state=42. The max_depth argument controls the tree's maximum depth; we use it here to guard against overfitting and to keep the final result easy to follow. The random_state parameter ensures the results are reproducible across runs. We will now fit the algorithm to the training data.


Fitting Algorithm to Training Data

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
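As an optional check, the classifier's score() method reports the accuracy on the training data (evaluation on unseen test data follows later):

print("Training accuracy:", clf.score(X_train, y_train))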

We want to be able to understand how the algorithm works, and one of the benefits of employing a decision tree classifier is that the output is simple to comprehend and visualize.

Checking the Algorithm

We can do this in the following two ways:

  1. As a tree diagram
  2. As a text-based diagram

Let us now see the detailed implementation of these:

1. As a Tree Diagram

from sklearn import tree
import matplotlib.pyplot as plt

# plot the fitted tree
plt.figure(figsize=(30, 10), facecolor='k')
a = tree.plot_tree(clf,
                   feature_names=feature_names,
                   class_names=labels,
                   rounded=True,
                   filled=True,
                   fontsize=14)
plt.show()

2. As a Text-Based Diagram

from sklearn.tree import export_text

# print the decision rules as plain text
tree_rules = export_text(clf, feature_names=list(feature_names))
print(tree_rules)

Output

|--- petal length (cm) <= 2.45
|   |--- class: setosa
|--- petal length (cm) >  2.45
|   |--- petal width (cm) <= 1.75
|   |   |--- petal length (cm) <= 5.35
|   |   |   |--- class: versicolor
|   |   |--- petal length (cm) >  5.35
|   |   |   |--- class: virginica
|   |--- petal width (cm) >  1.75
|   |   |--- petal length (cm) <= 4.85
|   |   |   |--- class: virginica
|   |   |--- petal length (cm) >  4.85
|   |   |   |--- class: virginica
The first split is based on petal length: flowers with a petal length of 2.45 cm or less are classified as setosa, while those measuring more move further down the tree. For flowers with a petal length greater than 2.45 cm, a split on petal width follows, and then two further splits produce the final, more precise classifications.

Here, we are not only interested in how well the model did on the training data, but also in how well it works on unseen test data. This means we need to use it to predict classes from the test values, which we do with the predict() method.

Predict Class From Test Values

test_pred_decision_tree = clf.predict(test_x)
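One simple, optional summary of these predictions is the overall accuracy, which scikit-learn computes with accuracy_score():

from sklearn.metrics import accuracy_score
print("Test accuracy:", accuracy_score(test_lab, test_pred_decision_tree))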

We are concerned about false negatives (predicted false but actually true), true positives (predicted true and actually true), false positives (predicted true but not actually true), and true negatives (predicted false and actually false). 

Examining the results in a confusion matrix is one approach to doing so. A confusion matrix allows us to see how the predicted and true labels match up by displaying actual values on one axis and predicted values on the other. This is useful for determining where we might get false negatives or false positives and thus how well the algorithm performed.

from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt

confusion_matrix = metrics.confusion_matrix(test_lab, test_pred_decision_tree)
matrix_df = pd.DataFrame(confusion_matrix)

# plot the confusion matrix as a labeled heatmap
sns.set(font_scale=1.3)
plt.figure(figsize=(10, 7))
ax = plt.axes()
sns.heatmap(matrix_df, annot=True, fmt="g", ax=ax, cmap="magma")
ax.set_title('Confusion Matrix - Decision Tree')
ax.set_xlabel("Predicted label", fontsize=15)
ax.set_xticklabels(list(labels))
ax.set_ylabel("True label", fontsize=15)
ax.set_yticklabels(list(labels), rotation=0)
plt.show()

In the output above, only one value from the versicolor class is misclassified on the unseen data. This indicates that, overall, the algorithm has done a good job of predicting unseen data.
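For a per-class breakdown of these results, scikit-learn's classification_report() is an optional complement to the confusion matrix, showing precision, recall, and F1-score for each species:

from sklearn.metrics import classification_report
print(classification_report(test_lab, test_pred_decision_tree))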

Master SkLearn With Simplilearn

The advantages of employing a decision tree are that it is simple to follow and interpret, it can handle both categorical and numerical data, it limits the influence of weak predictors, and its structure can be extracted for visualization.

There are a few drawbacks, such as the possibility of biased trees if one class dominates, overly complex and large trees leading to overfitting, and large changes in results from small variations in the data. However, decision trees can be quite useful in practice, and they can be used in conjunction with other classification algorithms, such as random forests or k-nearest neighbors, to understand how classifications are made and to aid in decision-making.
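As a short sketch of that idea (the hyperparameters here are illustrative, not tuned), a random forest can be fitted on the same training split we prepared earlier:

from sklearn.ensemble import RandomForestClassifier

# an ensemble of decision trees; n_estimators is illustrative
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random forest test accuracy:", rf.score(test_x, test_lab))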

To learn more about SkLearn decision trees and related data science concepts, enroll in Simplilearn's Data Science Certification and learn from the best in the industry, mastering key data science and machine learning concepts within a year!
