Bagging Models

This is supplementary material for the Machine Learning Simplified book. It covers Python implementations of the topics discussed, while detailed explanations can be found in the book.
I also assume you know Python syntax and how it works. If you don’t, I highly recommend that you take a break and get introduced to the language before going forward with my code.
This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.

This notebook is a supplement for Chapter 9, Ensemble Models, of the Machine Learning For Everyone book.

1. Required Libraries, Data & Variables#

Let’s import the data and have a look at it:

import pandas as pd

data = {
    'Day': list(range(1, 31)),
    'Temperature': [
        'Cold', 'Hot', 'Cold', 'Hot', 'Hot',
        'Cold', 'Hot', 'Cold', 'Hot', 'Cold',
        'Hot', 'Cold', 'Hot', 'Cold', 'Hot',
        'Cold', 'Hot', 'Cold', 'Hot', 'Cold',
        'Hot', 'Cold', 'Hot', 'Cold', 'Hot',
        'Cold', 'Hot', 'Cold', 'Hot', 'Cold'
    ],
    'Humidity': [
        'Normal', 'Normal', 'Normal', 'High', 'High',
        'Normal', 'High', 'Normal', 'High', 'Normal',
        'High', 'Normal', 'High', 'Normal', 'High',
        'Normal', 'High', 'Normal', 'High', 'Normal',
        'High', 'Normal', 'High', 'Normal', 'High',
        'Normal', 'High', 'Normal', 'High', 'Normal'
    ],
    'Outlook': [
        'Rain', 'Rain', 'Sunny', 'Sunny', 'Rain',
        'Sunny', 'Rain', 'Sunny', 'Rain', 'Sunny',
        'Rain', 'Sunny', 'Rain', 'Sunny', 'Rain',
        'Sunny', 'Rain', 'Sunny', 'Rain', 'Sunny',
        'Rain', 'Sunny', 'Rain', 'Sunny', 'Rain',
        'Sunny', 'Rain', 'Sunny', 'Rain', 'Sunny'
    ],
    'Wind': [
        'Strong', 'Weak', 'Weak', 'Weak', 'Weak',
        'Strong', 'Weak', 'Weak', 'Weak', 'Strong',
        'Weak', 'Weak', 'Strong', 'Weak', 'Weak',
        'Weak', 'Strong', 'Weak', 'Weak', 'Weak',
        'Strong', 'Weak', 'Weak', 'Weak', 'Weak',
        'Strong', 'Weak', 'Weak', 'Weak', 'Strong'
    ],
    'Golf Played': [
        'No', 'No', 'Yes', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes'
    ]
}

# Converting the dictionary into a DataFrame
df = pd.DataFrame(data)

# Displaying the DataFrame
df.head(10)

	Day	Temperature	Humidity	Outlook	Wind	Golf Played
0	1	Cold	Normal	Rain	Strong	No
1	2	Hot	Normal	Rain	Weak	No
2	3	Cold	Normal	Sunny	Weak	Yes
3	4	Hot	High	Sunny	Weak	Yes
4	5	Hot	High	Rain	Weak	Yes
5	6	Cold	Normal	Sunny	Strong	No
6	7	Hot	High	Rain	Weak	Yes
7	8	Cold	Normal	Sunny	Weak	No
8	9	Hot	High	Rain	Weak	Yes
9	10	Cold	Normal	Sunny	Strong	Yes

2. Preparation of the Dataset#

One-hot encoding the categorical variables

from sklearn.preprocessing import OneHotEncoder

# `sparse_output` replaces the deprecated `sparse` argument (scikit-learn 1.2 and later)
encoder = OneHotEncoder(sparse_output=False)
encoded_features = encoder.fit_transform(df[['Temperature', 'Humidity', 'Outlook', 'Wind']])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['Temperature', 'Humidity', 'Outlook', 'Wind']))

Displaying the first 10 rows of the encoded DataFrame:

encoded_df.head(10)

	Temperature_Cold	Temperature_Hot	Humidity_High	Humidity_Normal	Outlook_Rain	Outlook_Sunny	Wind_Strong	Wind_Weak
0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0
1	0.0	1.0	0.0	1.0	1.0	0.0	0.0	1.0
2	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
3	0.0	1.0	1.0	0.0	0.0	1.0	0.0	1.0
4	0.0	1.0	1.0	0.0	1.0	0.0	0.0	1.0
5	1.0	0.0	0.0	1.0	0.0	1.0	1.0	0.0
6	0.0	1.0	1.0	0.0	1.0	0.0	0.0	1.0
7	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
8	0.0	1.0	1.0	0.0	1.0	0.0	0.0	1.0
9	1.0	0.0	0.0	1.0	0.0	1.0	1.0	0.0
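As a quick cross-check of what one-hot encoding produces, the same kind of table can be built with pandas alone. This is only a sketch on a hypothetical three-row frame (`toy_df` and `toy_encoded` are illustrative names, not part of the notebook); it assumes `pd.get_dummies` with `dtype=float` so the values come out as 1.0/0.0 like the output above.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate the encoding.
toy_df = pd.DataFrame({
    'Temperature': ['Cold', 'Hot', 'Cold'],
    'Wind': ['Strong', 'Weak', 'Weak'],
})

# One binary column per category value; dtype=float gives 1.0/0.0 instead of booleans.
toy_encoded = pd.get_dummies(toy_df, columns=['Temperature', 'Wind'], dtype=float)
print(toy_encoded)
```

Every row has exactly one 1.0 per original column, which is why each pair such as Temperature_Cold and Temperature_Hot always sums to 1.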

Adding the encoded features back to the dataframe

df = df.join(encoded_df)

df.head(5)

	Day	Temperature	Humidity	Outlook	Wind	Golf Played	Temperature_Cold	Temperature_Hot	Humidity_High	Humidity_Normal	Outlook_Rain	Outlook_Sunny	Wind_Strong	Wind_Weak
0	1	Cold	Normal	Rain	Strong	No	1.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0
1	2	Hot	Normal	Rain	Weak	No	0.0	1.0	0.0	1.0	1.0	0.0	0.0	1.0
2	3	Cold	Normal	Sunny	Weak	Yes	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
3	4	Hot	High	Sunny	Weak	Yes	0.0	1.0	1.0	0.0	0.0	1.0	0.0	1.0
4	5	Hot	High	Rain	Weak	Yes	0.0	1.0	1.0	0.0	1.0	0.0	0.0	1.0

Preparing the features by dropping the Day column, the original categorical columns, and the target, so that only the one-hot encoded columns remain.

X = df.drop(['Day', 'Temperature', 'Humidity', 'Outlook', 'Wind', 'Golf Played'], axis=1)
X.head(5)

	Temperature_Cold	Temperature_Hot	Humidity_High	Humidity_Normal	Outlook_Rain	Outlook_Sunny	Wind_Strong	Wind_Weak
0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0
1	0.0	1.0	0.0	1.0	1.0	0.0	0.0	1.0
2	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
3	0.0	1.0	1.0	0.0	0.0	1.0	0.0	1.0
4	0.0	1.0	1.0	0.0	1.0	0.0	0.0	1.0

Defining y:

y = df['Golf Played']

y

0      No
1      No
2     Yes
3     Yes
4     Yes
5      No
6     Yes
7      No
8     Yes
9     Yes
10     No
11    Yes
12     No
13    Yes
14    Yes
15     No
16    Yes
17     No
18    Yes
19    Yes
20     No
21    Yes
22     No
23    Yes
24    Yes
25     No
26    Yes
27     No
28    Yes
29    Yes
Name: Golf Played, dtype: object

Splitting the dataset into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
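With 30 rows and test_size=0.2, the split leaves 24 rows for training and 6 for testing. The sketch below checks those sizes on a hypothetical stand-in dataset (`X_demo` and `y_demo` are illustrative names) with the same 18 Yes / 12 No class balance as the data above. It also passes `stratify`, an optional extra that the notebook's split does not use, to keep the Yes/No ratio similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's data: 30 rows, 18 'Yes' and 12 'No'.
X_demo = pd.DataFrame({'feature': range(30)})
y_demo = pd.Series(['Yes'] * 18 + ['No'] * 12, name='Golf Played')

# stratify=y_demo keeps the class proportions roughly equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_te.value_counts())
```

On a dataset this small, a test set of 6 rows makes any accuracy figure very coarse, so the scores later in the notebook are best read as illustrations.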

3. Bagging Ensemble#

3.1. Building a Bagging Ensemble#

Creating the Bagging classifier

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Creating the Bagging classifier
# Using a DecisionTreeClassifier as the base classifier
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=10,    # Number of trees
    max_samples=0.8,    # Fraction of samples drawn from X to train each base estimator
    max_features=0.8,   # Fraction of features drawn from X to train each base estimator
    random_state=42
)
model.fit(X_train, y_train)
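To make the idea behind bagging concrete, here is a minimal hand-written sketch: each tree is trained on its own bootstrap sample, and the ensemble predicts by majority vote. The helper names (`fit_bagged_trees`, `predict_majority`) and the toy data are hypothetical. This is a simplification rather than scikit-learn's implementation: `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, and it can also subsample rows and features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two binary features, the label follows the first one.
X_toy = np.array([[i % 2, (i // 2) % 2] for i in range(40)], dtype=float)
y_toy = np.where(X_toy[:, 0] == 1, 'Yes', 'No')

def fit_bagged_trees(X, y, n_estimators=10, seed=42):
    """Train each tree on its own bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Return, for every row, the label that most trees voted for."""
    votes = np.array([tree.predict(X) for tree in trees])  # shape: (n_trees, n_rows)
    winners = []
    for row_votes in votes.T:
        labels, counts = np.unique(row_votes, return_counts=True)
        winners.append(labels[np.argmax(counts)])
    return np.array(winners)

toy_trees = fit_bagged_trees(X_toy, y_toy)
toy_pred = predict_majority(toy_trees, X_toy)
print((toy_pred == y_toy).mean())
```

Because every tree sees a slightly different sample, their individual errors tend to differ, and voting smooths them out.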

3.2. Visualizing the bagged ensemble#

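import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Building 5 stand-alone decision trees for illustration
# (trained on the full training set, not taken from the fitted Bagging model)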
feature_names = encoder.get_feature_names_out(['Temperature', 'Humidity', 'Outlook', 'Wind'])
trees = [DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42 + i) for i in range(5)]
for tree in trees:
    tree.fit(X_train, y_train)

# Plotting all 5 trees
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 4), dpi=300)
for i, tree in enumerate(trees):
    plot_tree(tree, feature_names=feature_names, class_names=['No', 'Yes'], filled=True, ax=axes[i])
    axes[i].set_title(f'Tree {i+1}')

plt.tight_layout()
plt.show()
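The members of a fitted bagging model can also be inspected directly. The sketch below, on hypothetical toy data, reads the trained trees from `estimators_` and the feature subset each tree received from `estimators_features_`, then prints a text rendering with `export_text` instead of a plot. The names `demo_bag`, `demo_names`, and `tree_reports` are illustrative only.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: 5 binary features, the label depends on the first one.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(60, 5)).astype(float)
y_demo = np.where(X_demo[:, 0] == 1, 'Yes', 'No')
demo_names = ['f0', 'f1', 'f2', 'f3', 'f4']

# The base estimator is passed positionally so the call does not depend on the
# keyword name, which differs between scikit-learn versions.
demo_bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=3,
    max_features=0.8,
    random_state=42,
).fit(X_demo, y_demo)

tree_reports = []
for tree, feature_idx in zip(demo_bag.estimators_, demo_bag.estimators_features_):
    # Each tree only saw the columns listed in feature_idx, in that order.
    used_names = [demo_names[i] for i in feature_idx]
    tree_reports.append(export_text(tree, feature_names=used_names))

print(tree_reports[0])
```

Mapping the names through `estimators_features_` matters whenever max_features is below 1.0, because a tree's column 0 is then not necessarily the dataset's column 0.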

3.3. Predicting the Results#

Predicting the test set results

y_pred = model.predict(X_test)

y_pred
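Besides hard labels, the ensemble can report how confident it is. The following sketch, on hypothetical toy data (`demo_model` is an illustrative name), compares `predict` with `predict_proba`; the probability columns follow the order of `classes_`.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label simply follows the first feature.
X_demo = np.array([[i % 2, (i // 2) % 2] for i in range(40)], dtype=float)
y_demo = np.where(X_demo[:, 0] == 1, 'Yes', 'No')

demo_model = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10, random_state=42
).fit(X_demo, y_demo)

proba = demo_model.predict_proba(X_demo[:4])  # one column per entry of classes_
labels = demo_model.predict(X_demo[:4])       # class with the highest mean probability

print(demo_model.classes_)
print(proba)
print(labels)
```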

3.4. Evaluating the model#

from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)
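Accuracy is simply the share of test rows predicted correctly. The sketch below recomputes it by hand on six hypothetical labels (not the notebook's actual predictions) and adds a confusion matrix, which shows where the errors fall.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for six test days, not the notebook's actual predictions.
y_true_demo = ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes']
y_pred_demo = ['Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']

# Accuracy is the share of positions where prediction and truth agree.
n_correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
manual_accuracy = n_correct / len(y_true_demo)

print(manual_accuracy)
print(accuracy_score(y_true_demo, y_pred_demo))

# Rows are true classes, columns are predicted classes, in the order given by labels.
demo_matrix = confusion_matrix(y_true_demo, y_pred_demo, labels=['No', 'Yes'])
print(demo_matrix)
```

Here 4 of 6 predictions match, so both accuracy values are about 0.667, and the matrix shows one No predicted as Yes and one Yes predicted as No.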

4. Random Forest Classifier#

4.1. Building a Random Forest#

Creating the Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=3, random_state=42)

random_forest.fit(X_train, y_train)
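A random forest can be read as bagging over decision trees that consider only a random subset of features at every split. The sketch below puts the two side by side on hypothetical toy data; the variable names are illustrative, and the two models are similar in spirit but not guaranteed to produce identical trees.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: all combinations of three binary features, repeated 5 times.
X_demo = np.array([[(i >> b) & 1 for b in range(3)] for i in range(8)] * 5, dtype=float)
y_demo = np.where(X_demo[:, 0] == 1, 'Yes', 'No')

# Bagging over trees that look at a random subset of features at every split ...
bagged_random_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features='sqrt'), n_estimators=25, random_state=0
).fit(X_demo, y_demo)

# ... which is roughly what RandomForestClassifier does out of the box.
forest = RandomForestClassifier(
    n_estimators=25, max_features='sqrt', random_state=0
).fit(X_demo, y_demo)

bagging_score = bagged_random_trees.score(X_demo, y_demo)
forest_score = forest.score(X_demo, y_demo)
print(bagging_score, forest_score)
```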

4.2. Predicting the Results#

# Making predictions on the test set
y_pred = random_forest.predict(X_test)

y_pred

4.3. Evaluating the model#

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)
