Bagging Models#
This is supplementary material for the Machine Learning Simplified book. It covers the Python implementations of the topics discussed, while the detailed explanations can be found in the book.
I also assume you know Python syntax and how it works. If you don't, I highly recommend that you take a break and get introduced to the language before going forward with my code.
This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.
This notebook supplements Chapter 9, Ensemble Models, of the Machine Learning For Everyone book.
1. Required Libraries, Data & Variables#
Let’s import the data and have a look at it:
import pandas as pd
data = {
    'Day': list(range(1, 31)),
    'Temperature': [
        'Cold', 'Hot', 'Cold', 'Hot', 'Hot',
        'Cold', 'Hot', 'Cold', 'Hot', 'Cold',
        'Hot', 'Cold', 'Hot', 'Cold', 'Hot',
        'Cold', 'Hot', 'Cold', 'Hot', 'Cold',
        'Hot', 'Cold', 'Hot', 'Cold', 'Hot',
        'Cold', 'Hot', 'Cold', 'Hot', 'Cold'
    ],
    'Humidity': [
        'Normal', 'Normal', 'Normal', 'High', 'High',
        'Normal', 'High', 'Normal', 'High', 'Normal',
        'High', 'Normal', 'High', 'Normal', 'High',
        'Normal', 'High', 'Normal', 'High', 'Normal',
        'High', 'Normal', 'High', 'Normal', 'High',
        'Normal', 'High', 'Normal', 'High', 'Normal'
    ],
    'Outlook': [
        'Rain', 'Rain', 'Sunny', 'Sunny', 'Rain',
        'Sunny', 'Rain', 'Sunny', 'Rain', 'Sunny',
        'Rain', 'Sunny', 'Rain', 'Sunny', 'Rain',
        'Sunny', 'Rain', 'Sunny', 'Rain', 'Sunny',
        'Rain', 'Sunny', 'Rain', 'Sunny', 'Rain',
        'Sunny', 'Rain', 'Sunny', 'Rain', 'Sunny'
    ],
    'Wind': [
        'Strong', 'Weak', 'Weak', 'Weak', 'Weak',
        'Strong', 'Weak', 'Weak', 'Weak', 'Strong',
        'Weak', 'Weak', 'Strong', 'Weak', 'Weak',
        'Weak', 'Strong', 'Weak', 'Weak', 'Weak',
        'Strong', 'Weak', 'Weak', 'Weak', 'Weak',
        'Strong', 'Weak', 'Weak', 'Weak', 'Strong'
    ],
    'Golf Played': [
        'No', 'No', 'Yes', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes',
        'No', 'Yes', 'No', 'Yes', 'Yes'
    ]
}
# Converting the dictionary into a DataFrame
df = pd.DataFrame(data)
# Displaying the DataFrame
df.head(10)
| | Day | Temperature | Humidity | Outlook | Wind | Golf Played |
---|---|---|---|---|---|---|
0 | 1 | Cold | Normal | Rain | Strong | No |
1 | 2 | Hot | Normal | Rain | Weak | No |
2 | 3 | Cold | Normal | Sunny | Weak | Yes |
3 | 4 | Hot | High | Sunny | Weak | Yes |
4 | 5 | Hot | High | Rain | Weak | Yes |
5 | 6 | Cold | Normal | Sunny | Strong | No |
6 | 7 | Hot | High | Rain | Weak | Yes |
7 | 8 | Cold | Normal | Sunny | Weak | No |
8 | 9 | Hot | High | Rain | Weak | Yes |
9 | 10 | Cold | Normal | Sunny | Strong | Yes |
2. Preparation of the Dataset#
One-hot encoding the categorical variables
from sklearn.preprocessing import OneHotEncoder
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
encoded_features = encoder.fit_transform(df[['Temperature', 'Humidity', 'Outlook', 'Wind']])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['Temperature', 'Humidity', 'Outlook', 'Wind']))
Visualizing the first 10 records of the encoded dataframe:
encoded_df.head(10)
| | Temperature_Cold | Temperature_Hot | Humidity_High | Humidity_Normal | Outlook_Rain | Outlook_Sunny | Wind_Strong | Wind_Weak |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
4 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
5 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
6 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
7 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
8 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
9 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
Adding the encoded features back to the dataframe
df = df.join(encoded_df)
df.head(5)
| | Day | Temperature | Humidity | Outlook | Wind | Golf Played | Temperature_Cold | Temperature_Hot | Humidity_High | Humidity_Normal | Outlook_Rain | Outlook_Sunny | Wind_Strong | Wind_Weak |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Cold | Normal | Rain | Strong | No | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
1 | 2 | Hot | Normal | Rain | Weak | No | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
2 | 3 | Cold | Normal | Sunny | Weak | Yes | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
3 | 4 | Hot | High | Sunny | Weak | Yes | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
4 | 5 | Hot | High | Rain | Weak | Yes | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
Preparing the features by removing categorical variables.
X = df.drop(['Day', 'Temperature', 'Humidity', 'Outlook', 'Wind', 'Golf Played'], axis=1)
X.head(5)
| | Temperature_Cold | Temperature_Hot | Humidity_High | Humidity_Normal | Outlook_Rain | Outlook_Sunny | Wind_Strong | Wind_Weak |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
4 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
Defining y:
y = df['Golf Played']
y
0 No
1 No
2 Yes
3 Yes
4 Yes
5 No
6 Yes
7 No
8 Yes
9 Yes
10 No
11 Yes
12 No
13 Yes
14 Yes
15 No
16 Yes
17 No
18 Yes
19 Yes
20 No
21 Yes
22 No
23 Yes
24 Yes
25 No
26 Yes
27 No
28 Yes
29 Yes
Name: Golf Played, dtype: object
Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
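With 30 rows and `test_size=0.2`, the split leaves 24 rows for training and 6 for testing. On a set this small, the class balance of the test rows can drift from the full data. Below is a hedged sketch of a stratified split on stand-in data. The `X_toy` / `y_toy` arrays are hypothetical and only mimic the 18 "Yes" / 12 "No" balance of the golf labels. Using `stratify` is my suggestion, not something the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 30 rows, 18 'Yes' and 12 'No' (same balance as the golf labels)
X_toy = np.arange(30).reshape(-1, 1)
y_toy = np.array(['Yes'] * 18 + ['No'] * 12)

# stratify=y_toy keeps the Yes/No ratio roughly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train size:", len(X_tr), "Test size:", len(X_te))
print("Test labels:", sorted(y_te))
```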
3. Bagging Ensemble#
3.1. Building a Bagging Ensemble#
Creating the Bagging classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Creating the Bagging classifier
# Using a DecisionTreeClassifier as the base classifier
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named `base_estimator` before scikit-learn 1.2
    n_estimators=10,   # Number of trees
    max_samples=0.8,   # Fraction of samples to draw from X to train each base estimator
    max_features=0.8,  # Fraction of features to draw from X to train each base estimator
    random_state=42
)
model.fit(X_train, y_train)
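To show what bagging does under the hood, here is a rough hand-rolled sketch: each tree is trained on a bootstrap sample (rows drawn with replacement), and the ensemble predicts by majority vote. The data comes from `make_classification` and is purely a stand-in. The names `X_demo`, `y_demo`, and `bagged_trees` are mine. The sketch uses hard voting for simplicity, whereas scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them, so treat this as an illustration and not as the library's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 0/1 labels (not the golf dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.default_rng(42)
n_trees = 10
bagged_trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_demo[idx], y_demo[idx])
    bagged_trees.append(tree)

# Every tree votes on every row; shape is (n_trees, n_samples)
votes = np.stack([t.predict(X_demo) for t in bagged_trees])

# Majority vote over the 0/1 predictions (ties go to class 1)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the hand-rolled ensemble:", (majority == y_demo).mean())
```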
3.2. Visualizing Decision Trees#
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
# Building 5 standalone decision trees for illustration
# (fit on the full training set, separate from `model`)
feature_names = encoder.get_feature_names_out(['Temperature', 'Humidity', 'Outlook', 'Wind'])
trees = [DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42 + i) for i in range(5)]
for tree in trees:
    tree.fit(X_train, y_train)
# Plotting all 5 trees
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 4), dpi=300)
for i, tree in enumerate(trees):
    plot_tree(tree, feature_names=feature_names, class_names=['No', 'Yes'], filled=True, ax=axes[i])
    axes[i].set_title(f'Tree {i+1}')
plt.tight_layout()
plt.show()
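If you want to look at the trees that actually live inside a fitted `BaggingClassifier`, a possible approach is to walk over its `estimators_` together with `estimators_features_`, which records the feature subset each tree was given. The sketch below uses synthetic data and made-up feature names (`f0` to `f4`), and prints the trees as text with `export_text` instead of plotting them. I pass the base estimator positionally so the snippet does not depend on the parameter name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_names = np.array([f'f{i}' for i in range(X_demo.shape[1])])

bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2),  # base estimator, passed positionally
    n_estimators=3,
    max_features=0.8,  # each tree sees 4 of the 5 features
    random_state=42,
)
bag.fit(X_demo, y_demo)

# estimators_features_[i] holds the column indices used by estimators_[i]
for i, (tree, cols) in enumerate(zip(bag.estimators_, bag.estimators_features_)):
    used = list(demo_names[cols])
    print(f'Tree {i + 1} uses features {used}')
    print(export_text(tree, feature_names=used))
```

Mapping the names through `estimators_features_` matters here: with `max_features` below 1.0, each tree is trained on its own subset of columns, so passing the full list of feature names would mislabel the splits.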
3.3. Predicting the Results#
Predicting the test set results
y_pred = model.predict(X_test)
y_pred
3.4. Evaluating the model#
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", report)
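Since the test set here has only 6 rows, a single accuracy number can swing a lot from one split to another. One way to get a steadier estimate is k-fold cross-validation. The sketch below runs it on synthetic stand-in data (`X_demo`, `y_demo` are not the golf dataset), and the choice of 5 folds is just an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the golf dataset)
X_demo, y_demo = make_classification(n_samples=60, n_features=8, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)

# 5-fold cross-validation: one accuracy value per held-out fold
scores = cross_val_score(bag, X_demo, y_demo, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```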
4. Random Forest Classifier#
4.1. Building a Random Forest#
Creating the Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=3, random_state=42)
random_forest.fit(X_train, y_train)
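A random forest is bagging of decision trees with one extra twist: each split considers only a random subset of the features. Because every tree is trained on a bootstrap sample, the rows it never saw can serve as a built-in validation set (the out-of-bag estimate). Here is a small sketch on synthetic stand-in data. Turning on `oob_score=True` is my addition and is not used in the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the golf dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=3,
    oob_score=True,  # score each row using only trees that did not train on it
    random_state=42,
)
forest.fit(X_demo, y_demo)

print("Out-of-bag accuracy:", forest.oob_score_)
# Impurity-based importances, normalized to sum to 1
print("Feature importances:", forest.feature_importances_)
```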
4.2. Predicting the Results#
# Making predictions on the test set
y_pred = random_forest.predict(X_test)
y_pred
4.3. Evaluating the model#
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", report)