Feature Transformation & Binning#

  • This is supplementary material for the Machine Learning Simplified book. It focuses on the Python implementation of the topics discussed; all detailed explanations can be found in the book.

  • I also assume you know Python syntax and how it works. If you don’t, I highly recommend taking a break and getting introduced to the language before going through my code.

  • This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.

This notebook is a supplement to Chapter 7 (Data Preparation) of the Machine Learning Simplified book.

1. Required Libraries, Data & Variables#

Let’s import the data and have a look at it:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler


# Define the data as a dictionary
data = {
    "Age": [32, 46, 25, 36, 29, 54],
    "Income (€)": [95000, 210000, 75000, 30000, 55000, 430000],
    "Vehicle": ["none", "car", "truck", "car", "none", "car"],
    "Kids": ["no", "yes", "yes", "yes", "no", "yes"],
    "Residence": ["downtown", "downtown", "suburbs", "suburbs", "suburbs", "downtown"]
}

# Create DataFrame from the dictionary
df = pd.DataFrame(data)

# Print the DataFrame
df
Age Income (€) Vehicle Kids Residence
0 32 95000 none no downtown
1 46 210000 car yes downtown
2 25 75000 truck yes suburbs
3 36 30000 car yes suburbs
4 29 55000 none no suburbs
5 54 430000 car yes downtown
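Before transforming anything, it can help to check how pandas has typed each column, since object columns will need encoding and numeric columns may need scaling or binning. This quick check is optional and uses only standard pandas attributes:

# Inspect the inferred data types: 'object' columns will need encoding
print(df.dtypes)

# Summary statistics for the numeric columns (Age and Income)
df.describe()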

2. Feature Encoding#

After having cleaned your data, you must encode it in a format that the ML algorithm can consume. One important step is to encode complex data types, such as strings or categorical variables, in a numeric format.

We will illustrate feature encoding on the dataset above, where the independent variables are Age, Income, Vehicle, and Kids (of which Vehicle and Kids are categorical), and the target variable is a person’s Residence (whether a person lives downtown or in the suburbs).

2.1. Apply One-Hot Encoding to “Vehicle” and “Kids”#

# One-hot encode the 'Vehicle' and 'Kids' columns
ohe = OneHotEncoder(sparse_output=False)  # return a dense array rather than a sparse matrix
encoded_features = pd.DataFrame(ohe.fit_transform(df[['Vehicle', 'Kids']]))

# Get new column names from OneHotEncoder
encoded_features.columns = ohe.get_feature_names_out(['Vehicle', 'Kids'])

# Concatenate the encoded features back to the original DataFrame
df_encoded = pd.concat([df, encoded_features], axis=1).drop(['Vehicle', 'Kids'], axis=1)
df_encoded
Age Income (€) Residence Vehicle_car Vehicle_none Vehicle_truck Kids_no Kids_yes
0 32 95000 downtown 0.0 1.0 0.0 1.0 0.0
1 46 210000 downtown 1.0 0.0 0.0 0.0 1.0
2 25 75000 suburbs 0.0 0.0 1.0 0.0 1.0
3 36 30000 suburbs 1.0 0.0 0.0 0.0 1.0
4 29 55000 suburbs 0.0 1.0 0.0 1.0 0.0
5 54 430000 downtown 1.0 0.0 0.0 0.0 1.0
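As a side note, the same one-hot encoding can be obtained directly in pandas with get_dummies. A minimal sketch (the resulting dummy columns may have a boolean or integer dtype rather than the floats produced by scikit-learn):

# Alternative: one-hot encode the two categorical columns with pandas
df_dummies = pd.get_dummies(df, columns=['Vehicle', 'Kids'])
df_dummies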

2.2. Apply Label Encoding to “Residence”#

# Label encode the 'Residence' column
le = LabelEncoder()
df_encoded['Residence'] = le.fit_transform(df_encoded['Residence'])
df_encoded
Age Income (€) Residence Vehicle_car Vehicle_none Vehicle_truck Kids_no Kids_yes
0 32 95000 0 0.0 1.0 0.0 1.0 0.0
1 46 210000 0 1.0 0.0 0.0 0.0 1.0
2 25 75000 1 0.0 0.0 1.0 0.0 1.0
3 36 30000 1 1.0 0.0 0.0 0.0 1.0
4 29 55000 1 0.0 1.0 0.0 1.0 0.0
5 54 430000 0 1.0 0.0 0.0 0.0 1.0
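To see which integer was assigned to which category, the fitted LabelEncoder stores the sorted class labels in its classes_ attribute, and inverse_transform maps the integers back to labels:

# Classes are sorted alphabetically: index 0 -> 'downtown', index 1 -> 'suburbs'
print(le.classes_)

# Map the encoded target back to the original labels
print(le.inverse_transform(df_encoded['Residence']))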

3. Feature Scaling#

Many datasets contain numeric features with significantly different scales.

For example, the Age feature ranges from 25 to 54 (years), the Income feature ranges from \(30,000\) EUR to \(430,000\) EUR, and the features Vehicle_none, Vehicle_car, Vehicle_truck, Kids_yes and Kids_no all range from \(0\) to \(1\).

Unscaled data will not, technically, prevent the ML algorithm from running, but it can often cause problems during learning.

For example, since the Income feature has much larger values than the other features, it can dominate the learning process and influence the predictions far more than the other features, which is not necessarily what we want. (Some ML models, such as decision trees, are invariant to feature scaling and are unaffected by this.)

To ensure that the measurement scale doesn’t adversely affect our learning algorithm, we scale, or normalize, each feature to a common range of values.
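One common approach, and the one implemented by StandardScaler, is standardization: each value of the \(j^{th}\) feature is shifted by the feature’s mean \(\mu^{(j)}\) and divided by its standard deviation \(\sigma^{(j)}\):

\[ \begin{equation*} z^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sigma^{(j)}} \end{equation*} \]

Here is example Python code that demonstrates how to scale your features using StandardScaler: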

# Initialize the StandardScaler
scaler = StandardScaler()
# List of all the columns
df_encoded.columns
Index(['Age', 'Income (€)', 'Residence', 'Vehicle_car', 'Vehicle_none',
       'Vehicle_truck', 'Kids_no', 'Kids_yes'],
      dtype='object')
# List of columns to scale (we take all but the target variable)
columns_to_scale = ['Age', 'Income (€)', 'Vehicle_car', 'Vehicle_none',
       'Vehicle_truck', 'Kids_no', 'Kids_yes']
# Fit the scaler to the data and transform it
df_encoded[columns_to_scale] = scaler.fit_transform(df_encoded[columns_to_scale])
df_encoded
Age Income (€) Residence Vehicle_car Vehicle_none Vehicle_truck Kids_no Kids_yes
0 -0.498342 -0.392844 0 -1.0 1.414214 -0.447214 1.414214 -1.414214
1 0.897015 0.441194 0 1.0 -0.707107 -0.447214 -0.707107 0.707107
2 -1.196020 -0.537894 1 -1.0 -0.707107 2.236068 -0.707107 0.707107
3 -0.099668 -0.864257 1 1.0 -0.707107 -0.447214 -0.707107 0.707107
4 -0.797347 -0.682945 1 -1.0 1.414214 -0.447214 1.414214 -1.414214
5 1.694362 2.036746 0 1.0 -0.707107 -0.447214 -0.707107 0.707107
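After scaling, each standardized column has a mean of (approximately) 0 and a standard deviation of 1. Note that StandardScaler divides by the population standard deviation (ddof=0), which we can verify directly:

# Each scaled column should now have mean ~0 and population std ~1
print(df_encoded[columns_to_scale].mean().round(6))
print(df_encoded[columns_to_scale].std(ddof=0).round(6))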

4. Feature Binning#

Feature binning is the process of converting a numerical feature (either continuous or discrete) into a categorical feature represented by a set of ranges, or bins.

For example, instead of representing age as a single real-valued feature, we chop the range of ages into 3 discrete bins:

\[ \begin{equation*} young \in [ages \ 25-34], \qquad middle \in [ages \ 35-44], \qquad old \in [ages \ 45-54] \end{equation*} \]

4.1. General Approach#

To implement feature binning or discretization of the “Age” variable into categorical bins in Python, you can use the pandas library, which provides a straightforward function called cut for binning continuous variables. Here’s how you can convert the age into three categories based on the provided ranges:

df
Age Income (€) Vehicle Kids Residence
0 32 95000 none no downtown
1 46 210000 car yes downtown
2 25 75000 truck yes suburbs
3 36 30000 car yes suburbs
4 29 55000 none no suburbs
5 54 430000 car yes downtown
# Define bins and their labels
bins = [24, 34, 44, 54]  # Bin edges; left edges are exclusive, so the bins cover ages 25-34, 35-44, and 45-54
labels = ['Young', 'Middle', 'Old']

# Perform binning
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True)  # right=True means inclusive on the right
df
Age Income (€) Vehicle Kids Residence Age Group
0 32 95000 none no downtown Young
1 46 210000 car yes downtown Old
2 25 75000 truck yes suburbs Young
3 36 30000 car yes suburbs Middle
4 29 55000 none no suburbs Young
5 54 430000 car yes downtown Old

This approach allows for clear and meaningful categorization of ages, which can be very useful for analysis or as a feature in machine learning models where age is an important factor.
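Note that pd.cut returns an ordered pandas Categorical column. If a downstream model needs numbers rather than labels, the ordinal codes can be extracted with .cat.codes, for example:

# .cat.codes gives the integer code of each bin: Young=0, Middle=1, Old=2
age_group_codes = df['Age Group'].cat.codes
print(age_group_codes)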

4.2. Equal Width Binning#

Equal width binning divides the range of values of a feature into bins with equal width.

Usually, we specify the number of bins as a hyper-parameter \(K\), and then compute the width of each bin as

\[ \begin{equation} w = \Big[\frac{max^{(j)} - min^{(j)}}{K}\Big] \end{equation} \]

where \(max^{(j)}\) and \(min^{(j)}\) are the \(j^{th}\) feature’s maximum and minimum values, respectively.

The ranges of the \(K\) bins are then

\[ \begin{equation} \begin{split} Bin \ 1&: [min, \ min + w - 1] \\ Bin \ 2&: [min+w, \ min + 2\cdot w - 1] \\ &\ \ \vdots \\ Bin \ K&: [min + (K-1)\cdot w, \ max] \end{split} \end{equation} \]

As an example of equal width binning, consider splitting the Age feature in the Amsterdam demographics dataset into \(K=3\) bins.

The bin’s width is:

\[ \begin{equation*} w = \Big[\frac{max-min}{K}\Big] = \Big[\frac{54-25}{3}\Big] = 9.7 \approx 10 \end{equation*} \]

which we rounded to the nearest integer because Age values are always integers (in this dataset).

To implement equal width binning in Python and calculate each bin’s range for the “Age” feature of the Amsterdam demographics dataset using \(K=3\) bins, we can use the numpy library to help with calculations and then use pandas for binning. Here’s how you can perform this task:

# Number of bins
K = 3

# Calculate the width of each bin
min_age = df['Age'].min()
max_age = df['Age'].max()
width = (max_age - min_age) / K

# Define K+1 equally spaced bin edges between min and max
bins = np.linspace(min_age, max_age, num=K+1)

print(bins)
[25.         34.66666667 44.33333333 54.        ]
# Create bin labels
labels = [f'Bin {i+1}' for i in range(K)]

print(labels)
['Bin 1', 'Bin 2', 'Bin 3']
# Perform binning
df['Age Bin'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True, right=True)
df
Age Income (€) Vehicle Kids Residence Age Group Age Bin
0 32 95000 none no downtown Young Bin 1
1 46 210000 car yes downtown Old Bin 3
2 25 75000 truck yes suburbs Young Bin 1
3 36 30000 car yes suburbs Middle Bin 2
4 29 55000 none no suburbs Young Bin 1
5 54 430000 car yes downtown Old Bin 3
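As a quick sanity check, we can count how many people fall into each of the three equal-width bins:

# Count observations per bin (equal width does not imply equal counts)
df['Age Bin'].value_counts(sort=False)

With this toy dataset, Bin 1 holds three people, Bin 2 one, and Bin 3 two, illustrating that equal-width bins can be quite unbalanced.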