# Feature Transformation & Binning
This is supplementary material for the Machine Learning Simplified book. It sheds light on the Python implementations of the topics discussed, while all detailed explanations can be found in the book.
I also assume you know Python syntax and how it works. If you don’t, I highly recommend taking a break and getting introduced to the language before going forward with my code.
This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.
This notebook is a supplement for Chapter 7 (Data Preparation) of the Machine Learning Simplified book.
## 1. Required Libraries, Data & Variables
Let’s import the required libraries, define the data, and have a look at it:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
# Define the data as a dictionary
data = {
"Age": [32, 46, 25, 36, 29, 54],
"Income (€)": [95000, 210000, 75000, 30000, 55000, 430000],
"Vehicle": ["none", "car", "truck", "car", "none", "car"],
"Kids": ["no", "yes", "yes", "yes", "no", "yes"],
"Residence": ["downtown", "downtown", "suburbs", "suburbs", "suburbs", "downtown"]
}
# Create DataFrame from the dictionary
df = pd.DataFrame(data)
# Print the DataFrame
df
| | Age | Income (€) | Vehicle | Kids | Residence |
---|---|---|---|---|---|
0 | 32 | 95000 | none | no | downtown |
1 | 46 | 210000 | car | yes | downtown |
2 | 25 | 75000 | truck | yes | suburbs |
3 | 36 | 30000 | car | yes | suburbs |
4 | 29 | 55000 | none | no | suburbs |
5 | 54 | 430000 | car | yes | downtown |
## 2. Feature Encoding
After having cleaned your data, you must encode it in a form the ML algorithm can consume. One important step is encoding complex data types, such as strings or categorical variables, in a numeric format.
We will illustrate feature encoding on the dataset above, where the independent variables are Age, Income, Vehicle, and Kids (Vehicle and Kids are categorical), and the target variable is a person’s Residence (whether a person lives downtown or in the suburbs).
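Before encoding, you can check which columns pandas treats as numeric and which as strings; a quick check (dtype object indicates string columns that still need encoding):
# Columns with dtype 'object' hold strings and must be encoded before modeling
df.dtypes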
### 2.1. Apply One-Hot Encoding to “Vehicle” and “Kids”
# One-hot encode the 'Vehicle' and 'Kids' columns
ohe = OneHotEncoder(sparse_output=False)
encoded_features = pd.DataFrame(ohe.fit_transform(df[['Vehicle', 'Kids']]))
# Get new column names from OneHotEncoder
encoded_features.columns = ohe.get_feature_names_out(['Vehicle', 'Kids'])
# Concatenate the encoded features back to the original DataFrame
df_encoded = pd.concat([df, encoded_features], axis=1).drop(['Vehicle', 'Kids'], axis=1)
df_encoded
| | Age | Income (€) | Residence | Vehicle_car | Vehicle_none | Vehicle_truck | Kids_no | Kids_yes |
---|---|---|---|---|---|---|---|---|
0 | 32 | 95000 | downtown | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
1 | 46 | 210000 | downtown | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
2 | 25 | 75000 | suburbs | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
3 | 36 | 30000 | suburbs | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 29 | 55000 | suburbs | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 | 54 | 430000 | downtown | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
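As an aside, pandas itself can produce an equivalent encoding without scikit-learn. Here is a minimal sketch using pd.get_dummies (an alternative shown for comparison, not the approach used above):
# pd.get_dummies one-hot encodes the listed columns in a single call;
# generated column names follow the <column>_<category> pattern
df_dummies = pd.get_dummies(df, columns=['Vehicle', 'Kids'])
df_dummies
One practical difference: a fitted OneHotEncoder remembers the category set, so it can apply the identical encoding to new data at prediction time, which is why it is usually preferred inside ML pipelines.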
### 2.2. Apply Label Encoding to “Residence”
# Label encode the 'Residence' column
le = LabelEncoder()
df_encoded['Residence'] = le.fit_transform(df_encoded['Residence'])
df_encoded
| | Age | Income (€) | Residence | Vehicle_car | Vehicle_none | Vehicle_truck | Kids_no | Kids_yes |
---|---|---|---|---|---|---|---|---|
0 | 32 | 95000 | 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
1 | 46 | 210000 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
2 | 25 | 75000 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
3 | 36 | 30000 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 29 | 55000 | 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 | 54 | 430000 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
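To double-check which label received which integer, the fitted LabelEncoder exposes the mapping; a quick sketch:
# classes_ lists the original labels in encoded order:
# classes_[0] was encoded as 0, classes_[1] as 1, and so on
print(le.classes_)
# inverse_transform maps encoded integers back to the original labels
print(le.inverse_transform([0, 1]))
Since LabelEncoder assigns codes in sorted order, downtown becomes 0 and suburbs becomes 1, which matches the table above.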
## 3. Feature Scaling
Many datasets contain numeric features with significantly different numeric scales.
For example, the Age feature ranges from \(25\) to \(54\) (years), and the Income feature ranges from \(30,000\) EUR to \(430,000\) EUR, while the features Vehicle_none, Vehicle_car, Vehicle_truck, Kids_yes and Kids_no all range from \(0\) to \(1\).
Unscaled data will not, technically, prevent the ML algorithm from running, but it can often lead to problems in the learning algorithm.
For example, since the Income feature has much larger values than the other features, it will influence the model’s predictions much more, and we don’t necessarily want that to be the case. (Some ML models, like decision trees, are invariant to feature scaling, but many others, such as distance-based methods, are not.)
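To make this concrete, here is a quick illustration (my own, not from the book) of how Income dominates a distance computation between the first two people in the dataset:
# Euclidean distance between persons 0 and 1 using the raw Age and Income values:
# the Income difference (115000) dwarfs the Age difference (14)
p0 = np.array([32, 95000])
p1 = np.array([46, 210000])
print(np.linalg.norm(p0 - p1))  # ~115000, driven almost entirely by Income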
To ensure that the measurement scale doesn’t adversely affect our learning algorithm, we scale, or normalize, each feature to a common scale. StandardScaler does this by standardizing each feature \(x\), subtracting its mean \(\mu\) and dividing by its standard deviation \(\sigma\), so that \(z = (x - \mu)/\sigma\) has zero mean and unit variance. Here is example Python code that demonstrates how to scale your features using StandardScaler:
# Initialize the StandardScaler
scaler = StandardScaler()
# List of all the columns
df_encoded.columns
Index(['Age', 'Income (€)', 'Residence', 'Vehicle_car', 'Vehicle_none',
'Vehicle_truck', 'Kids_no', 'Kids_yes'],
dtype='object')
# List of columns to scale (all columns except the target variable)
columns_to_scale = ['Age', 'Income (€)', 'Vehicle_car', 'Vehicle_none',
'Vehicle_truck', 'Kids_no', 'Kids_yes']
# Fit the scaler to the data and transform it
df_encoded[columns_to_scale] = scaler.fit_transform(df_encoded[columns_to_scale])
df_encoded
| | Age | Income (€) | Residence | Vehicle_car | Vehicle_none | Vehicle_truck | Kids_no | Kids_yes |
---|---|---|---|---|---|---|---|---|
0 | -0.498342 | -0.392844 | 0 | -1.0 | 1.414214 | -0.447214 | 1.414214 | -1.414214 |
1 | 0.897015 | 0.441194 | 0 | 1.0 | -0.707107 | -0.447214 | -0.707107 | 0.707107 |
2 | -1.196020 | -0.537894 | 1 | -1.0 | -0.707107 | 2.236068 | -0.707107 | 0.707107 |
3 | -0.099668 | -0.864257 | 1 | 1.0 | -0.707107 | -0.447214 | -0.707107 | 0.707107 |
4 | -0.797347 | -0.682945 | 1 | -1.0 | 1.414214 | -0.447214 | 1.414214 | -1.414214 |
5 | 1.694362 | 2.036746 | 0 | 1.0 | -0.707107 | -0.447214 | -0.707107 | 0.707107 |
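Note that StandardScaler centers each feature around zero rather than mapping it to a fixed interval. If you specifically want every feature in a common range such as \([0, 1]\), scikit-learn’s MinMaxScaler is a common alternative; a minimal sketch, applied to the original unscaled columns (not part of the pipeline above):
from sklearn.preprocessing import MinMaxScaler
# MinMaxScaler rescales each feature to [0, 1]: x_scaled = (x - min) / (max - min)
mm_scaler = MinMaxScaler()
age_income_01 = mm_scaler.fit_transform(df[['Age', 'Income (€)']])
print(age_income_01)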
## 4. Feature Binning
Feature binning is the process of converting a numerical (either continuous or discrete) feature into a categorical feature represented by a set of ranges, or bins.
For example, instead of representing age as a single real-valued feature, we chop the range of ages into 3 discrete bins:

$$ young \in [ages \ 25-34], \qquad middle \in [ages \ 35-44], \qquad old \in [ages \ 45-54] $$
### 4.1. General Approach
To implement feature binning, or discretization, of the “Age” variable into categorical bins in Python, you can use the pandas library, which provides a straightforward method called cut for binning continuous variables. Here’s how you can convert the age into three categories based on the provided ranges:
df
| | Age | Income (€) | Vehicle | Kids | Residence |
---|---|---|---|---|---|
0 | 32 | 95000 | none | no | downtown |
1 | 46 | 210000 | car | yes | downtown |
2 | 25 | 75000 | truck | yes | suburbs |
3 | 36 | 30000 | car | yes | suburbs |
4 | 29 | 55000 | none | no | suburbs |
5 | 54 | 430000 | car | yes | downtown |
# Define bins and their labels
bins = [24, 34, 44, 54]  # with right=True these edges produce the bins (24, 34], (34, 44], (44, 54]
labels = ['Young', 'Middle', 'Old']
# Perform binning
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True) # right=True means inclusive on the right
df
| | Age | Income (€) | Vehicle | Kids | Residence | Age Group |
---|---|---|---|---|---|---|
0 | 32 | 95000 | none | no | downtown | Young |
1 | 46 | 210000 | car | yes | downtown | Old |
2 | 25 | 75000 | truck | yes | suburbs | Young |
3 | 36 | 30000 | car | yes | suburbs | Middle |
4 | 29 | 55000 | none | no | suburbs | Young |
5 | 54 | 430000 | car | yes | downtown | Old |
This approach allows for clear and meaningful categorization of ages, which can be very useful for analysis or as a feature in machine learning models where age is an important factor.
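Because the new Age Group column is categorical, it can be pushed through the encoding step from Section 2 whenever a numeric representation is required; a small sketch using pd.get_dummies for brevity:
# One-hot encode the binned ages into Age Group_Young / _Middle / _Old columns
age_group_encoded = pd.get_dummies(df['Age Group'], prefix='Age Group')
print(age_group_encoded)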
### 4.2. Equal Width Binning
Equal width binning divides the range of values of a feature into bins with equal width.
Usually, we specify the number of bins as a hyper-parameter \(K\), and then compute the width of each bin as

$$ w = \Big[\frac{max^{(j)} - min^{(j)}}{K}\Big] $$

where \(max^{(j)}\) and \(min^{(j)}\) are the \(j^{th}\) feature’s maximum and minimum values, respectively, and \([\cdot]\) denotes rounding to the nearest integer.
The ranges of the \(K\) bins are then

$$ \begin{split} Bin \ 1&: [min, \ min + w - 1] \\ Bin \ 2&: [min+w, \ min + 2\cdot w - 1] \\ &\ \ \vdots \\ Bin \ K&: [min + (K-1)\cdot w, \ max] \end{split} $$
As an example of equal width binning, consider splitting the Age feature in the Amsterdam demographics dataset into \(K=3\) bins.
The bin’s width is:

$$ w = \Big[\frac{max-min}{K}\Big] = \Big[\frac{54-25}{3}\Big] = [9.7] \approx 10 $$

which we rounded to the nearest integer because Age values are always integers (in this dataset).
To implement equal width binning in Python and calculate each bin’s range for the “Age” feature of the Amsterdam demographics dataset using \(K=3\) bins, we can use the numpy library to help with calculations and then use pandas for binning. Here’s how you can perform this task:
# Number of bins
K = 3
# Calculate the width of each bin (shown for reference; np.linspace below
# derives the same edges from min_age, max_age and K)
min_age = df['Age'].min()
max_age = df['Age'].max()
width = (max_age - min_age) / K
# Define K+1 equally spaced bin edges, i.e. K equal-width bins
bins = np.linspace(min_age, max_age, num=K+1)
print(bins)
[25. 34.66666667 44.33333333 54. ]
# Create bin labels
labels = [f'Bin {i+1}' for i in range(K)]
print(labels)
['Bin 1', 'Bin 2', 'Bin 3']
# Perform binning
df['Age Bin'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True, right=True)
df
| | Age | Income (€) | Vehicle | Kids | Residence | Age Group | Age Bin |
---|---|---|---|---|---|---|---|
0 | 32 | 95000 | none | no | downtown | Young | Bin 1 |
1 | 46 | 210000 | car | yes | downtown | Old | Bin 3 |
2 | 25 | 75000 | truck | yes | suburbs | Young | Bin 1 |
3 | 36 | 30000 | car | yes | suburbs | Middle | Bin 2 |
4 | 29 | 55000 | none | no | suburbs | Young | Bin 1 |
5 | 54 | 430000 | car | yes | downtown | Old | Bin 3 |
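As a closing note, pd.cut can also compute the equal-width edges itself: passing an integer as bins splits the feature’s range into that many equal-width intervals, so the manual edge calculation above is optional. A minimal sketch (the Age Bin (auto) column name is mine):
# bins=K asks pd.cut to split the range of Age into K equal-width intervals
df['Age Bin (auto)'] = pd.cut(df['Age'], bins=K, labels=labels)
df[['Age', 'Age Bin', 'Age Bin (auto)']]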