{
"cells": [
{
"cell_type": "markdown",
"id": "af83d9f4",
"metadata": {},
"source": [
"(chapter7_part2)=\n",
"\n",
"# Feature Transformation & Binning\n",
"\n",
"- This is supplementary material for the [Machine Learning Simplified](https://themlsbook.com) book. It sheds light on Python implementations of the topics discussed, while all detailed explanations can be found in the book. \n",
"- I also assume you know Python syntax and how it works. If you don't, I highly recommend that you take a break and get introduced to the language before going forward with my code. \n",
"- This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> `.ipynb`) to reproduce the code and play around with it. \n",
"\n",
"\n",
"This notebook is a supplement for *Chapter 7. Data Preparation* of the **Machine Learning For Everyone** book.\n",
"\n",
"## 1. Required Libraries, Data & Variables\n",
"\n",
"Let's import the data and have a look at it:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3de67771",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Age Income (€) Vehicle Kids Residence\n",
"0 32 95000 none no downtown\n",
"1 46 210000 car yes downtown\n",
"2 25 75000 truck yes suburbs\n",
"3 36 30000 car yes suburbs\n",
"4 29 55000 none no suburbs\n",
"5 54 430000 car yes downtown"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler\n",
"\n",
"\n",
"# Define the data as a dictionary\n",
"data = {\n",
" \"Age\": [32, 46, 25, 36, 29, 54],\n",
" \"Income (€)\": [95000, 210000, 75000, 30000, 55000, 430000],\n",
" \"Vehicle\": [\"none\", \"car\", \"truck\", \"car\", \"none\", \"car\"],\n",
" \"Kids\": [\"no\", \"yes\", \"yes\", \"yes\", \"no\", \"yes\"],\n",
" \"Residence\": [\"downtown\", \"downtown\", \"suburbs\", \"suburbs\", \"suburbs\", \"downtown\"]\n",
"}\n",
"\n",
"# Create DataFrame from the dictionary\n",
"df = pd.DataFrame(data)\n",
"\n",
"# Print the DataFrame\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "eb1d9106",
"metadata": {},
"source": [
"## 2. Feature Encoding\n",
"\n",
"After having cleaned your data, you must encode it in a way such that the ML algorithm can consume it. One important thing you must do is encode complex data types, like strings or categorical variables, in a numeric format.\n",
"\n",
"We will illustrate feature encoding on the dataset above, where the independent variables Vehicle and Kids are categorical, and the target variable is a person's Residence (whether the person lives downtown or in the suburbs).\n",
"\n",
"### 2.1. Apply One-Hot Encoding to \"Vehicle\" and \"Kids\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "25afb813",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Age Income (€) Residence Vehicle_car Vehicle_none Vehicle_truck \\\n",
"0 32 95000 downtown 0.0 1.0 0.0 \n",
"1 46 210000 downtown 1.0 0.0 0.0 \n",
"2 25 75000 suburbs 0.0 0.0 1.0 \n",
"3 36 30000 suburbs 1.0 0.0 0.0 \n",
"4 29 55000 suburbs 0.0 1.0 0.0 \n",
"5 54 430000 downtown 1.0 0.0 0.0 \n",
"\n",
" Kids_no Kids_yes \n",
"0 1.0 0.0 \n",
"1 0.0 1.0 \n",
"2 0.0 1.0 \n",
"3 0.0 1.0 \n",
"4 1.0 0.0 \n",
"5 0.0 1.0 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# One-hot encode the 'Vehicle' and 'Kids' columns\n",
"ohe = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2\n",
"encoded_features = pd.DataFrame(ohe.fit_transform(df[['Vehicle', 'Kids']]))\n",
"\n",
"# Get new column names from OneHotEncoder\n",
"encoded_features.columns = ohe.get_feature_names_out(['Vehicle', 'Kids'])\n",
"\n",
"# Concatenate the encoded features back to the original DataFrame\n",
"df_encoded = pd.concat([df, encoded_features], axis=1).drop(['Vehicle', 'Kids'], axis=1)\n",
"df_encoded"
]
},
{
"cell_type": "markdown",
"id": "878ad0bb",
"metadata": {},
"source": [
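"As a cross-check on the one-hot encoding above, here is a minimal sketch using `pandas.get_dummies`. This function is not used elsewhere in this notebook and is shown only as an assumed alternative; the variable names `vehicles_kids` and `dummies` are illustrative.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Same two categorical columns as in the dataset above\n",
"vehicles_kids = pd.DataFrame({\n",
"    'Vehicle': ['none', 'car', 'truck', 'car', 'none', 'car'],\n",
"    'Kids': ['no', 'yes', 'yes', 'yes', 'no', 'yes'],\n",
"})\n",
"\n",
"# One indicator column per (feature, category) pair\n",
"dummies = pd.get_dummies(vehicles_kids, columns=['Vehicle', 'Kids'])\n",
"print(list(dummies.columns))\n",
"```\n",
"\n",
"The resulting column names should match the ones returned by `get_feature_names_out`, although the dtype of the indicator values may differ between pandas versions.\n",
"\n",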
"### 2.2. Apply Label Encoding to \"Residence\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9cb26968",
"metadata": {},
"outputs": [],
"source": [
"# Label encode the 'Residence' column\n",
"le = LabelEncoder()\n",
"df_encoded['Residence'] = le.fit_transform(df_encoded['Residence'])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8f0f004a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Age Income (€) Residence Vehicle_car Vehicle_none Vehicle_truck \\\n",
"0 32 95000 0 0.0 1.0 0.0 \n",
"1 46 210000 0 1.0 0.0 0.0 \n",
"2 25 75000 1 0.0 0.0 1.0 \n",
"3 36 30000 1 1.0 0.0 0.0 \n",
"4 29 55000 1 0.0 1.0 0.0 \n",
"5 54 430000 0 1.0 0.0 0.0 \n",
"\n",
" Kids_no Kids_yes \n",
"0 1.0 0.0 \n",
"1 0.0 1.0 \n",
"2 0.0 1.0 \n",
"3 0.0 1.0 \n",
"4 1.0 0.0 \n",
"5 0.0 1.0 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded"
]
},
{
"cell_type": "markdown",
"id": "ccd5c2f3",
"metadata": {},
"source": [
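"To make the label encoding above easier to interpret, the following sketch shows which integer `LabelEncoder` assigns to which class. It relies on `LabelEncoder` sorting the class labels, so `downtown` maps to 0 and `suburbs` to 1. The names `residence`, `encoder` and `codes` are illustrative.\n",
"\n",
"```python\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"residence = ['downtown', 'downtown', 'suburbs', 'suburbs', 'suburbs', 'downtown']\n",
"\n",
"encoder = LabelEncoder()\n",
"codes = encoder.fit_transform(residence)\n",
"\n",
"# classes_[i] is the original label that is encoded as integer i\n",
"print(encoder.classes_.tolist())  # ['downtown', 'suburbs']\n",
"print(codes.tolist())             # [0, 0, 1, 1, 1, 0]\n",
"\n",
"# inverse_transform maps the integer codes back to the original labels\n",
"print(encoder.inverse_transform(codes).tolist())\n",
"```\n",
"\n",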
"## 3. Feature Scaling\n",
"\n",
"Many datasets contain numeric features with significantly different numeric scales.\n",
"\n",
"For example, the Age feature ranges from 25 to 54 (years) and the Income feature ranges from $30,000$ EUR to $430,000$ EUR, while the features Vehicle\\_none, Vehicle\\_car, Vehicle\\_truck, Kids\\_yes and Kids\\_no all range from $0$ to $1$.\n",
"\n",
"Unscaled data will not, technically, prevent an ML algorithm from running, but it can often cause problems during learning.\n",
"\n",
"For example, since the Income feature has much larger values than the other features, it can influence the model's predictions much more than they do. We don't necessarily want this to be the case. (Some ML models, such as decision trees, are invariant to feature scaling.)\n",
"\n",
"To ensure that the measurement scale doesn't adversely affect our learning algorithm, we scale, or normalize, each feature to a common scale. Here is example Python code that scales the features using `StandardScaler`, which standardizes each feature to zero mean and unit variance:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "58272479",
"metadata": {},
"outputs": [],
"source": [
"# Initialize the StandardScaler\n",
"scaler = StandardScaler()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3b418dfc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Age', 'Income (€)', 'Residence', 'Vehicle_car', 'Vehicle_none',\n",
" 'Vehicle_truck', 'Kids_no', 'Kids_yes'],\n",
" dtype='object')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# List of all the columns\n",
"df_encoded.columns"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "b80f9021",
"metadata": {},
"outputs": [],
"source": [
"# List of columns to scale (all but the target variable)\n",
"columns_to_scale = ['Age', 'Income (€)', 'Vehicle_car', 'Vehicle_none',\n",
" 'Vehicle_truck', 'Kids_no', 'Kids_yes']"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e123e726",
"metadata": {},
"outputs": [],
"source": [
"# Fit the scaler to the data and transform it\n",
"df_encoded[columns_to_scale] = scaler.fit_transform(df_encoded[columns_to_scale])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "98417003",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Age Income (€) Residence Vehicle_car Vehicle_none Vehicle_truck \\\n",
"0 -0.498342 -0.392844 0 -1.0 1.414214 -0.447214 \n",
"1 0.897015 0.441194 0 1.0 -0.707107 -0.447214 \n",
"2 -1.196020 -0.537894 1 -1.0 -0.707107 2.236068 \n",
"3 -0.099668 -0.864257 1 1.0 -0.707107 -0.447214 \n",
"4 -0.797347 -0.682945 1 -1.0 1.414214 -0.447214 \n",
"5 1.694362 2.036746 0 1.0 -0.707107 -0.447214 \n",
"\n",
" Kids_no Kids_yes \n",
"0 1.414214 -1.414214 \n",
"1 -0.707107 0.707107 \n",
"2 -0.707107 0.707107 \n",
"3 -0.707107 0.707107 \n",
"4 1.414214 -1.414214 \n",
"5 -0.707107 0.707107 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded"
]
},
{
"cell_type": "markdown",
"id": "3cc89843",
"metadata": {},
"source": [
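"The standardized Age values above can be reproduced by hand. The sketch below assumes the usual z-score formula $z = (x - \\mu) / \\sigma$, where $\\sigma$ is the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. The variable names are illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"age = np.array([32, 46, 25, 36, 29, 54], dtype=float)\n",
"\n",
"# Population mean and standard deviation (ddof=0)\n",
"mu = age.mean()\n",
"sigma = age.std(ddof=0)\n",
"\n",
"# Standardize: subtract the mean, divide by the standard deviation\n",
"z = (age - mu) / sigma\n",
"print(np.round(z, 6))\n",
"```\n",
"\n",
"After this transformation the feature has zero mean and unit variance, regardless of its original units.\n",
"\n",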
"## 4. Feature Binning\n",
"\n",
"Feature binning is a process that converts a numerical (either continuous or discrete) feature into a categorical feature represented by a set of ranges, or bins.\n",
"\n",
"For example, instead of representing age as a single real-valued feature, we split the range of ages into 3 discrete bins:\n",
"$$\n",
"\\begin{equation*}\n",
" young \\in [ages \\ 25 - 34], \\qquad\n",
" middle \\in [ages \\ 35 - 44], \\qquad\n",
" old \\in [ages \\ 45 - 54]\n",
"\\end{equation*}\n",
"$$\n",
"\n",
"### 4.1. General Approach\n",
"To implement feature binning (discretization) of the \"Age\" variable in Python, you can use the `pandas` library, which provides a straightforward function called `cut` for binning continuous variables. Here's how you can convert the age into three categories based on the ranges above:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "64b863a7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Age Income (€) Vehicle Kids Residence\n",
"0 32 95000 none no downtown\n",
"1 46 210000 car yes downtown\n",
"2 25 75000 truck yes suburbs\n",
"3 36 30000 car yes suburbs\n",
"4 29 55000 none no suburbs\n",
"5 54 430000 car yes downtown"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b1e05315",
"metadata": {},
"outputs": [],
"source": [
"# Define bins and their labels\n",
"bins = [24, 34, 44, 54]  # Edges of (24, 34], (34, 44], (44, 54]; the first edge is 24 so that age 25 falls in the first bin\n",
"labels = ['Young', 'Middle', 'Old']\n",
"\n",
"# Perform binning\n",
"df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True) # right=True means inclusive on the right"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0900e0db",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Age | \n",
" Income (€) | \n",
" Vehicle | \n",
" Kids | \n",
" Residence | \n",
" Age Group | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 32 | \n",
" 95000 | \n",
" none | \n",
" no | \n",
" downtown | \n",
" Young | \n",
"
\n",
" \n",
" 1 | \n",
" 46 | \n",
" 210000 | \n",
" car | \n",
" yes | \n",
" downtown | \n",
" Old | \n",
"
\n",
" \n",
" 2 | \n",
" 25 | \n",
" 75000 | \n",
" truck | \n",
" yes | \n",
" suburbs | \n",
" Young | \n",
"
\n",
" \n",
" 3 | \n",
" 36 | \n",
" 30000 | \n",
" car | \n",
" yes | \n",
" suburbs | \n",
" Middle | \n",
"
\n",
" \n",
" 4 | \n",
" 29 | \n",
" 55000 | \n",
" none | \n",
" no | \n",
" suburbs | \n",
" Young | \n",
"
\n",
" \n",
" 5 | \n",
" 54 | \n",
" 430000 | \n",
" car | \n",
" yes | \n",
" downtown | \n",
" Old | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Age Income (€) Vehicle Kids Residence Age Group\n",
"0 32 95000 none no downtown Young\n",
"1 46 210000 car yes downtown Old\n",
"2 25 75000 truck yes suburbs Young\n",
"3 36 30000 car yes suburbs Middle\n",
"4 29 55000 none no suburbs Young\n",
"5 54 430000 car yes downtown Old"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"id": "f620b82d",
"metadata": {},
"source": [
"This approach allows for clear and meaningful categorization of ages, which can be very useful for analysis or as a feature in machine learning models where age is an important factor.\n",
"\n",
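"The boundaries used by `pd.cut` can be made explicit by omitting the labels. In this sketch (with illustrative names `ages` and `groups`), `right=True` makes each interval closed on the right, so the bins are (24, 34], (34, 44] and (44, 54].\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"ages = pd.Series([32, 46, 25, 36, 29, 54])\n",
"\n",
"# Without labels, pd.cut returns the intervals themselves\n",
"groups = pd.cut(ages, bins=[24, 34, 44, 54], right=True)\n",
"\n",
"# Position of each age's bin: 0 = (24, 34], 1 = (34, 44], 2 = (44, 54]\n",
"print(groups.cat.codes.tolist())  # [0, 2, 0, 1, 0, 2]\n",
"```\n",
"\n",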
"### 4.2. Equal Width Binning\n",
"\n",
"Equal width binning divides the range of values of a feature into bins with equal width.\n",
"\n",
"Usually, we specify the number of bins as a hyper-parameter $K$, and then compute the width of each bin as\n",
"\n",
"$$\n",
"\\begin{equation}\n",
" w = \\Big[\\frac{max^{(j)} - min^{(j)}}{K}\\Big]\n",
"\\end{equation}\n",
"$$\n",
"where \n",
"$max^{(j)}$ and $min^{(j)}$ are the $j^{th}$ feature's maximum and minimum values, respectively.\n",
"\n",
"\n",
"The ranges of the $K$ bins are then\n",
"$$\n",
"\\begin{equation}\n",
"\\begin{split}\n",
" Bin \\ 1&: [min, \\ min + w - 1] \\\\\n",
" Bin \\ 2&: [min+w, \\ min + 2\\cdot w - 1] \\\\\n",
" ... \\\\\n",
" Bin \\ K&: [min + (K-1)\\cdot w, \\ max]\n",
" \\label{eq:equal_width_binning}\n",
"\\end{split}\n",
"\\end{equation}\n",
"$$\n",
"\n",
"As an example of equal width binning, consider splitting the Age feature in the Amsterdam demographics dataset into $K=3$ bins.\n",
"\n",
"The bin's width is:\n",
"$$\n",
"\\begin{equation*}\n",
" w = \\Big[\\frac{max-min}{K}\\Big] = \\Big[\\frac{54-25}{3}\\Big] = [9.67] = 10\n",
"\\end{equation*}\n",
"$$\n",
"\n",
"which we rounded to the nearest integer because Age values are always integers (in this dataset).\n",
"\n",
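"Plugging $w = 10$ into the bin formula gives the integer ranges [25, 34], [35, 44] and [45, 54]. A minimal sketch of that calculation follows; `equal_width_ranges` is a hypothetical helper written for this illustration and is not part of any library.\n",
"\n",
"```python\n",
"def equal_width_ranges(min_val, max_val, k):\n",
"    # Bin width, rounded to the nearest integer as in the formula for w\n",
"    w = round((max_val - min_val) / k)\n",
"    ranges = []\n",
"    for i in range(k):\n",
"        low = min_val + i * w\n",
"        # The last bin always ends at the maximum value\n",
"        high = max_val if i == k - 1 else min_val + (i + 1) * w - 1\n",
"        ranges.append((low, high))\n",
"    return ranges\n",
"\n",
"print(equal_width_ranges(25, 54, 3))  # [(25, 34), (35, 44), (45, 54)]\n",
"```\n",
"\n",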
"To implement equal width binning in Python for the \"Age\" feature of the Amsterdam demographics dataset with $K=3$ bins, we can use the `numpy` library to compute the bin edges and then `pandas` for binning. Note that `np.linspace` produces $K$ intervals of exactly equal width with fractional edges, rather than the rounded integer ranges above. Here's how you can perform this task:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1b109d0b",
"metadata": {},
"outputs": [],
"source": [
"# Number of bins\n",
"K = 3\n",
"\n",
"# Calculate the width of each bin\n",
"min_age = df['Age'].min()\n",
"max_age = df['Age'].max()\n",
"width = int(round((max_age - min_age) / K))  # 29 / 3 = 9.67, rounded to 10 as in the text"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4526a8b6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[25. 34.66666667 44.33333333 54. ]\n"
]
}
],
"source": [
"# Define K+1 equally spaced bin edges between min_age and max_age (edges may be fractional)\n",
"bins = np.linspace(min_age, max_age, num=K+1)\n",
"\n",
"print(bins)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e74a40e4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Bin 1', 'Bin 2', 'Bin 3']\n"
]
}
],
"source": [
"# Create bin labels\n",
"labels = [f'Bin {i+1}' for i in range(K)]\n",
"\n",
"print(labels)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "63bc22fe",
"metadata": {},
"outputs": [],
"source": [
"# Perform binning\n",
"df['Age Bin'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True, right=True)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "0f9006aa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Age Income (€) Vehicle Kids Residence Age Group Age Bin\n",
"0 32 95000 none no downtown Young Bin 1\n",
"1 46 210000 car yes downtown Old Bin 3\n",
"2 25 75000 truck yes suburbs Young Bin 1\n",
"3 36 30000 car yes suburbs Middle Bin 2\n",
"4 29 55000 none no suburbs Young Bin 1\n",
"5 54 430000 car yes downtown Old Bin 3"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
}
],
"metadata": {
"jupytext": {
"formats": "md:myst",
"text_representation": {
"extension": ".md",
"format_name": "myst"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
},
"source_map": [
11,
29,
49,
61,
72,
78,
85,
87,
105,
111,
116,
123,
129,
131,
152,
157,
167,
169,
217,
228,
236,
243,
249
]
},
"nbformat": 4,
"nbformat_minor": 5
}