{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "af83d9f4",
   "metadata": {},
   "source": [
    "(chapter7_part2)=\n",
    "\n",
    "# Feature Transformation & Binning\n",
    "\n",
    "- This is a supplement material for the [Machine Learning Simplified](https://themlsbook.com) book. It sheds light on Python implementations of the topics discussed while all detailed explanations can be found in the book. \n",
    "- I also assume you know Python syntax and how it works. If you don't, I highly recommend you to take a break and get introduced to the language before going forward with my code. \n",
    "- This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> `.ipynb`) to reproduce the code and play around with it. \n",
    "\n",
    "\n",
    "This notebook is a supplement for *Chapter 7. Data Preparation* of **Machine Learning For Everyone** book.\n",
    "\n",
    "## 1. Required Libraries, Data & Variables\n",
    "\n",
    "Let's import the data and have a look at it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3de67771",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Income (€)</th>\n",
       "      <th>Vehicle</th>\n",
       "      <th>Kids</th>\n",
       "      <th>Residence</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>32</td>\n",
       "      <td>95000</td>\n",
       "      <td>none</td>\n",
       "      <td>no</td>\n",
       "      <td>downtown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>210000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>downtown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>75000</td>\n",
       "      <td>truck</td>\n",
       "      <td>yes</td>\n",
       "      <td>suburbs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36</td>\n",
       "      <td>30000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>suburbs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>29</td>\n",
       "      <td>55000</td>\n",
       "      <td>none</td>\n",
       "      <td>no</td>\n",
       "      <td>suburbs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>54</td>\n",
       "      <td>430000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>downtown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Age  Income (€) Vehicle Kids Residence\n",
       "0   32       95000    none   no  downtown\n",
       "1   46      210000     car  yes  downtown\n",
       "2   25       75000   truck  yes   suburbs\n",
       "3   36       30000     car  yes   suburbs\n",
       "4   29       55000    none   no   suburbs\n",
       "5   54      430000     car  yes  downtown"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler\n",
    "\n",
    "\n",
    "# Define the data as a dictionary\n",
    "data = {\n",
    "    \"Age\": [32, 46, 25, 36, 29, 54],\n",
    "    \"Income (€)\": [95000, 210000, 75000, 30000, 55000, 430000],\n",
    "    \"Vehicle\": [\"none\", \"car\", \"truck\", \"car\", \"none\", \"car\"],\n",
    "    \"Kids\": [\"no\", \"yes\", \"yes\", \"yes\", \"no\", \"yes\"],\n",
    "    \"Residence\": [\"downtown\", \"downtown\", \"suburbs\", \"suburbs\", \"suburbs\", \"downtown\"]\n",
    "}\n",
    "\n",
    "# Create DataFrame from the dictionary\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# Print the DataFrame\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb1d9106",
   "metadata": {},
   "source": [
    "## 2. Feature Encoding\n",
    "\n",
    "After having cleaned your data, you must encode it in a way such that the ML algorithm can consume it.One important thing you must do is encode complex data types, like strings or categorical variables, in a numeric format.\n",
    "\n",
    "We will illustrate feature encoding on the dataset above, where the  three independent variables are Income, Vehicle, and Kids, each of which are categorical variables, and the target variable is a person's Residence (whether a person lives in downtown or in suburbs).\n",
    "\n",
    "### 2.1. Apply One-Hot Encoding to \"Vehicle\" and \"Kids\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "25afb813",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Income (€)</th>\n",
       "      <th>Residence</th>\n",
       "      <th>Vehicle_car</th>\n",
       "      <th>Vehicle_none</th>\n",
       "      <th>Vehicle_truck</th>\n",
       "      <th>Kids_no</th>\n",
       "      <th>Kids_yes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>32</td>\n",
       "      <td>95000</td>\n",
       "      <td>downtown</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>210000</td>\n",
       "      <td>downtown</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>75000</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36</td>\n",
       "      <td>30000</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>29</td>\n",
       "      <td>55000</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>54</td>\n",
       "      <td>430000</td>\n",
       "      <td>downtown</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Age  Income (€) Residence  Vehicle_car  Vehicle_none  Vehicle_truck  \\\n",
       "0   32       95000  downtown          0.0           1.0            0.0   \n",
       "1   46      210000  downtown          1.0           0.0            0.0   \n",
       "2   25       75000   suburbs          0.0           0.0            1.0   \n",
       "3   36       30000   suburbs          1.0           0.0            0.0   \n",
       "4   29       55000   suburbs          0.0           1.0            0.0   \n",
       "5   54      430000  downtown          1.0           0.0            0.0   \n",
       "\n",
       "   Kids_no  Kids_yes  \n",
       "0      1.0       0.0  \n",
       "1      0.0       1.0  \n",
       "2      0.0       1.0  \n",
       "3      0.0       1.0  \n",
       "4      1.0       0.0  \n",
       "5      0.0       1.0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# One-hot encode the 'Vehicle' and 'Kids' columns\n",
    "ohe = OneHotEncoder(sparse=False)\n",
    "encoded_features = pd.DataFrame(ohe.fit_transform(df[['Vehicle', 'Kids']]))\n",
    "\n",
    "# Get new column names from OneHotEncoder\n",
    "encoded_features.columns = ohe.get_feature_names_out(['Vehicle', 'Kids'])\n",
    "\n",
    "# Concatenate the encoded features back to the original DataFrame\n",
    "df_encoded = pd.concat([df, encoded_features], axis=1).drop(['Vehicle', 'Kids'], axis=1)\n",
    "df_encoded"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "878ad0bb",
   "metadata": {},
   "source": [
    "### 2.2. Apply Label Encoding to \"Residence\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9cb26968",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Label encode the 'Residence' column\n",
    "le = LabelEncoder()\n",
    "df_encoded['Residence'] = le.fit_transform(df_encoded['Residence'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8f0f004a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Income (€)</th>\n",
       "      <th>Residence</th>\n",
       "      <th>Vehicle_car</th>\n",
       "      <th>Vehicle_none</th>\n",
       "      <th>Vehicle_truck</th>\n",
       "      <th>Kids_no</th>\n",
       "      <th>Kids_yes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>32</td>\n",
       "      <td>95000</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>210000</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>75000</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36</td>\n",
       "      <td>30000</td>\n",
       "      <td>1</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>29</td>\n",
       "      <td>55000</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>54</td>\n",
       "      <td>430000</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Age  Income (€)  Residence  Vehicle_car  Vehicle_none  Vehicle_truck  \\\n",
       "0   32       95000          0          0.0           1.0            0.0   \n",
       "1   46      210000          0          1.0           0.0            0.0   \n",
       "2   25       75000          1          0.0           0.0            1.0   \n",
       "3   36       30000          1          1.0           0.0            0.0   \n",
       "4   29       55000          1          0.0           1.0            0.0   \n",
       "5   54      430000          0          1.0           0.0            0.0   \n",
       "\n",
       "   Kids_no  Kids_yes  \n",
       "0      1.0       0.0  \n",
       "1      0.0       1.0  \n",
       "2      0.0       1.0  \n",
       "3      0.0       1.0  \n",
       "4      1.0       0.0  \n",
       "5      0.0       1.0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_encoded"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccd5c2f3",
   "metadata": {},
   "source": [
    "## 3. Feature Scaling\n",
    "\n",
    "Many datasets contain numeric features with significantly different numeric scales.\n",
    "\n",
    "For example, the Age feature ranges from 27 to 54 (years), while the Income feature ranges from $30,000$ EUR to $430,000$ EUR, while the features Vehicle\\_none, Vehicle\\_car, Vehicle\\_truck, Kids\\_yes and Kids\\_no all have the range from $0$ to $1$.\n",
    "\n",
    "Unscaled data will, technically, not prohibit the ML algorithm from running, but can often lead to problems in the learning algorithm.\n",
    "\n",
    "For example, since the Income feature has much larger value than the other features, it will influence the target variable much more. However, some ML models like decision trees are invariant to feature scaling.\n",
    "\n",
    "But we don't necessarily want this to be the case.\n",
    "\n",
    "To ensure that the measurement scale doesn't adversely affect our learning algorithm, we scale, or normalize, each feature to a common range of values. Here is an example Python code that demonstrates how to scale your features using `StandardScaler`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "58272479",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the StandardScaler\n",
    "scaler = StandardScaler()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3b418dfc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Age', 'Income (€)', 'Residence', 'Vehicle_car', 'Vehicle_none',\n",
       "       'Vehicle_truck', 'Kids_no', 'Kids_yes'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# List of all the columns\n",
    "df_encoded.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b80f9021",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List of columns to scale (we take all but| target variable)\n",
    "columns_to_scale = ['Age', 'Income (€)', 'Vehicle_car', 'Vehicle_none',\n",
    "       'Vehicle_truck', 'Kids_no', 'Kids_yes']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e123e726",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit the scaler to the data and transform it\n",
    "df_encoded[columns_to_scale] = scaler.fit_transform(df_encoded[columns_to_scale])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "98417003",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Income (€)</th>\n",
       "      <th>Residence</th>\n",
       "      <th>Vehicle_car</th>\n",
       "      <th>Vehicle_none</th>\n",
       "      <th>Vehicle_truck</th>\n",
       "      <th>Kids_no</th>\n",
       "      <th>Kids_yes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.498342</td>\n",
       "      <td>-0.392844</td>\n",
       "      <td>0</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.414214</td>\n",
       "      <td>-0.447214</td>\n",
       "      <td>1.414214</td>\n",
       "      <td>-1.414214</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.897015</td>\n",
       "      <td>0.441194</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.707107</td>\n",
       "      <td>-0.447214</td>\n",
       "      <td>-0.707107</td>\n",
       "      <td>0.707107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.196020</td>\n",
       "      <td>-0.537894</td>\n",
       "      <td>1</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>-0.707107</td>\n",
       "      <td>2.236068</td>\n",
       "      <td>-0.707107</td>\n",
       "      <td>0.707107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.099668</td>\n",
       "      <td>-0.864257</td>\n",
       "      <td>1</td>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.707107</td>\n",
       "      <td>-0.447214</td>\n",
       "      <td>-0.707107</td>\n",
       "      <td>0.707107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.797347</td>\n",
       "      <td>-0.682945</td>\n",
       "      <td>1</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.414214</td>\n",
       "      <td>-0.447214</td>\n",
       "      <td>1.414214</td>\n",
       "      <td>-1.414214</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1.694362</td>\n",
       "      <td>2.036746</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.707107</td>\n",
       "      <td>-0.447214</td>\n",
       "      <td>-0.707107</td>\n",
       "      <td>0.707107</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Age  Income (€)  Residence  Vehicle_car  Vehicle_none  Vehicle_truck  \\\n",
       "0 -0.498342   -0.392844          0         -1.0      1.414214      -0.447214   \n",
       "1  0.897015    0.441194          0          1.0     -0.707107      -0.447214   \n",
       "2 -1.196020   -0.537894          1         -1.0     -0.707107       2.236068   \n",
       "3 -0.099668   -0.864257          1          1.0     -0.707107      -0.447214   \n",
       "4 -0.797347   -0.682945          1         -1.0      1.414214      -0.447214   \n",
       "5  1.694362    2.036746          0          1.0     -0.707107      -0.447214   \n",
       "\n",
       "    Kids_no  Kids_yes  \n",
       "0  1.414214 -1.414214  \n",
       "1 -0.707107  0.707107  \n",
       "2 -0.707107  0.707107  \n",
       "3 -0.707107  0.707107  \n",
       "4  1.414214 -1.414214  \n",
       "5 -0.707107  0.707107  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_encoded"
   ]
  },
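  {
   "cell_type": "markdown",
   "id": "c4a7e210",
   "metadata": {},
   "source": [
    "As a sanity check on the table above, the standardized Age values can be reproduced by hand. This is a worked sketch that assumes `StandardScaler` is used with its default settings, in which case each value is transformed as $z = (x - \\mu) / \\sigma$. The symbols $\\mu$ (column mean) and $\\sigma$ (population standard deviation, dividing by $n$ rather than $n - 1$) are introduced here only for illustration:\n",
    "\n",
    "$$\n",
    "\\begin{equation*}\n",
    "\\begin{split}\n",
    "    \\mu_{Age} &= \\frac{32 + 46 + 25 + 36 + 29 + 54}{6} = 37 \\\\\n",
    "    \\sigma_{Age} &= \\sqrt{\\frac{(-5)^2 + 9^2 + (-12)^2 + (-1)^2 + (-8)^2 + 17^2}{6}} = \\sqrt{\\frac{604}{6}} \\approx 10.033 \\\\\n",
    "    z_{32} &= \\frac{32 - 37}{10.033} \\approx -0.498, \\qquad z_{54} = \\frac{54 - 37}{10.033} \\approx 1.694\n",
    "\\end{split}\n",
    "\\end{equation*}\n",
    "$$\n",
    "\n",
    "Both results agree with rows 0 and 5 of the Age column."
   ]
  },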
  {
   "cell_type": "markdown",
   "id": "3cc89843",
   "metadata": {},
   "source": [
    "## 4. Feature Binning\n",
    "\n",
    "Feature binning is the process that converts a numerical (either continuous and discrete) feature into a categorical feature represented by a set of ranges, or bins.\n",
    "\n",
    "For example, instead of representing age as a single real-valued feature, we chop ranges of age into 3 discrete bins:\n",
    "$$\n",
    "\\begin{equation*}\n",
    "    young \\in [ages \\ 25  - 34],  \\qquad\n",
    "    middle \\in [ages \\ 35  - 44], \\qquad\n",
    "    old \\in [ages \\ 45  - 54]\n",
    "\\end{equation*}\n",
    "$$\n",
    "\n",
    "### 4.1. General Approach\n",
    "To implement feature binning or discretization of the \"Age\" variable into categorical bins using Python, you can use the `pandas` library which provides a straightforward method called cut for binning continuous variables. Here's how you can convert the age into three categories based on the provided ranges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "64b863a7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Income (€)</th>\n",
       "      <th>Vehicle</th>\n",
       "      <th>Kids</th>\n",
       "      <th>Residence</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>32</td>\n",
       "      <td>95000</td>\n",
       "      <td>none</td>\n",
       "      <td>no</td>\n",
       "      <td>downtown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>210000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>downtown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>75000</td>\n",
       "      <td>truck</td>\n",
       "      <td>yes</td>\n",
       "      <td>suburbs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36</td>\n",
       "      <td>30000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>suburbs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>29</td>\n",
       "      <td>55000</td>\n",
       "      <td>none</td>\n",
       "      <td>no</td>\n",
       "      <td>suburbs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>54</td>\n",
       "      <td>430000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>downtown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Age  Income (€) Vehicle Kids Residence\n",
       "0   32       95000    none   no  downtown\n",
       "1   46      210000     car  yes  downtown\n",
       "2   25       75000   truck  yes   suburbs\n",
       "3   36       30000     car  yes   suburbs\n",
       "4   29       55000    none   no   suburbs\n",
       "5   54      430000     car  yes  downtown"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "b1e05315",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define bins and their labels\n",
    "bins = [24, 34, 44, 54]  # Extend ranges to include all possible ages in each group\n",
    "labels = ['Young', 'Middle', 'Old']\n",
    "\n",
    "# Perform binning\n",
    "df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True)  # right=True means inclusive on the right"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "0900e0db",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Income (€)</th>\n",
       "      <th>Vehicle</th>\n",
       "      <th>Kids</th>\n",
       "      <th>Residence</th>\n",
       "      <th>Age Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>32</td>\n",
       "      <td>95000</td>\n",
       "      <td>none</td>\n",
       "      <td>no</td>\n",
       "      <td>downtown</td>\n",
       "      <td>Young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>210000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>downtown</td>\n",
       "      <td>Old</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>75000</td>\n",
       "      <td>truck</td>\n",
       "      <td>yes</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>Young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36</td>\n",
       "      <td>30000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>Middle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>29</td>\n",
       "      <td>55000</td>\n",
       "      <td>none</td>\n",
       "      <td>no</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>Young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>54</td>\n",
       "      <td>430000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>downtown</td>\n",
       "      <td>Old</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Age  Income (€) Vehicle Kids Residence Age Group\n",
       "0   32       95000    none   no  downtown     Young\n",
       "1   46      210000     car  yes  downtown       Old\n",
       "2   25       75000   truck  yes   suburbs     Young\n",
       "3   36       30000     car  yes   suburbs    Middle\n",
       "4   29       55000    none   no   suburbs     Young\n",
       "5   54      430000     car  yes  downtown       Old"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f620b82d",
   "metadata": {},
   "source": [
    "This approach allows for clear and meaningful categorization of ages, which can be very useful for analysis or as a feature in machine learning models where age is an important factor.\n",
    "\n",
    "### 4.2. Equal Width Binning\n",
    "\n",
    "Equal width binning divides the range of values of a feature into bins with equal width.\n",
    "\n",
    "Usually, we specify the number of bins as a hyper-parameter $K$, and then compute the width of each bin as\n",
    "\n",
    "$$\n",
    "\\begin{equation}\n",
    "    w = \\Big[\\frac{max^{(j)} - min^{(j)}}{K}\\Big]\n",
    "\\end{equation}\n",
    "$$\n",
    "where \n",
    "$max^{(j)}$ and $min^{(j)}$ are the $j^{th}$ feature's maximum and minimum values, respectively.\n",
    "\n",
    "\n",
    "The ranges of the $K$ bins are then\n",
    "$$\n",
    "\\begin{equation}\n",
    "\\begin{split}\n",
    "    Bin \\ 1&: [min, \\ min + w - 1] \\\\\n",
    "    Bin \\ 2&: [min+w, \\ min + 2\\cdot w - 1] \\\\\n",
    "    ... \\\\\n",
    "    Bin \\ K&: [min + (K-1)\\cdot w, \\ max]\n",
    "    \\label{eq:equal_width_binning}\n",
    "\\end{split}\n",
    "\\end{equation}\n",
    "$$\n",
    "\n",
    "As an example of equal width binning, consider splitting the Age feature in the Amsterdam demographics dataset into $K=3$ bins.\n",
    "\n",
    "The bin's width is:\n",
    "$$\n",
    "\\begin{equation*}\n",
    "    w = \\Big[\\frac{max-min}{x}\\Big] = \\Big[\\frac{54-25}{3}\\Big] = 9.7 \\approx 10\n",
    "\\end{equation*}\n",
    "$$\n",
    "\n",
    "which we rounded to the nearest integer because Age values are always integers (in this dataset).\n",
    "\n",
    "To implement equal width binning in Python and calculate each bin's range for the \"Age\" feature of the Amsterdam demographics dataset using $K=3$ bins, we can use the `numpy` library to help with calculations and then use `pandas` for binning. Here's how you can perform this task:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "1b109d0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of bins\n",
    "K = 3\n",
    "\n",
    "# Calculate the width of each bin\n",
    "min_age = df['Age'].min()\n",
    "max_age = df['Age'].max()\n",
    "width = (max_age - min_age) // K"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "4526a8b6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[25.         34.66666667 44.33333333 54.        ]\n"
     ]
    }
   ],
   "source": [
    "# Define bins using calculated width\n",
    "bins = np.linspace(min_age, max_age, num=K+1)\n",
    "\n",
    "print(bins)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "e74a40e4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Bin 1', 'Bin 2', 'Bin 3']\n"
     ]
    }
   ],
   "source": [
    "# Create bin labels\n",
    "labels = [f'Bin {i+1}' for i in range(K)]\n",
    "\n",
    "print(labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "63bc22fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform binning\n",
    "df['Age Bin'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True, right=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "0f9006aa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Income (€)</th>\n",
       "      <th>Vehicle</th>\n",
       "      <th>Kids</th>\n",
       "      <th>Residence</th>\n",
       "      <th>Age Group</th>\n",
       "      <th>Age Bin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>32</td>\n",
       "      <td>95000</td>\n",
       "      <td>none</td>\n",
       "      <td>no</td>\n",
       "      <td>downtown</td>\n",
       "      <td>Young</td>\n",
       "      <td>Bin 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>46</td>\n",
       "      <td>210000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>downtown</td>\n",
       "      <td>Old</td>\n",
       "      <td>Bin 3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>75000</td>\n",
       "      <td>truck</td>\n",
       "      <td>yes</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>Young</td>\n",
       "      <td>Bin 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36</td>\n",
       "      <td>30000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>Middle</td>\n",
       "      <td>Bin 2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>29</td>\n",
       "      <td>55000</td>\n",
       "      <td>none</td>\n",
       "      <td>no</td>\n",
       "      <td>suburbs</td>\n",
       "      <td>Young</td>\n",
       "      <td>Bin 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>54</td>\n",
       "      <td>430000</td>\n",
       "      <td>car</td>\n",
       "      <td>yes</td>\n",
       "      <td>downtown</td>\n",
       "      <td>Old</td>\n",
       "      <td>Bin 3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Age  Income (€) Vehicle Kids Residence Age Group Age Bin\n",
       "0   32       95000    none   no  downtown     Young   Bin 1\n",
       "1   46      210000     car  yes  downtown       Old   Bin 3\n",
       "2   25       75000   truck  yes   suburbs     Young   Bin 1\n",
       "3   36       30000     car  yes   suburbs    Middle   Bin 2\n",
       "4   29       55000    none   no   suburbs     Young   Bin 1\n",
       "5   54      430000     car  yes  downtown       Old   Bin 3"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "md:myst",
   "text_representation": {
    "extension": ".md",
    "format_name": "myst"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.7"
  },
  "source_map": [
   11,
   29,
   49,
   61,
   72,
   78,
   85,
   87,
   105,
   111,
   116,
   123,
   129,
   131,
   152,
   157,
   167,
   169,
   217,
   228,
   236,
   243,
   249
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 5
}