{ "cells": [ { "cell_type": "markdown", "id": "fda3b385", "metadata": {}, "source": [ "(chapter7_part1)=\n", "\n", "# Data Cleaning\n", "\n", "- This is supplementary material for the [Machine Learning Simplified](https://themlsbook.com) book. It sheds light on Python implementations of the topics discussed, while all detailed explanations can be found in the book. \n", "- I also assume you know Python syntax and how it works. If you don't, I highly recommend taking a break and getting introduced to the language before going forward with my code. \n", "- This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> `.ipynb`) to reproduce the code and play around with it. \n", "\n", "\n", "This notebook is a supplement for *Chapter 7. Data Preparation* of the **Machine Learning Simplified** book.\n", "\n", "## 1. Required Libraries, Data & Variables\n", "\n", "Let's import the data and have a look at it:" ] }, { "cell_type": "code", "execution_count": 1, "id": "acea6065", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CalifornaiSacramento9422907/03/2020243.0
2698CaliforniaLos Angeles9005814/09/2020NaN
31314IowaFort Dodge5050129/02/2020193.0
41314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914-09-2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 Californai Sacramento 94229 07/03/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "4 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14-09-2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "4 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Define the data as a dictionary\n", "data = {\n", " \"Customer ID\": [383, 1997, 698, 1314, 1314, 333, 1996],\n", " \"State\": [\"Pennsylvania\", \"Californai\", \"California\", \"Iowa\", \"Iowa\", \"New York\", \"Washington\"],\n", " \"City\": [\"Drexel Hill\", \"Sacramento\", \"Los Angeles\", \"Fort Dodge\", \"Fort Dodge\", \"Brooklyn\", None],\n", " \"Postal Code\": [19026, 94229, 90058, 50501, 50501, 11249, 98101],\n", " \"Ship Date\": [\"23/08/2020\", \"07/03/2020\", \"14/09/2020\", \"29/02/2020\", \"29/02/2020\", \"14-09-2020\", \"19/05/2020\"],\n", " \"Purchase ($)\": [190, 243, None, 193, 193, 298, 1]\n", "}\n", "\n", "# Convert the dictionary to a DataFrame\n", "df = pd.DataFrame(data)\n", "\n", "# Display the DataFrame\n", "df" ] }, { "cell_type": "markdown", "id": "150ce756", "metadata": {}, "source": [ "The table above contains a hypothetical dirty dataset of online product orders. This dataset has a number of issues, such as incorrect data, missing data, duplicated data, irrelevant data, and improperly formatted data, that make it impossible to apply ML algorithms right away. This section discusses methods that can be used to clean this dataset so that ML algorithms can be applied to it.\n", "\n", "## 2. Data Cleaning\n", "\n", "### 2.1. Incorrect Data\n", "\n", "Datasets may contain data that is clearly incorrect, such as spelling or syntax errors. The data point in the second row of the table has the value “Californai” for its State feature, which is clearly a misspelling of the state “California”. If this mistake were left uncorrected, any ML algorithm built on this dataset would treat the two strings “Californai” and “California” differently. \n", "\n", "How can we identify incorrect data? One way to check whether a particular column has misspelled values is to look at its set of unique values, which is often much smaller than the set of all values itself." ] }, { "cell_type": "code", "execution_count": 2, "id": "7c52b47f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Pennsylvania', 'Californai', 'California', 'Iowa', 'New York',\n", " 'Washington'], dtype=object)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.State.unique()" ] }, { "cell_type": "markdown", "id": "583506d8", "metadata": {}, "source": [ "We can fix the misspelled 'Californai' with the following code:" ] }, { "cell_type": "code", "execution_count": 3, "id": "7b1f1b1a", "metadata": {}, "outputs": [], "source": [ "df['State'] = df['State'].replace('Californai', 'California')" ] }, { "cell_type": "markdown", "id": "c874249b", "metadata": {}, "source": [ "Revisiting the dataframe, we see the problem has been fixed." ] }, { "cell_type": "code", "execution_count": 4, "id": "98f0336b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CaliforniaSacramento9422907/03/2020243.0
2698CaliforniaLos Angeles9005814/09/2020NaN
31314IowaFort Dodge5050129/02/2020193.0
41314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914-09-2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 07/03/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "4 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14-09-2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "4 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "dbaad2cc", "metadata": {}, "source": [ "### 2.2. Improperly Formatted Data\n", "\n", "In some cases, we might have improperly formatted values. For instance, the Ship Date column in the table includes dates in inconsistent formats. We need to standardize the format of all the dates, since an algorithm would treat the date 14-09-2020 and the date 14/09/2020 as two different dates, even though they are the same date written in different formats.\n", "\n", "We can fix these inconsistencies by using `pd.to_datetime` with the `format='mixed'` argument (available in pandas 2.0+) to handle a mixture of date formats dynamically. This is particularly useful when the dates are not consistently in one format. Beware, though, that an ambiguous date such as 07/03/2020 may then be parsed month-first; passing `dayfirst=True` tells pandas to prefer day-first parsing." ] }, { "cell_type": "code", "execution_count": 5, "id": "7ff8543e", "metadata": {}, "outputs": [], "source": [ "# Convert 'Ship Date' to datetime format (ISO format by default)\n", "df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='mixed')" ] }, { "cell_type": "code", "execution_count": 6, "id": "a6c7f89e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill190262020-08-23190.0
11997CaliforniaSacramento942292020-07-03243.0
2698CaliforniaLos Angeles900582020-09-14NaN
31314IowaFort Dodge505012020-02-29193.0
41314IowaFort Dodge505012020-02-29193.0
5333New YorkBrooklyn112492020-09-14298.0
61996WashingtonNone981012020-05-191.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 2020-08-23 \n", "1 1997 California Sacramento 94229 2020-07-03 \n", "2 698 California Los Angeles 90058 2020-09-14 \n", "3 1314 Iowa Fort Dodge 50501 2020-02-29 \n", "4 1314 Iowa Fort Dodge 50501 2020-02-29 \n", "5 333 New York Brooklyn 11249 2020-09-14 \n", "6 1996 Washington None 98101 2020-05-19 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "4 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "7ea87f7a", "metadata": {}, "source": [ "Let's format 'Ship Date' as \"dd/mm/yyyy\" after the conversion. Note that `dt.strftime` returns strings, so the column loses its datetime dtype; keep the datetime version if you still need date arithmetic." ] }, { "cell_type": "code", "execution_count": 7, "id": "8790fd12", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CaliforniaSacramento9422903/07/2020243.0
2698CaliforniaLos Angeles9005814/09/2020NaN
31314IowaFort Dodge5050129/02/2020193.0
41314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914/09/2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 03/07/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "4 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14/09/2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "4 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Ship Date'] = df['Ship Date'].dt.strftime('%d/%m/%Y')\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "ae95a03c", "metadata": {}, "source": [ "### 2.3. Duplicated Data\n", "\n", "Duplicated data is another common problem that arises in practice. For example, the dataset above contains a duplicate observation: the rows with index 3 and 4 are identical. Duplicated data effectively doubles the weight that an ML algorithm gives to the data point, incorrectly prioritizing some data points over others, which can lead to a poor model. \n", "\n", "One of the ways to fix this is to use the `drop_duplicates()` method in pandas:" ] }, { "cell_type": "code", "execution_count": 8, "id": "8a897543", "metadata": {}, "outputs": [], "source": [ "df_cleaned = df.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 9, "id": "60ab87f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CaliforniaSacramento9422903/07/2020243.0
2698CaliforniaLos Angeles9005814/09/2020NaN
31314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914/09/2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 03/07/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14/09/2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_cleaned" ] }, { "cell_type": "markdown", "id": "d843db4c", "metadata": {}, "source": [ "### 2.4. Missing Data\n", "\n", "Missing data arises for a variety of reasons. For example, if the data is entered by a human being, they may have forgotten to input one or more values. Alternatively, data may be missing because it is genuinely unknown or unmeasured, such as a set of survey questions that were answered by some, but not all, customers. In our running example, missing values occur in row three (Purchase column) and row six (City column).\n", "\n", "#### 2.4.1. Missing Value in Purchase Column\n", "A missing value in the Purchase column cannot be determined exactly, so we need to make an educated guess at its value. For example, to impute a missing product order, we can take the median order total." ] }, { "cell_type": "code", "execution_count": 10, "id": "2d37e298", "metadata": {}, "outputs": [], "source": [ "# Calculate the median of the 'Purchase ($)' column, excluding NaN values\n", "median_purchase = df_cleaned['Purchase ($)'].median()\n", "\n", "# Impute missing values in the 'Purchase ($)' column with the median.\n", "# Assigning the result back (rather than calling fillna with inplace=True)\n", "# avoids pandas' chained-assignment warnings.\n", "df_cleaned['Purchase ($)'] = df_cleaned['Purchase ($)'].fillna(median_purchase)" ] }, { "cell_type": "code", "execution_count": 11, "id": "52d1337f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CaliforniaSacramento9422903/07/2020243.0
2698CaliforniaLos Angeles9005814/09/2020193.0
31314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914/09/2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 03/07/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14/09/2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 193.0 \n", "3 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_cleaned" ] }, { "cell_type": "markdown", "id": "a032f5dc", "metadata": {}, "source": [ "#### 2.4.2. Missing Value in City Column\n", "Sometimes a missing value can be determined exactly. For example, if the city of an order is missing but we have its postal code, we can determine the city exactly (assuming we have another table that maps postal codes to cities) and fill it into the table.\n", "\n", "To fill in missing values in the 'City' column using the postal code, we can use the `pandas` library to merge the two dataframes. Here's how to perform this operation:\n", "\n", "- Create a mapping table: this table maps postal codes to their corresponding cities.\n", "- Merge this mapping table with the main data table: the merge operation uses the postal code as a key.\n", "- Update the 'City' column in the main table: if the 'City' is missing but can be found through the mapping table, update it accordingly."
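The three steps above can be sketched end-to-end in a few lines. This is a minimal, hypothetical example: the postal-code/city pairs and the `df_main`/`df_map` names are made up for illustration, not taken from the book.

```python
import pandas as pd

# Hypothetical main table with one missing city (values are illustrative)
df_main = pd.DataFrame({
    "Postal Code": [19026, 98101],
    "City": ["Drexel Hill", None],
})

# Step 1: a mapping table from postal codes to cities
df_map = pd.DataFrame({
    "Postal Code": [19026, 98101],
    "City": ["Drexel Hill", "Seattle"],
})

# Step 2: left-merge on the postal code key; the mapped city lands in 'City_mapped'
df_filled = df_main.merge(df_map, on="Postal Code", how="left", suffixes=("", "_mapped"))

# Step 3: fill missing cities from the mapped column, then drop the helper column
df_filled["City"] = df_filled["City"].fillna(df_filled["City_mapped"])
df_filled = df_filled.drop(columns="City_mapped")
```

With real data you would also decide what to do when the mapping table itself has no entry for a given postal code, in which case the merged value simply stays `NaN`.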
] }, { "cell_type": "code", "execution_count": 12, "id": "88efdeb0", "metadata": {}, "outputs": [], "source": [ "# Mapping table that relates Postal Codes to Cities\n", "zip_to_city = {\n", " \"Postal Code\": [94229, 50501, 98101, 11249, 90058, 19026],\n", " \"City\": [\"Sacramento\", \"Fort Dodge\", \"Seattle\", \"Brooklyn\", \"Los Angeles\", \"Drexel Hill\"]\n", "}\n", "\n", "df_mapping = pd.DataFrame(zip_to_city)" ] }, { "cell_type": "code", "execution_count": 13, "id": "139ab89c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Postal CodeCity
094229Sacramento
150501Fort Dodge
298101Seattle
311249Brooklyn
490058Los Angeles
519026Drexel Hill
\n", "
" ], "text/plain": [ " Postal Code City\n", "0 94229 Sacramento\n", "1 50501 Fort Dodge\n", "2 98101 Seattle\n", "3 11249 Brooklyn\n", "4 90058 Los Angeles\n", "5 19026 Drexel Hill" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_mapping" ] }, { "cell_type": "code", "execution_count": 14, "id": "1075ea5f", "metadata": {}, "outputs": [], "source": [ "# Merge the main DataFrame with the mapping DataFrame on 'Postal Code'\n", "df_merged = df_cleaned.merge(df_mapping, on=\"Postal Code\", how=\"left\", suffixes=('', '_mapped'))" ] }, { "cell_type": "code", "execution_count": 15, "id": "40a82cce", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)City_mapped
0383PennsylvaniaDrexel Hill1902623/08/2020190.0Drexel Hill
11997CaliforniaSacramento9422903/07/2020243.0Sacramento
2698CaliforniaLos Angeles9005814/09/2020193.0Los Angeles
31314IowaFort Dodge5050129/02/2020193.0Fort Dodge
4333New YorkBrooklyn1124914/09/2020298.0Brooklyn
51996WashingtonNone9810119/05/20201.0Seattle
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 03/07/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "4 333 New York Brooklyn 11249 14/09/2020 \n", "5 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) City_mapped \n", "0 190.0 Drexel Hill \n", "1 243.0 Sacramento \n", "2 193.0 Los Angeles \n", "3 193.0 Fort Dodge \n", "4 298.0 Brooklyn \n", "5 1.0 Seattle " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_merged" ] }, { "cell_type": "markdown", "id": "f3bd0214", "metadata": {}, "source": [ "### 2.5. Outliers\n", "\n", "An outlier is an observation that differs significantly from other observations. Outliers may be problematic for one of two reasons: \n", "- first, an outlier may simply not be representative of the data that we will see at test time (in a new dataset); \n", "- second, many ML algorithms are sensitive to severe outliers and often learn models that focus too heavily on them, consequently making poor predictions on the rest of the data points. \n", "\n", "There are no hard and fast rules about how to classify a point as an outlier or whether to remove it from the dataset. Usually, you will build ML models several times, both with and without outliers, and with different methods of outlier categorization. This subsection shows a statistical test that is commonly used in practice to flag outliers.\n", "\n", "How can we use statistical metrics to determine whether a data point is an outlier? The simplest way is to check whether it lies too far away from the average value."
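The mean-and-standard-deviation check described above can be wrapped in a small helper function. This is a sketch: the function name `flag_outliers` and the default `k = 3` are my own choices, not the book's.

```python
import pandas as pd

def flag_outliers(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Return a boolean mask marking points more than k standard deviations from the mean."""
    mean = values.mean()
    std = values.std()
    lower, upper = mean - k * std, mean + k * std
    return (values < lower) | (values > upper)

# The cleaned purchase amounts from the running example
purchases = pd.Series([190.0, 243.0, 193.0, 193.0, 298.0, 1.0])
outlier_mask = flag_outliers(purchases)  # with k = 3, no purchase is flagged
```

A smaller `k` (say, 2) makes the rule stricter; in practice you would compare models built with several choices of `k`.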
] }, { "cell_type": "code", "execution_count": 16, "id": "c6aca4d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean = 186.33333333333334, Standard Deviation = 100.13124720419029\n" ] } ], "source": [ "mean_value = df_cleaned['Purchase ($)'].mean()\n", "std_dev = df_cleaned['Purchase ($)'].std()\n", "\n", "print(f\"Mean = {mean_value}, Standard Deviation = {std_dev}\")" ] }, { "cell_type": "markdown", "id": "4a1f7a3e", "metadata": {}, "source": [ "Suppose we set a range of acceptable values of k = 3 standard deviations. Then:" ] }, { "cell_type": "code", "execution_count": 17, "id": "3db8d15a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Anything below -114.06040827923752 and above 486.7270749459042 is considered an outlier.\n" ] } ], "source": [ "# Find outliers based on the defined threshold\n", "k = 3\n", "\n", "lower_bound = mean_value - k * std_dev\n", "upper_bound = mean_value + k * std_dev\n", "\n", "print(f\"Anything below {lower_bound} and above {upper_bound} is considered an outlier.\")" ] }, { "cell_type": "markdown", "id": "d8db0ebb", "metadata": {}, "source": [ "But since a purchase cannot have a negative amount, in practice only values above 486.73 would count as outliers. In this dataset, no outliers are present." ] } ], "metadata": { "jupytext": { "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" }, "source_map": [ 11, 29, 47, 62, 64, 69, 71, 76, 78, 89, 95, 97, 102, 106, 116, 121, 123, 135, 144, 146, 160, 171, 176, 182, 184, 198, 203, 208, 216 ] }, "nbformat": 4, "nbformat_minor": 5 }