{ "cells": [ { "cell_type": "markdown", "id": "fda3b385", "metadata": {}, "source": [ "(chapter7_part1)=\n", "\n", "# Data Cleaning\n", "\n", "- This is supplementary material for the [Machine Learning Simplified](https://themlsbook.com) book. It sheds light on Python implementations of the topics discussed, while all detailed explanations can be found in the book. \n", "- I also assume you know Python syntax and how it works. If you don't, I highly recommend taking a break and getting introduced to the language before going forward with my code. \n", "- This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> `.ipynb`) to reproduce the code and play around with it. \n", "\n", "\n", "This notebook is a supplement for *Chapter 7. Data Preparation* of the **Machine Learning Simplified** book.\n", "\n", "## 1. Required Libraries, Data & Variables\n", "\n", "Let's import the data and have a look at it:" ] }, { "cell_type": "code", "execution_count": 1, "id": "acea6065", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CalifornaiSacramento9422907/03/2020243.0
2698CaliforniaLos Angeles9005814/09/2020NaN
31314IowaFort Dodge5050129/02/2020193.0
41314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914-09-2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 Californai Sacramento 94229 07/03/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "4 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14-09-2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "4 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Define the data as a dictionary\n", "data = {\n", " \"Customer ID\": [383, 1997, 698, 1314, 1314, 333, 1996],\n", " \"State\": [\"Pennsylvania\", \"Californai\", \"California\", \"Iowa\", \"Iowa\", \"New York\", \"Washington\"],\n", " \"City\": [\"Drexel Hill\", \"Sacramento\", \"Los Angeles\", \"Fort Dodge\", \"Fort Dodge\", \"Brooklyn\", None],\n", " \"Postal Code\": [19026, 94229, 90058, 50501, 50501, 11249, 98101],\n", " \"Ship Date\": [\"23/08/2020\", \"07/03/2020\", \"14/09/2020\", \"29/02/2020\", \"29/02/2020\", \"14-09-2020\", \"19/05/2020\"],\n", " \"Purchase ($)\": [190, 243, None, 193, 193, 298, 1]\n", "}\n", "\n", "# Convert the dictionary to a DataFrame\n", "df = pd.DataFrame(data)\n", "\n", "# Display the DataFrame\n", "df" ] }, { "cell_type": "markdown", "id": "150ce756", "metadata": {}, "source": [ "The table above contains a hypothetical dirty dataset of online product orders. This dataset has a number of issues, such as incorrect data, missing data, duplicated data, irrelevant data, and improperly formatted data, that make it impossible to apply ML algorithms right away. This section discusses methods that can be used to clean this dataset so that ML algorithms can be applied to it.\n", "\n", "## 2. Data Cleaning\n", "\n", "### 2.1. Incorrect Data\n", "\n", "Datasets may contain data that is clearly incorrect, such as spelling or syntax errors. The data point in the second row of the table has the value “Californai” for its State feature, which is clearly a misspelling of the state “California”. If this mistake were left uncorrected, any ML algorithm built on this dataset would treat the two strings “Californai” and “California” differently. \n", "\n", "How can we identify incorrect data? One way to check whether a particular column has misspelled values is to look at its set of unique values, which is often much smaller than the set of all values itself." ] }, { "cell_type": "code", "execution_count": 2, "id": "7c52b47f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Pennsylvania', 'Californai', 'California', 'Iowa', 'New York',\n", " 'Washington'], dtype=object)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.State.unique()" ] }, { "cell_type": "markdown", "id": "583506d8", "metadata": {}, "source": [ "We can fix the misspelled 'Californai' with the following code:" ] }, { "cell_type": "code", "execution_count": 3, "id": "7b1f1b1a", "metadata": {}, "outputs": [], "source": [ "df['State'] = df['State'].replace('Californai', 'California')" ] }, { "cell_type": "markdown", "id": "c874249b", "metadata": {}, "source": [ "Revisiting the dataframe, we see the problem has been fixed." ] }, { "cell_type": "code", "execution_count": 4, "id": "98f0336b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CaliforniaSacramento9422907/03/2020243.0
2698CaliforniaLos Angeles9005814/09/2020NaN
31314IowaFort Dodge5050129/02/2020193.0
41314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914-09-2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 07/03/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "4 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14-09-2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "4 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "dbaad2cc", "metadata": {}, "source": [ "### 2.2. Improperly Formatted Data\n", "\n", "In some cases, we might have improperly formatted values. For instance, the Ship Date column in the table includes dates in inconsistent formats. We need to standardize the format of all the dates, since an algorithm would treat the date 14-09-2020 and the date 14/09/2020 as two different dates, even though they are the same date written in different formats.\n", "\n", "We can fix these inconsistencies by using `pd.to_datetime` with the `format='mixed'` argument (available in pandas 2.0+) to handle a mixture of date formats dynamically. This is particularly useful when the dates are not consistently in one format. Beware, though, that an ambiguous date such as 07/03/2020 may then be parsed month-first; passing `dayfirst=True` tells pandas to prefer day-first parsing." ] }, { "cell_type": "code", "execution_count": 5, "id": "7ff8543e", "metadata": {}, "outputs": [], "source": [ "# Convert 'Ship Date' to datetime format (ISO format by default)\n", "df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='mixed')" ] }, { "cell_type": "code", "execution_count": 6, "id": "a6c7f89e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill190262020-08-23190.0
11997CaliforniaSacramento942292020-07-03243.0
2698CaliforniaLos Angeles900582020-09-14NaN
31314IowaFort Dodge505012020-02-29193.0
41314IowaFort Dodge505012020-02-29193.0
5333New YorkBrooklyn112492020-09-14298.0
61996WashingtonNone981012020-05-191.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 2020-08-23 \n", "1 1997 California Sacramento 94229 2020-07-03 \n", "2 698 California Los Angeles 90058 2020-09-14 \n", "3 1314 Iowa Fort Dodge 50501 2020-02-29 \n", "4 1314 Iowa Fort Dodge 50501 2020-02-29 \n", "5 333 New York Brooklyn 11249 2020-09-14 \n", "6 1996 Washington None 98101 2020-05-19 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "4 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "7ea87f7a", "metadata": {}, "source": [ "Let's format 'Ship Date' as \"dd/mm/yyyy\" after the conversion. Note that `dt.strftime` returns strings, so the column loses its datetime dtype; keep the datetime version if you still need date arithmetic." ] }, { "cell_type": "code", "execution_count": 7, "id": "8790fd12", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CaliforniaSacramento9422903/07/2020243.0
2698CaliforniaLos Angeles9005814/09/2020NaN
31314IowaFort Dodge5050129/02/2020193.0
41314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914/09/2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 03/07/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "4 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14/09/2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "4 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Ship Date'] = df['Ship Date'].dt.strftime('%d/%m/%Y')\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "ae95a03c", "metadata": {}, "source": [ "### 2.3. Duplicated Data\n", "\n", "Duplicated data is another common problem that arises in practice. For example, the dataset above contains a duplicate observation: the rows with index 3 and 4 are identical. Duplicated data effectively doubles the weight that an ML algorithm gives to the data point, incorrectly prioritizing some data points over others, which can lead to a poor model. \n", "\n", "One of the ways to fix this is to use the `drop_duplicates()` method in pandas:" ] }, { "cell_type": "code", "execution_count": 8, "id": "8a897543", "metadata": {}, "outputs": [], "source": [ "df_cleaned = df.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 9, "id": "60ab87f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CaliforniaSacramento9422903/07/2020243.0
2698CaliforniaLos Angeles9005814/09/2020NaN
31314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914/09/2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 03/07/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14/09/2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 NaN \n", "3 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_cleaned" ] }, { "cell_type": "markdown", "id": "d843db4c", "metadata": {}, "source": [ "### 2.4. Missing Data\n", "\n", "Missing data arises for a variety of reasons. For example, if the data is entered by a human being, they may have forgotten to input one or more values. Alternatively, data may be missing because it is genuinely unknown or unmeasured, such as a set of survey questions that were answered by some, but not all, customers. In our running example, missing values occur in row three (Purchase column) and row six (City column).\n", "\n", "#### 2.4.1. Missing Value in Purchase Column\n", "A missing value in the Purchase column cannot be determined exactly, so we need to make an educated guess at its value. For example, to impute a missing product order, we can take the median order total." ] }, { "cell_type": "code", "execution_count": 10, "id": "2d37e298", "metadata": {}, "outputs": [], "source": [ "# Calculate the median of the 'Purchase ($)' column, excluding NaN values\n", "median_purchase = df_cleaned['Purchase ($)'].median()\n", "\n", "# Impute missing values in the 'Purchase ($)' column with the median.\n", "# Assigning the result back (rather than calling fillna with inplace=True)\n", "# avoids pandas' chained-assignment warnings.\n", "df_cleaned['Purchase ($)'] = df_cleaned['Purchase ($)'].fillna(median_purchase)" ] }, { "cell_type": "code", "execution_count": 11, "id": "52d1337f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)
0383PennsylvaniaDrexel Hill1902623/08/2020190.0
11997CaliforniaSacramento9422903/07/2020243.0
2698CaliforniaLos Angeles9005814/09/2020193.0
31314IowaFort Dodge5050129/02/2020193.0
5333New YorkBrooklyn1124914/09/2020298.0
61996WashingtonNone9810119/05/20201.0
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 03/07/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "5 333 New York Brooklyn 11249 14/09/2020 \n", "6 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) \n", "0 190.0 \n", "1 243.0 \n", "2 193.0 \n", "3 193.0 \n", "5 298.0 \n", "6 1.0 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_cleaned" ] }, { "cell_type": "markdown", "id": "a032f5dc", "metadata": {}, "source": [ "#### 2.4.2. Missing Value in City Column\n", "Sometimes a missing value can be determined exactly. For example, if the city of an order is missing but we have its postal code, we can determine the city exactly (assuming we have another table that maps postal codes to cities) and fill it into the table.\n", "\n", "To fill in missing values in the 'City' column using the postal code, we can use the `pandas` library to merge the two dataframes. Here's how to perform this operation:\n", "\n", "- Create a mapping table: this table maps postal codes to their corresponding cities.\n", "- Merge this mapping table with the main data table: the merge operation uses the postal code as a key.\n", "- Update the 'City' column in the main table: if the 'City' is missing but can be found through the mapping table, update it accordingly."
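The three steps above can be sketched end-to-end in a few lines. This is a minimal, hypothetical example: the postal-code/city pairs and the `df_main`/`df_map` names are made up for illustration, not taken from the book.

```python
import pandas as pd

# Hypothetical main table with one missing city (values are illustrative)
df_main = pd.DataFrame({
    "Postal Code": [19026, 98101],
    "City": ["Drexel Hill", None],
})

# Step 1: a mapping table from postal codes to cities
df_map = pd.DataFrame({
    "Postal Code": [19026, 98101],
    "City": ["Drexel Hill", "Seattle"],
})

# Step 2: left-merge on the postal code key; the mapped city lands in 'City_mapped'
df_filled = df_main.merge(df_map, on="Postal Code", how="left", suffixes=("", "_mapped"))

# Step 3: fill missing cities from the mapped column, then drop the helper column
df_filled["City"] = df_filled["City"].fillna(df_filled["City_mapped"])
df_filled = df_filled.drop(columns="City_mapped")
```

With real data you would also decide what to do when the mapping table itself has no entry for a given postal code, in which case the merged value simply stays `NaN`.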
] }, { "cell_type": "code", "execution_count": 12, "id": "88efdeb0", "metadata": {}, "outputs": [], "source": [ "# Mapping table that relates Postal Codes to Cities\n", "zip_to_city = {\n", " \"Postal Code\": [94229, 50501, 98101, 11249, 90058, 19026],\n", " \"City\": [\"Sacramento\", \"Fort Dodge\", \"Seattle\", \"Brooklyn\", \"Los Angeles\", \"Drexel Hill\"]\n", "}\n", "\n", "df_mapping = pd.DataFrame(zip_to_city)" ] }, { "cell_type": "code", "execution_count": 13, "id": "139ab89c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Postal CodeCity
094229Sacramento
150501Fort Dodge
298101Seattle
311249Brooklyn
490058Los Angeles
519026Drexel Hill
\n", "
" ], "text/plain": [ " Postal Code City\n", "0 94229 Sacramento\n", "1 50501 Fort Dodge\n", "2 98101 Seattle\n", "3 11249 Brooklyn\n", "4 90058 Los Angeles\n", "5 19026 Drexel Hill" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_mapping" ] }, { "cell_type": "code", "execution_count": 14, "id": "1075ea5f", "metadata": {}, "outputs": [], "source": [ "# Merge the main DataFrame with the mapping DataFrame on 'Postal Code'\n", "df_merged = df_cleaned.merge(df_mapping, on=\"Postal Code\", how=\"left\", suffixes=('', '_mapped'))" ] }, { "cell_type": "code", "execution_count": 15, "id": "40a82cce", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDStateCityPostal CodeShip DatePurchase ($)City_mapped
0383PennsylvaniaDrexel Hill1902623/08/2020190.0Drexel Hill
11997CaliforniaSacramento9422903/07/2020243.0Sacramento
2698CaliforniaLos Angeles9005814/09/2020193.0Los Angeles
31314IowaFort Dodge5050129/02/2020193.0Fort Dodge
4333New YorkBrooklyn1124914/09/2020298.0Brooklyn
51996WashingtonNone9810119/05/20201.0Seattle
\n", "
" ], "text/plain": [ " Customer ID State City Postal Code Ship Date \\\n", "0 383 Pennsylvania Drexel Hill 19026 23/08/2020 \n", "1 1997 California Sacramento 94229 03/07/2020 \n", "2 698 California Los Angeles 90058 14/09/2020 \n", "3 1314 Iowa Fort Dodge 50501 29/02/2020 \n", "4 333 New York Brooklyn 11249 14/09/2020 \n", "5 1996 Washington None 98101 19/05/2020 \n", "\n", " Purchase ($) City_mapped \n", "0 190.0 Drexel Hill \n", "1 243.0 Sacramento \n", "2 193.0 Los Angeles \n", "3 193.0 Fort Dodge \n", "4 298.0 Brooklyn \n", "5 1.0 Seattle " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_merged" ] }, { "cell_type": "markdown", "id": "f3bd0214", "metadata": {}, "source": [ "### 2.5. Outliers\n", "\n", "An outlier is an observation that differs significantly from other observations. Outliers may be problematic for one of two reasons: \n", "- first, an outlier may simply not be representative of the data that we will see at test time (in a new dataset); \n", "- second, many ML algorithms are sensitive to severe outliers and often learn models that focus too heavily on them, consequently making poor predictions on the rest of the data points. \n", "\n", "There are no hard and fast rules about how to classify a point as an outlier or whether to remove it from the dataset. Usually, you will build ML models several times, both with and without outliers, and with different methods of outlier categorization. This subsection shows a statistical test that is commonly used in practice to flag outliers.\n", "\n", "How can we use statistical metrics to determine whether a data point is an outlier? The simplest way is to check whether it lies too far away from the average value."
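The mean-and-standard-deviation check described above can be wrapped in a small helper function. This is a sketch: the function name `flag_outliers` and the default `k = 3` are my own choices, not the book's.

```python
import pandas as pd

def flag_outliers(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Return a boolean mask marking points more than k standard deviations from the mean."""
    mean = values.mean()
    std = values.std()
    lower, upper = mean - k * std, mean + k * std
    return (values < lower) | (values > upper)

# The cleaned purchase amounts from the running example
purchases = pd.Series([190.0, 243.0, 193.0, 193.0, 298.0, 1.0])
outlier_mask = flag_outliers(purchases)  # with k = 3, no purchase is flagged
```

A smaller `k` (say, 2) makes the rule stricter; in practice you would compare models built with several choices of `k`.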
] }, { "cell_type": "code", "execution_count": 16, "id": "c6aca4d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean = 186.33333333333334, Standard Deviation = 100.13124720419029\n" ] } ], "source": [ "mean_value = df_cleaned['Purchase ($)'].mean()\n", "std_dev = df_cleaned['Purchase ($)'].std()\n", "\n", "print(f\"Mean = {mean_value}, Standard Deviation = {std_dev}\")" ] }, { "cell_type": "markdown", "id": "4a1f7a3e", "metadata": {}, "source": [ "Suppose we set a range of acceptable values of k = 3 standard deviations. Then:" ] }, { "cell_type": "code", "execution_count": 17, "id": "3db8d15a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Anything below -114.06040827923752 and above 486.7270749459042 is considered an outlier.\n" ] } ], "source": [ "# Find outliers based on the defined threshold\n", "k = 3\n", "\n", "lower_bound = mean_value - k * std_dev\n", "upper_bound = mean_value + k * std_dev\n", "\n", "print(f\"Anything below {lower_bound} and above {upper_bound} is considered an outlier.\")" ] }, { "cell_type": "markdown", "id": "d8db0ebb", "metadata": {}, "source": [ "But since a purchase cannot have a negative amount, in practice only values above 486.73 would count as outliers. In this dataset, no outliers are present." ] } ], "metadata": { "jupytext": { "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" }, "source_map": [ 11, 29, 47, 62, 64, 69, 71, 76, 78, 89, 95, 97, 102, 106, 116, 121, 123, 135, 144, 146, 160, 171, 176, 182, 184, 198, 203, 208, 216 ] }, "nbformat": 4, "nbformat_minor": 5 }