Data Augmentation#

  • This is supplementary material for the Machine Learning Simplified book. It focuses on the Python implementations of the topics discussed, while the detailed explanations can be found in the book.

  • I also assume you know Python syntax and how it works. If you don’t, I highly recommend taking a break and getting acquainted with the language before going any further with my code.

  • This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.

This notebook is a supplement for Chapter 7 (Data Preparation) of the Machine Learning For Everyone book.

1. Required Libraries, Data & Variables#

Let’s import the required libraries:

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder

Imagine you obtained the following dataset from Silver Suchs Bank. It contains 55 observations of bank transactions over a certain period of time. The target column \(Status\) has two classes: \(Fraud\) for fraudulent transactions and \(Legit\) for legal ones. Out of the 55 observations, 50 are legal transactions (class \(Legit\)) and only 5 are fraudulent (class \(Fraud\)), so the two classes are imbalanced.

# Data for 55 transactions, out of which 5 are Fraud class
data = {
    "#": range(1, 56),
    "date": [
        "21/08/2020", "24/12/2020", "10/04/2020", "13/03/2020", "08/10/2020", "02/04/2020",
        "15/05/2020", "18/07/2020", "20/06/2020", "22/08/2020", "27/11/2020", "30/01/2020",
        "14/02/2020", "17/04/2020", "19/06/2020", "21/08/2020", "26/12/2020", "29/02/2020",
        "12/03/2020", "15/05/2020", "17/07/2020", "19/09/2020", "23/10/2020", "25/12/2020",
        "28/02/2020", "10/01/2020", "13/03/2020", "15/05/2020", "17/07/2020", "19/09/2020",
        "22/11/2020", "24/01/2020", "27/03/2020", "29/05/2020", "31/07/2020", "02/10/2020",
        "04/12/2020", "06/02/2020", "09/04/2020", "11/06/2020", "13/08/2020", "16/10/2020",
        "18/12/2020", "20/02/2020", "23/04/2020", "25/06/2020", "27/08/2020", "30/10/2020",
        "02/12/2020", "04/02/2020", "07/04/2020", "09/06/2020", "11/08/2020", "14/10/2020",
        "16/12/2020"
    ],
    "time": [
        "02:00", "05:19", "18:06", "19:01", "15:34", "23:58",
        "00:45", "01:15", "02:30", "03:50", "04:20", "05:45",
        "06:55", "07:25", "08:15", "09:35", "10:10", "11:20",
        "12:05", "13:30", "14:50", "15:40", "16:30", "17:20",
        "18:00", "19:10", "20:05", "21:15", "22:50", "23:30",
        "00:25", "01:35", "02:45", "03:55", "04:50", "05:10",
        "06:25", "07:35", "08:45", "09:55", "10:50", "11:00",
        "12:15", "13:25", "14:35", "15:45", "16:40", "17:50",
        "18:05", "19:15", "20:25", "21:35", "22:45", "23:55",
        "00:05"
    ],
    "location": [
        "Amsterdam", "Dusseldorf", "Berlin", "Belgium", "Paris", "Amsterdam",
        "Dusseldorf", "Berlin", "Belgium", "Paris", "Amsterdam", "Dusseldorf",
        "Berlin", "Belgium", "Paris", "Amsterdam", "Dusseldorf", "Berlin",
        "Belgium", "Paris", "Amsterdam", "Dusseldorf", "Berlin", "Belgium",
        "Paris", "Amsterdam", "Dusseldorf", "Berlin", "Belgium", "Paris",
        "Amsterdam", "Dusseldorf", "Berlin", "Belgium", "Paris", "Amsterdam",
        "Dusseldorf", "Berlin", "Belgium", "Paris", "Amsterdam", "Dusseldorf",
        "Berlin", "Belgium", "Paris", "Amsterdam", "Dusseldorf", "Berlin",
        "Belgium", "Paris", "Amsterdam", "Dusseldorf", "Berlin", "Belgium",
        "Paris"
    ],
    "Status": [
        "Legit", "Fraud", "Legit", "Legit", "Legit", "Fraud",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Fraud", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Fraud",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Fraud", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit"
    ]
}

# Create DataFrame
df_bank_transactions = pd.DataFrame(data)
# Display the DataFrame
df_bank_transactions.head(10)
    #        date   time    location Status
0   1  21/08/2020  02:00   Amsterdam  Legit
1   2  24/12/2020  05:19  Dusseldorf  Fraud
2   3  10/04/2020  18:06      Berlin  Legit
3   4  13/03/2020  19:01     Belgium  Legit
4   5  08/10/2020  15:34       Paris  Legit
5   6  02/04/2020  23:58   Amsterdam  Fraud
6   7  15/05/2020  00:45  Dusseldorf  Legit
7   8  18/07/2020  01:15      Berlin  Legit
8   9  20/06/2020  02:30     Belgium  Legit
9  10  22/08/2020  03:50       Paris  Legit
# Calculate the number of 'Fraud' and 'Legit' observations
status_counts = df_bank_transactions['Status'].value_counts()

print(status_counts)
Status
Legit    50
Fraud     5
Name: count, dtype: int64

The \(Fraud\) class is the minority class, with only 5 observations. Imbalanced classes can create problems in ML classification when the difference between the minority and majority classes is significant. When one class has very few observations and the other has many, we try to reduce that gap.

One way to do so is the oversampling technique SMOTE (Synthetic Minority Over-sampling Technique).
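
For intuition, the crudest way to close the gap is random oversampling: simply duplicate existing \(Fraud\) rows until the classes match. The sketch below is only an illustration and is not used in the rest of this notebook (the random_state value is an arbitrary choice). SMOTE, applied next, goes further and synthesizes new minority samples by interpolating between neighboring minority points.

# A minimal sketch of naive random oversampling: duplicate Fraud rows
# (sampling with replacement) until both classes have 50 observations
fraud_rows = df_bank_transactions[df_bank_transactions['Status'] == 'Fraud']
legit_rows = df_bank_transactions[df_bank_transactions['Status'] == 'Legit']

fraud_upsampled = fraud_rows.sample(n=len(legit_rows), replace=True, random_state=42)
df_random_oversampled = pd.concat([legit_rows, fraud_upsampled])

print(df_random_oversampled['Status'].value_counts())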

# Preprocess the data: Convert categorical variables to numeric
# Encoding 'location' and 'Status' for demonstration

le_location = LabelEncoder()
df_bank_transactions['location_encoded'] = le_location.fit_transform(df_bank_transactions['location'])

le_status = LabelEncoder()
df_bank_transactions['Status_encoded'] = le_status.fit_transform(df_bank_transactions['Status'])
df_bank_transactions.head()
   #        date   time    location Status  location_encoded  Status_encoded
0  1  21/08/2020  02:00   Amsterdam  Legit                 0               1
1  2  24/12/2020  05:19  Dusseldorf  Fraud                 3               0
2  3  10/04/2020  18:06      Berlin  Legit                 2               1
3  4  13/03/2020  19:01     Belgium  Legit                 1               1
4  5  08/10/2020  15:34       Paris  Legit                 4               1
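
LabelEncoder assigns integer codes in alphabetical order of the labels, so it is worth confirming which code ended up representing each class before configuring SMOTE. A quick check using the encoders fitted above:

# classes_ lists the original labels in the order of their integer codes
print(dict(zip(le_location.classes_, range(len(le_location.classes_)))))
print(dict(zip(le_status.classes_, range(len(le_status.classes_)))))
# 'Fraud' -> 0 and 'Legit' -> 1, which is why 0 is used in sampling_strategy below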
# Define features and target variable
X = df_bank_transactions[['location_encoded']]  # Simplified feature set for demonstration
y = df_bank_transactions['Status_encoded']
# Apply SMOTE to balance the dataset.
# Class label 0 corresponds to 'Fraud'; we request 50 Fraud samples in total.
# k_neighbors=4 is the maximum possible here, because SMOTE needs k_neighbors to be
# smaller than the number of minority-class samples (there are only 5 Fraud rows).
smote = SMOTE(sampling_strategy={0: 50}, k_neighbors=4)
X_res, y_res = smote.fit_resample(X, y)
# Check the new class distribution
print("New class distribution:", pd.Series(y_res).value_counts())
New class distribution: Status_encoded
1    50
0    50
Name: count, dtype: int64
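
SMOTE only adds samples, so X_res contains the 55 original rows followed by the synthetic \(Fraud\) rows that imblearn appends after the originals (which is why tail(10) below shows only \(Fraud\)). A quick sanity check on the counts:

# 55 original rows plus the synthetic Fraud rows created by SMOTE
n_synthetic = len(X_res) - len(X)
print(f"Original rows: {len(X)}, synthetic rows added: {n_synthetic}, total: {len(X_res)}")
Original rows: 55, synthetic rows added: 45, total: 100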
# Optionally, convert results back to a DataFrame and map encoded values back to original
resampled_data = pd.DataFrame(X_res, columns=['location_encoded'])
resampled_data['Status'] = le_status.inverse_transform(y_res)
resampled_data.tail(10)
    location_encoded Status
90                 2  Fraud
91                 0  Fraud
92                 3  Fraud
93                 1  Fraud
94                 3  Fraud
95                 0  Fraud
96                 3  Fraud
97                 2  Fraud
98                 1  Fraud
99                 3  Fraud
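
If you want readable city names next to the resampled rows, the location codes can be mapped back with le_location.inverse_transform. A small caveat for this sketch: SMOTE interpolates numeric features, so synthetic location codes can be fractional and need rounding to the nearest valid integer code first (treating an encoded city as a numeric feature is the simplification this demonstration already makes).

# Map numeric location codes back to city names; round and clip first because
# SMOTE interpolates between neighbours and can produce non-integer codes
codes = resampled_data['location_encoded'].round().astype(int)
codes = codes.clip(0, len(le_location.classes_) - 1)
resampled_data['location'] = le_location.inverse_transform(codes)
resampled_data.tail(5)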