Data Augmentation
This is supplementary material for the Machine Learning Simplified book. It sheds light on the Python implementations of the topics discussed; all detailed explanations can be found in the book.
I also assume that you know Python syntax and how it works. If you don't, I highly recommend taking a break to get introduced to the language before going forward with my code.
This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.
This notebook is a supplement to Chapter 7, Data Preparation, of the Machine Learning For Everyone book.
1. Required Libraries, Data & Variables
Let’s import needed libraries:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
Imagine you obtained the following dataset from Silver Suchs Bank. It contains 55 observations of bank transactions over a certain period of time. The target column \(Status\) has two classes: \(Fraud\) for fraudulent transactions and \(Legit\) for legal ones. Out of the 55 observations, 50 are legal transactions (class \(Legit\)) and only 5 are fraudulent (class \(Fraud\)), so the two classes are imbalanced.
# Data for 55 transactions, out of which 5 are Fraud class
data = {
    "#": range(1, 56),
    "date": [
        "21/08/2020", "24/12/2020", "10/04/2020", "13/03/2020", "08/10/2020", "02/04/2020",
        "15/05/2020", "18/07/2020", "20/06/2020", "22/08/2020", "27/11/2020", "30/01/2020",
        "14/02/2020", "17/04/2020", "19/06/2020", "21/08/2020", "26/12/2020", "29/02/2020",
        "12/03/2020", "15/05/2020", "17/07/2020", "19/09/2020", "23/10/2020", "25/12/2020",
        "28/02/2020", "10/01/2020", "13/03/2020", "15/05/2020", "17/07/2020", "19/09/2020",
        "22/11/2020", "24/01/2020", "27/03/2020", "29/05/2020", "31/07/2020", "02/10/2020",
        "04/12/2020", "06/02/2020", "09/04/2020", "11/06/2020", "13/08/2020", "16/10/2020",
        "18/12/2020", "20/02/2020", "23/04/2020", "25/06/2020", "27/08/2020", "30/10/2020",
        "02/12/2020", "04/02/2020", "07/04/2020", "09/06/2020", "11/08/2020", "14/10/2020",
        "16/12/2020"
    ],
    "time": [
        "02:00", "05:19", "18:06", "19:01", "15:34", "23:58",
        "00:45", "01:15", "02:30", "03:50", "04:20", "05:45",
        "06:55", "07:25", "08:15", "09:35", "10:10", "11:20",
        "12:05", "13:30", "14:50", "15:40", "16:30", "17:20",
        "18:00", "19:10", "20:05", "21:15", "22:50", "23:30",
        "00:25", "01:35", "02:45", "03:55", "04:50", "05:10",
        "06:25", "07:35", "08:45", "09:55", "10:50", "11:00",
        "12:15", "13:25", "14:35", "15:45", "16:40", "17:50",
        "18:05", "19:15", "20:25", "21:35", "22:45", "23:55",
        "00:05"
    ],
    "location": [
        "Amsterdam", "Dusseldorf", "Berlin", "Belgium", "Paris", "Amsterdam",
        "Dusseldorf", "Berlin", "Belgium", "Paris", "Amsterdam", "Dusseldorf",
        "Berlin", "Belgium", "Paris", "Amsterdam", "Dusseldorf", "Berlin",
        "Belgium", "Paris", "Amsterdam", "Dusseldorf", "Berlin", "Belgium",
        "Paris", "Amsterdam", "Dusseldorf", "Berlin", "Belgium", "Paris",
        "Amsterdam", "Dusseldorf", "Berlin", "Belgium", "Paris", "Amsterdam",
        "Dusseldorf", "Berlin", "Belgium", "Paris", "Amsterdam", "Dusseldorf",
        "Berlin", "Belgium", "Paris", "Amsterdam", "Dusseldorf", "Berlin",
        "Belgium", "Paris", "Amsterdam", "Dusseldorf", "Berlin", "Belgium",
        "Paris"
    ],
    "Status": [
        "Legit", "Fraud", "Legit", "Legit", "Legit", "Fraud",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Fraud", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Fraud",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Fraud", "Legit",
        "Legit", "Legit", "Legit", "Legit", "Legit", "Legit",
        "Legit"
    ]
}
# Create DataFrame
df_bank_transactions = pd.DataFrame(data)
# Display the DataFrame
df_bank_transactions.head(10)
|   | # | date | time | location | Status |
|---|---|------------|-------|------------|-------|
| 0 | 1 | 21/08/2020 | 02:00 | Amsterdam  | Legit |
| 1 | 2 | 24/12/2020 | 05:19 | Dusseldorf | Fraud |
| 2 | 3 | 10/04/2020 | 18:06 | Berlin     | Legit |
| 3 | 4 | 13/03/2020 | 19:01 | Belgium    | Legit |
| 4 | 5 | 08/10/2020 | 15:34 | Paris      | Legit |
| 5 | 6 | 02/04/2020 | 23:58 | Amsterdam  | Fraud |
| 6 | 7 | 15/05/2020 | 00:45 | Dusseldorf | Legit |
| 7 | 8 | 18/07/2020 | 01:15 | Berlin     | Legit |
| 8 | 9 | 20/06/2020 | 02:30 | Belgium    | Legit |
| 9 | 10 | 22/08/2020 | 03:50 | Paris     | Legit |
# Calculate the number of 'Fraud' and 'Legit' observations
status_counts = df_bank_transactions['Status'].value_counts()
print(status_counts)
Status
Legit 50
Fraud 5
Name: count, dtype: int64
The \(Fraud\) class is a minority, as it contains only 5 observations. Imbalanced classes can create problems in ML classification when the difference between the minority and majority classes is significant. When one class has very few observations and another has many, we try to minimize that gap.
One way to do so is the oversampling technique SMOTE (Synthetic Minority Over-sampling Technique).
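At its core, SMOTE creates new minority samples by interpolating between an existing minority point and one of its nearest minority neighbors. The following is a minimal NumPy sketch of that idea, not imblearn's actual implementation; the helper `smote_sketch` and the toy points are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(minority, n_new, k=2):
    """Generate n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))           # pick a random minority point
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)  # distances to all minority points
        neighbors = np.argsort(d)[1:k + 1]        # k nearest, excluding the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(x + gap * (minority[j] - x))
    return np.array(synthetic)

# Five toy "Fraud" points in 2D; generate five synthetic ones
fraud_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
new_points = smote_sketch(fraud_points, n_new=5)
```

Because each synthetic point lies on the line segment between two real minority points, the new samples stay inside the region the minority class already occupies rather than being arbitrary noise.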
# Preprocess the data: Convert categorical variables to numeric
# Encoding 'location' and 'Status' for demonstration
le_location = LabelEncoder()
df_bank_transactions['location_encoded'] = le_location.fit_transform(df_bank_transactions['location'])
le_status = LabelEncoder()
df_bank_transactions['Status_encoded'] = le_status.fit_transform(df_bank_transactions['Status'])
df_bank_transactions.head()
|   | # | date | time | location | Status | location_encoded | Status_encoded |
|---|---|------------|-------|------------|-------|---|---|
| 0 | 1 | 21/08/2020 | 02:00 | Amsterdam  | Legit | 0 | 1 |
| 1 | 2 | 24/12/2020 | 05:19 | Dusseldorf | Fraud | 3 | 0 |
| 2 | 3 | 10/04/2020 | 18:06 | Berlin     | Legit | 2 | 1 |
| 3 | 4 | 13/03/2020 | 19:01 | Belgium    | Legit | 1 | 1 |
| 4 | 5 | 08/10/2020 | 15:34 | Paris      | Legit | 4 | 1 |
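`LabelEncoder` assigns integer codes in alphabetical order of the labels, which is why \(Fraud\) becomes 0 and \(Legit\) becomes 1 above. The learned mapping can always be inspected through the `classes_` attribute; a small self-contained check:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Legit", "Fraud", "Legit", "Legit"])

# classes_ holds the unique labels sorted alphabetically;
# each label's position in classes_ is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'Fraud': 0, 'Legit': 1}
```

Keeping the fitted encoder around matters: the same object is needed later to decode predictions back to the original string labels.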
# Define features and target variable
X = df_bank_transactions[['location_encoded']] # Simplified feature set for demonstration
y = df_bank_transactions['Status_encoded']
# Apply SMOTE to oversample the 'Fraud' class up to 50 observations
smote = SMOTE(sampling_strategy={0: 50}, k_neighbors=4)  # class label 0 is 'Fraud'; k_neighbors=4 because only 5 Fraud samples exist
X_res, y_res = smote.fit_resample(X, y)
# Check the new class distribution
print("New class distribution:", pd.Series(y_res).value_counts())
New class distribution: Status_encoded
1 50
0 50
Name: count, dtype: int64
# Optionally, convert results back to a DataFrame and map encoded values back to original
resampled_data = pd.DataFrame(X_res, columns=['location_encoded'])
resampled_data['Status'] = le_status.inverse_transform(y_res)
resampled_data.tail(10)
|    | location_encoded | Status |
|----|------------------|--------|
| 90 | 2 | Fraud |
| 91 | 0 | Fraud |
| 92 | 3 | Fraud |
| 93 | 1 | Fraud |
| 94 | 3 | Fraud |
| 95 | 0 | Fraud |
| 96 | 3 | Fraud |
| 97 | 2 | Fraud |
| 98 | 1 | Fraud |
| 99 | 3 | Fraud |
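The encoded locations can be mapped back to city names the same way `Status` was decoded, using the fitted location encoder's `inverse_transform`. A self-contained sketch (the encoder is refit here on the five cities so the example runs on its own; the sample codes are taken from the table above):

```python
from sklearn.preprocessing import LabelEncoder

le_location = LabelEncoder()
le_location.fit(["Amsterdam", "Dusseldorf", "Berlin", "Belgium", "Paris"])

# Decode a few resampled location codes back to city names
codes = [2, 0, 3]
decoded = [str(city) for city in le_location.inverse_transform(codes)]
print(decoded)  # ['Berlin', 'Amsterdam', 'Dusseldorf']
```

This round trip is useful for sanity-checking the resampled data: every synthetic row should still decode to one of the five cities seen in the original transactions.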