Filter Methods#
This is supplementary material for the Machine Learning Simplified book. It sheds light on Python implementations of the topics discussed, while all detailed explanations can be found in the book.
I also assume that you know Python syntax and how it works. If you don’t, I highly recommend taking a break and getting introduced to the language before going forward with my code.
This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.
This notebook is a supplement for Chapter 3, Dimensionality Reduction Techniques, of the Machine Learning Simplified book.
1. Required Libraries, Data & Variables#
Let’s import the data and have a look at it:
import pandas as pd
data = pd.read_csv('https://github.com/5x12/themlsbook/raw/master/supplements/data/car_price.csv', delimiter=',', header=0)
data.head()
|   | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
5 rows × 26 columns
data.columns
Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
'price'],
dtype='object')
Let’s define features \(X\) and a target variable \(y\):
data['price'] = data['price'].astype('int') # cast the target variable to integer
X = data[['wheelbase',
'carlength',
'carwidth',
'carheight',
'curbweight',
'enginesize',
'boreratio',
'stroke',
'compressionratio',
'horsepower',
'peakrpm',
'citympg',
'highwaympg']]
y = data['price']
Let’s split the data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
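As a quick sanity check, the split should leave roughly 70% of the rows for training and 30% for testing:
# Confirm the shapes of the resulting train and test sets
print(X_train.shape, X_test.shape)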
2. Filter Methods#
The following filter methods are examined:
Chi-square method
Fisher Score method
ReliefF method
Correlation-based Feature Selection method
2.1. Chi-square#
# Importing required libraries
from sklearn.feature_selection import chi2
# Compute the chi-square statistic and p-value between each feature and the target
chi = chi2(X_train, y_train)
chi
(array([5.08315044e+01, 1.11205757e+02, 1.00159576e+01, 1.66003574e+01,
1.42430375e+04, 1.87890909e+03, 3.04460495e+00, 4.27081156e+00,
2.02528346e+02, 2.31340296e+03, 5.77758862e+03, 2.34366122e+02,
2.09407540e+02]),
array([1.00000000e+000, 9.33440717e-001, 1.00000000e+000, 1.00000000e+000,
0.00000000e+000, 1.20242844e-304, 1.00000000e+000, 1.00000000e+000,
1.51419631e-004, 0.00000000e+000, 0.00000000e+000, 2.47290251e-007,
4.24387135e-005]))
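The chi2 function returns a tuple of two arrays: the chi-square statistics and the corresponding p-values, one entry per feature (note that chi2 requires non-negative feature values, which holds here). As a sketch, both can be viewed side by side in a single frame:
# Pair each feature with its chi-square statistic and p-value
chi_df = pd.DataFrame({'chi2_statistic': chi[0], 'p_value': chi[1]}, index = X_train.columns)
chi_df.sort_values('p_value')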
# Create a series with feature labels and their p-values
chi_features = pd.Series(chi[1], index = X_train.columns) # create a series with feature labels and their corresponding p-values
chi_features.sort_values(ascending = True, inplace = True) # sort series by p-values
# Return features with p-values
chi_features
curbweight 0.000000e+00
horsepower 0.000000e+00
peakrpm 0.000000e+00
enginesize 1.202428e-304
citympg 2.472903e-07
highwaympg 4.243871e-05
compressionratio 1.514196e-04
carlength 9.334407e-01
wheelbase 1.000000e+00
carwidth 1.000000e+00
carheight 1.000000e+00
boreratio 1.000000e+00
stroke 1.000000e+00
dtype: float64
# Print the 4 best features (lowest p-values)
chi_features[:4]
curbweight 0.000000e+00
horsepower 0.000000e+00
peakrpm 0.000000e+00
enginesize 1.202428e-304
dtype: float64
# Print features whose p-value < 0.05
for feature_name, feature_score in zip(X.columns,chi[1]):
if feature_score<0.05:
print(feature_name, '\t', feature_score)
curbweight 0.0
enginesize 1.2024284431006599e-304
compressionratio 0.00015141963086236825
horsepower 0.0
peakrpm 0.0
citympg 2.4729025138749586e-07
highwaympg 4.243871349461334e-05
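Equivalently, scikit-learn’s SelectKBest wraps a scoring function such as chi2 and keeps the k features with the highest scores; a minimal sketch:
from sklearn.feature_selection import SelectKBest

# Keep the 4 features with the highest chi-square statistics
skb = SelectKBest(chi2, k=4)
skb.fit(X_train, y_train)
print(X_train.columns[skb.get_support()].tolist())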
2.2. Fisher Score#
# Importing required libraries
# ! pip install skfeature-chappers (one pip-installable distribution of skfeature)
from skfeature.function.similarity_based import fisher_score
# Rank features with the Fisher Score method
score = fisher_score.fisher_score(X_train.values, y_train.values)
score
array([ 0, 8, 7, 10, 12, 3, 1, 2, 11, 5, 9, 6, 4])
# Create a series with feature labels and their Fisher scores
f_values = pd.Series(score, index = X_train.columns) # create a series with feature labels and their corresponding fisher scores
f_values.sort_values(ascending = True, inplace = True) # sort series by fisher score
f_values
wheelbase 0
boreratio 1
stroke 2
enginesize 3
highwaympg 4
horsepower 5
citympg 6
carwidth 7
carlength 8
peakrpm 9
carheight 10
compressionratio 11
curbweight 12
dtype: int64
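Note that the values returned here form a permutation of 0 through 12, i.e. a ranking rather than raw Fisher scores (which of the two you get depends on the installed skfeature version, so check its documentation). Assuming a larger value means a more important feature (consistent with curbweight also topping the ReliefF and correlation results below), a sketch to pull out the top-k feature names:
# Hypothetical cutoff: keep the names of the 4 highest-ranked features
fisher_top = f_values.sort_values(ascending = False)[:4].index.tolist()
print(fisher_top)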
2.3. ReliefF#
# Importing required libraries
# ! pip install ReliefF
from ReliefF import ReliefF
# Set ReliefF method
fs = ReliefF(n_neighbors=1, n_features_to_keep=4)
# Perform ReliefF by fitting X and y values
fs.fit_transform(X_train.values, y_train.values)
# Make a ranking list with feature scores
relief_values = pd.Series(fs.feature_scores, index = X_train.columns) # create a series with feature labels and their corresponding ReliefF scores
relief_values.sort_values(ascending = True, inplace = True) # sort series by ReliefF score
relief_values
peakrpm -105.0
boreratio -21.0
stroke -15.0
enginesize -13.0
compressionratio -9.0
horsepower -5.0
wheelbase 3.0
carwidth 9.0
highwaympg 13.0
citympg 17.0
carlength 19.0
carheight 29.0
curbweight 109.0
dtype: float64
When using the original Relief or ReliefF, it has been suggested that features yielding a negative score can be confidently filtered out. Here, the feature \(horsepower\) receives a negative score, which would imply it is redundant. Yet common sense tells us that horsepower is one of the strongest parameters affecting the price of a car. That is why you should be careful when applying this feature selection technique. The best way out is to try several feature selection methods and look for a general pattern.
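Following that rule of thumb (with the caveat above in mind), a minimal sketch that drops every feature with a negative ReliefF score:
# Keep only the features with non-negative ReliefF scores
non_negative_features = relief_values[relief_values >= 0].index.tolist()
print(non_negative_features)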
# Keep and print features with a ReliefF score above 15
relief_features = []
for feature_name, feature_score in zip(X.columns,fs.feature_scores):
if feature_score>15:
relief_features.append(feature_name)
print(feature_name, '\t', feature_score)
carlength 19.0
carheight 29.0
curbweight 109.0
citympg 17.0
# Selected features that satisfy criteria
relief_features
['carlength', 'carheight', 'curbweight', 'citympg']
2.4. Correlation-based Feature Selection#
# Correlation with the output variable
cor = data[['wheelbase',
'carlength',
'carwidth',
'carheight',
'curbweight',
'enginesize',
'boreratio',
'stroke',
'compressionratio',
'horsepower',
'peakrpm',
'citympg',
'highwaympg',
'price']].corr()
cor_target = abs(cor['price'])
# Select features whose absolute correlation with price is above 0.8 ([:-1] drops price itself)
relevant_features = cor_target[:-1][cor_target>0.8]
relevant_features
curbweight 0.835305
enginesize 0.874145
horsepower 0.808138
Name: price, dtype: float64
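Correlation with the target alone does not account for redundancy among the selected features themselves, which full correlation-based feature selection also considers. A quick sketch to inspect their pairwise correlations:
# Pairwise correlation among the selected features; highly
# inter-correlated pairs are candidates for further pruning
data[relevant_features.index].corr()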
3. Comparing Four Methods#
print('The features selected by chi-square are: \n \n {} \n \n \n '
      'The features selected by f_values are: \n \n {} \n \n \n '
      'The features selected by ReliefF are: \n \n {} \n \n \n '
      'The features selected by Correlation-based feature selection method are: \n \n {}'
      .format(chi_features, f_values, relief_features, relevant_features))
The features selected by chi-square are:
curbweight 0.000000e+00
horsepower 0.000000e+00
peakrpm 0.000000e+00
enginesize 1.202428e-304
citympg 2.472903e-07
highwaympg 4.243871e-05
compressionratio 1.514196e-04
carlength 9.334407e-01
wheelbase 1.000000e+00
carwidth 1.000000e+00
carheight 1.000000e+00
boreratio 1.000000e+00
stroke 1.000000e+00
dtype: float64
The features selected by f_values are:
wheelbase 0
boreratio 1
stroke 2
enginesize 3
highwaympg 4
horsepower 5
citympg 6
carwidth 7
carlength 8
peakrpm 9
carheight 10
compressionratio 11
curbweight 12
dtype: int64
The features selected by ReliefF are:
['carlength', 'carheight', 'curbweight', 'citympg']
The features selected by Correlation-based feature selection method are:
curbweight 0.835305
enginesize 0.874145
horsepower 0.808138
Name: price, dtype: float64
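As suggested earlier, the safest route is to look for agreement across methods. A sketch that counts, for each feature, how many of the threshold-based rules above voted for it (the Fisher output is a ranking with no natural cutoff, so it is left out here):
from collections import Counter

# Features chosen by each rule above
chi_selected = set(chi_features[chi_features < 0.05].index) # p-value < 0.05
relief_selected = set(relief_features)                      # ReliefF score > 15
corr_selected = set(relevant_features.index)                # |correlation with price| > 0.8

votes = Counter()
for selected in (chi_selected, relief_selected, corr_selected):
    votes.update(selected)

# Features confirmed by at least two of the three rules
print([feature for feature, count in votes.items() if count >= 2])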