Filter Methods#
This is supplementary material for the Machine Learning Simplified book. It sheds light on Python implementations of the topics discussed, while all detailed explanations can be found in the book.
I also assume that you know Python syntax and how it works. If you don’t, I highly recommend taking a break and getting introduced to the language before going forward with my code.
This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.
This notebook is a supplement for Chapter 3, Dimensionality Reduction Techniques, of the Machine Learning Simplified book.
1. Required Libraries, Data & Variables#
Let’s import the data and have a look at it:
import pandas as pd
data = pd.read_csv('https://github.com/5x12/themlsbook/raw/master/supplements/data/car_price.csv', delimiter=',', header=0)
data.head()
|   | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
5 rows × 26 columns
data.columns
Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
'price'],
dtype='object')
Let’s define features \(X\) and a target variable \(y\):
data['price'] = data['price'].astype('int') # cast the target variable to integer
X = data[['wheelbase',
'carlength',
'carwidth',
'carheight',
'curbweight',
'enginesize',
'boreratio',
'stroke',
'compressionratio',
'horsepower',
'peakrpm',
'citympg',
'highwaympg']]
y = data['price']
Let’s split the data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
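As a quick sanity check, the split should leave roughly 70% of the rows for training and 30% for testing:
# Confirm the shapes of the resulting train and test sets
print(X_train.shape, X_test.shape)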
2. Filter Methods#
The following filter methods are examined:
Chi-square method
Fisher Score method
ReliefF method
Correlation-based Feature Selection method
2.1. Chi-square#
# Importing required libraries
from sklearn.feature_selection import chi2
# Compute the chi-square statistic and p-value between each feature and the target
chi = chi2(X_train, y_train)
chi
(array([5.08315044e+01, 1.11205757e+02, 1.00159576e+01, 1.66003574e+01,
1.42430375e+04, 1.87890909e+03, 3.04460495e+00, 4.27081156e+00,
2.02528346e+02, 2.31340296e+03, 5.77758862e+03, 2.34366122e+02,
2.09407540e+02]),
array([1.00000000e+000, 9.33440717e-001, 1.00000000e+000, 1.00000000e+000,
0.00000000e+000, 1.20242844e-304, 1.00000000e+000, 1.00000000e+000,
1.51419631e-004, 0.00000000e+000, 0.00000000e+000, 2.47290251e-007,
4.24387135e-005]))
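The chi2 function returns a tuple of two arrays: the chi-square statistics and the corresponding p-values, one entry per feature (note that chi2 requires non-negative feature values, which holds here). As a sketch, both can be viewed side by side in a single frame:
# Pair each feature with its chi-square statistic and p-value
chi_df = pd.DataFrame({'chi2_statistic': chi[0], 'p_value': chi[1]}, index = X_train.columns)
chi_df.sort_values('p_value')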
# Create a series with feature labels and their p-values
chi_features = pd.Series(chi[1], index = X_train.columns) # create a series with feature labels and their corresponding p-values
chi_features.sort_values(ascending = True, inplace = True) # sort series by p-values
# Return features with p-values
chi_features
curbweight 0.000000e+00
horsepower 0.000000e+00
peakrpm 0.000000e+00
enginesize 1.202428e-304
citympg 2.472903e-07
highwaympg 4.243871e-05
compressionratio 1.514196e-04
carlength 9.334407e-01
wheelbase 1.000000e+00
carwidth 1.000000e+00
carheight 1.000000e+00
boreratio 1.000000e+00
stroke 1.000000e+00
dtype: float64
# Print the 4 best features (lowest p-values)
chi_features[:4]
curbweight 0.000000e+00
horsepower 0.000000e+00
peakrpm 0.000000e+00
enginesize 1.202428e-304
dtype: float64
# Print features whose p-value < 0.05
for feature_name, feature_score in zip(X.columns,chi[1]):
if feature_score<0.05:
print(feature_name, '\t', feature_score)
curbweight 0.0
enginesize 1.2024284431006599e-304
compressionratio 0.00015141963086236825
horsepower 0.0
peakrpm 0.0
citympg 2.4729025138749586e-07
highwaympg 4.243871349461334e-05
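Equivalently, scikit-learn’s SelectKBest wraps a scoring function such as chi2 and keeps the k features with the highest scores; a minimal sketch:
from sklearn.feature_selection import SelectKBest

# Keep the 4 features with the highest chi-square statistics
skb = SelectKBest(chi2, k=4)
skb.fit(X_train, y_train)
print(X_train.columns[skb.get_support()].tolist())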
2.2. Fisher Score#
# Importing required libraries
# ! pip install skfeature-chappers (one pip-installable distribution of skfeature)
from skfeature.function.similarity_based import fisher_score
# Rank features with the Fisher Score method
score = fisher_score.fisher_score(X_train.values, y_train.values)
score
array([ 0, 8, 7, 10, 12, 3, 1, 2, 11, 5, 9, 6, 4])
# Create a series with feature labels and their Fisher scores
f_values = pd.Series(score, index = X_train.columns) # create a series with feature labels and their corresponding fisher scores
f_values.sort_values(ascending = True, inplace = True) # sort series by fisher score
f_values
wheelbase 0
boreratio 1
stroke 2
enginesize 3
highwaympg 4
horsepower 5
citympg 6
carwidth 7
carlength 8
peakrpm 9
carheight 10
compressionratio 11
curbweight 12
dtype: int64
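Note that the values returned here form a permutation of 0 through 12, i.e. a ranking rather than raw Fisher scores (which of the two you get depends on the installed skfeature version, so check its documentation). Assuming a larger value means a more important feature (consistent with curbweight also topping the ReliefF and correlation results below), a sketch to pull out the top-k feature names:
# Hypothetical cutoff: keep the names of the 4 highest-ranked features
fisher_top = f_values.sort_values(ascending = False)[:4].index.tolist()
print(fisher_top)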
2.3. ReliefF#
# Importing required libraries
# ! pip install ReliefF
from ReliefF import ReliefF
# Set ReliefF method
fs = ReliefF(n_neighbors=1, n_features_to_keep=4)
# Perform ReliefF by fitting X and y values
fs.fit_transform(X_train.values, y_train.values)
# Make a ranking list with feature scores
relief_values = pd.Series(fs.feature_scores, index = X_train.columns) # create a series with feature labels and their corresponding ReliefF scores
relief_values.sort_values(ascending = True, inplace = True) # sort series by ReliefF score
relief_values
peakrpm -105.0
boreratio -21.0
stroke -15.0
enginesize -13.0
compressionratio -9.0
horsepower -5.0
wheelbase 3.0
carwidth 9.0
highwaympg 13.0
citympg 17.0
carlength 19.0
carheight 29.0
curbweight 109.0
dtype: float64
When using the original Relief or ReliefF, it has been suggested that features yielding a negative score can be confidently filtered out. Here, the feature \(horsepower\) receives a negative score, which would imply it is redundant. Yet common sense tells us that horsepower is one of the strongest parameters affecting the price of a car. That is why you should be careful when applying this feature selection technique. The best way out is to try several feature selection methods and look for a general pattern.
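Following that rule of thumb (with the caveat above in mind), a minimal sketch that drops every feature with a negative ReliefF score:
# Keep only the features with non-negative ReliefF scores
non_negative_features = relief_values[relief_values >= 0].index.tolist()
print(non_negative_features)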
# Keep and print features with a ReliefF score above 15
relief_features = []
for feature_name, feature_score in zip(X.columns,fs.feature_scores):
if feature_score>15:
relief_features.append(feature_name)
print(feature_name, '\t', feature_score)
carlength 19.0
carheight 29.0
curbweight 109.0
citympg 17.0
# Selected features that satisfy criteria
relief_features
['carlength', 'carheight', 'curbweight', 'citympg']
2.4. Correlation-based Feature Selection#
# Correlation with the output variable
cor = data[['wheelbase',
'carlength',
'carwidth',
'carheight',
'curbweight',
'enginesize',
'boreratio',
'stroke',
'compressionratio',
'horsepower',
'peakrpm',
'citympg',
'highwaympg',
'price']].corr()
cor_target = abs(cor['price'])
# Select features whose absolute correlation with price is above 0.8 ([:-1] drops price itself)
relevant_features = cor_target[:-1][cor_target>0.8]
relevant_features
curbweight 0.835305
enginesize 0.874145
horsepower 0.808138
Name: price, dtype: float64
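Correlation with the target alone does not account for redundancy among the selected features themselves, which full correlation-based feature selection also considers. A quick sketch to inspect their pairwise correlations:
# Pairwise correlation among the selected features; highly
# inter-correlated pairs are candidates for further pruning
data[relevant_features.index].corr()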
3. Comparing Four Methods#
print('The features selected by chi-square are: \n \n {} \n \n \n '
      'The features selected by f_values are: \n \n {} \n \n \n '
      'The features selected by ReliefF are: \n \n {} \n \n \n '
      'The features selected by Correlation-based feature selection method are: \n \n {}'
      .format(chi_features, f_values, relief_features, relevant_features))
The features selected by chi-square are:
curbweight 0.000000e+00
horsepower 0.000000e+00
peakrpm 0.000000e+00
enginesize 1.202428e-304
citympg 2.472903e-07
highwaympg 4.243871e-05
compressionratio 1.514196e-04
carlength 9.334407e-01
wheelbase 1.000000e+00
carwidth 1.000000e+00
carheight 1.000000e+00
boreratio 1.000000e+00
stroke 1.000000e+00
dtype: float64
The features selected by f_values are:
wheelbase 0
boreratio 1
stroke 2
enginesize 3
highwaympg 4
horsepower 5
citympg 6
carwidth 7
carlength 8
peakrpm 9
carheight 10
compressionratio 11
curbweight 12
dtype: int64
The features selected by ReliefF are:
['carlength', 'carheight', 'curbweight', 'citympg']
The features selected by Correlation-based feature selection method are:
curbweight 0.835305
enginesize 0.874145
horsepower 0.808138
Name: price, dtype: float64
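As suggested earlier, the safest route is to look for agreement across methods. A sketch that counts, for each feature, how many of the threshold-based rules above voted for it (the Fisher output is a ranking with no natural cutoff, so it is left out here):
from collections import Counter

# Features chosen by each rule above
chi_selected = set(chi_features[chi_features < 0.05].index) # p-value < 0.05
relief_selected = set(relief_features)                      # ReliefF score > 15
corr_selected = set(relevant_features.index)                # |correlation with price| > 0.8

votes = Counter()
for selected in (chi_selected, relief_selected, corr_selected):
    votes.update(selected)

# Features confirmed by at least two of the three rules
print([feature for feature, count in votes.items() if count >= 2])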