Filter Methods#

  • This is supplementary material for the Machine Learning Simplified book. It covers the Python implementations of the topics discussed; all detailed explanations can be found in the book.

  • I also assume you know Python syntax and how it works. If you don’t, I highly recommend taking a break and getting acquainted with the language before going through the code.

  • This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.

This notebook is a supplement for Chapter 3, Dimensionality Reduction Techniques, of the Machine Learning Simplified book.

1. Required Libraries, Data & Variables#

Let’s import the data and have a look at it:

import pandas as pd

data = pd.read_csv('https://github.com/5x12/themlsbook/raw/master/supplements/data/car_price.csv', delimiter=',', header=0)
data.head()
car_ID symboling CarName fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase ... enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
0 1 3 alfa-romero giulia gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.0
1 2 3 alfa-romero stelvio gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.0
2 3 1 alfa-romero Quadrifoglio gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.0
3 4 2 audi 100 ls gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.0
4 5 2 audi 100ls gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.0

5 rows × 26 columns

data.columns
Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
       'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
       'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price'],
      dtype='object')

Let’s define features \(X\) and a target variable \(y\):

data['price'] = data['price'].astype('int') # convert the target to integer type

X = data[['wheelbase', 
          'carlength', 
          'carwidth', 
          'carheight', 
          'curbweight', 
          'enginesize', 
          'boreratio', 
          'stroke',
          'compressionratio', 
          'horsepower', 
          'peakrpm', 
          'citympg', 
          'highwaympg']]

y = data['price']

Let’s split the data:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

2. Filter Methods#

The following filter methods are examined:

  1. Chi-square method

  2. Fisher Score method

  3. ReliefF method

  4. Correlation-based Feature Selection method

2.1. Chi-square#

# Importing required libraries
from sklearn.feature_selection import chi2
# Compute chi-square feature selection scores; chi2 returns the tuple
# (chi-square statistics, p-values), one value per feature
chi = chi2(X_train, y_train)
chi
(array([5.08315044e+01, 1.11205757e+02, 1.00159576e+01, 1.66003574e+01,
        1.42430375e+04, 1.87890909e+03, 3.04460495e+00, 4.27081156e+00,
        2.02528346e+02, 2.31340296e+03, 5.77758862e+03, 2.34366122e+02,
        2.09407540e+02]),
 array([1.00000000e+000, 9.33440717e-001, 1.00000000e+000, 1.00000000e+000,
        0.00000000e+000, 1.20242844e-304, 1.00000000e+000, 1.00000000e+000,
        1.51419631e-004, 0.00000000e+000, 0.00000000e+000, 2.47290251e-007,
        4.24387135e-005]))
# Create a series with feature labels and their corresponding p-values
chi_features = pd.Series(chi[1], index = X_train.columns)
chi_features.sort_values(ascending = True, inplace = True) # sort series by p-values
# Return features with p-values
chi_features
curbweight           0.000000e+00
horsepower           0.000000e+00
peakrpm              0.000000e+00
enginesize          1.202428e-304
citympg              2.472903e-07
highwaympg           4.243871e-05
compressionratio     1.514196e-04
carlength            9.334407e-01
wheelbase            1.000000e+00
carwidth             1.000000e+00
carheight            1.000000e+00
boreratio            1.000000e+00
stroke               1.000000e+00
dtype: float64
# Print the 4 best features (lowest p-values)
chi_features[:4]
curbweight     0.000000e+00
horsepower     0.000000e+00
peakrpm        0.000000e+00
enginesize    1.202428e-304
dtype: float64
# Print features whose p-value < 0.05
for feature_name, feature_score in zip(X.columns,chi[1]):
    if feature_score<0.05:
        print(feature_name, '\t', feature_score)
curbweight 	 0.0
enginesize 	 1.2024284431006599e-304
compressionratio 	 0.00015141963086236825
horsepower 	 0.0
peakrpm 	 0.0
citympg 	 2.4729025138749586e-07
highwaympg 	 4.243871349461334e-05
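
The same selection can be done more idiomatically with scikit-learn’s SelectKBest transformer, which wraps chi2 (or any other scoring function) and keeps the k highest-scoring features. A minimal sketch; k=4 mirrors the cell above:

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 4 features with the highest chi-square statistics
# (here this matches the 4 lowest p-values)
selector = SelectKBest(score_func=chi2, k=4)
X_train_chi = selector.fit_transform(X_train, y_train)

# Map the boolean support mask back to feature names
print(X_train.columns[selector.get_support()].tolist())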

2.2. Fisher Score#

# Importing required libraries
from skfeature.function.similarity_based import fisher_score
# Compute the Fisher score of each feature
score = fisher_score.fisher_score(X_train.values, y_train.values)
score
array([ 0,  8,  7, 10, 12,  3,  1,  2, 11,  5,  9,  6,  4])
# Create a series with feature labels and their corresponding Fisher scores
f_values = pd.Series(score, index = X_train.columns)
f_values.sort_values(ascending = True, inplace = True) # sort series by fisher score
f_values
wheelbase            0
boreratio            1
stroke               2
enginesize           3
highwaympg           4
horsepower           5
citympg              6
carwidth             7
carlength            8
peakrpm              9
carheight           10
compressionratio    11
curbweight          12
dtype: int64
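
To turn this ranking into an actual selection, keep the features with the highest values. A minimal sketch, assuming (as in the skfeature convention) that a larger Fisher score means a more discriminative feature:

# Select the 4 features with the largest Fisher scores
fisher_features = f_values.nlargest(4).index.tolist()
print(fisher_features)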

2.3. ReliefF#

# Importing required libraries
# ! pip install ReliefF
from ReliefF import ReliefF
# Set ReliefF method
fs = ReliefF(n_neighbors=1, n_features_to_keep=4)

# Perform ReliefF by fitting X and y values
fs.fit_transform(X_train.values, y_train.values)

# Make a ranking list with feature scores
relief_values = pd.Series(fs.feature_scores, index = X_train.columns) # create a series with feature labels and their corresponding ReliefF scores
relief_values.sort_values(ascending = True, inplace = True) # sort series by ReliefF score
relief_values
peakrpm            -105.0
boreratio           -21.0
stroke              -15.0
enginesize          -13.0
compressionratio     -9.0
horsepower           -5.0
wheelbase             3.0
carwidth              9.0
highwaympg           13.0
citympg              17.0
carlength            19.0
carheight            29.0
curbweight          109.0
dtype: float64

When using the original Relief or ReliefF, it has been suggested that features yielding a negative score can be confidently filtered out. Here, the feature \(horsepower\) receives a negative score, which implies it is redundant. Yet common sense tells us that horsepower is one of the strongest factors affecting the price of a car. That is why you should be careful when applying this feature selection technique; the safest way out is to try several feature selection methods and look for a general pattern.
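
For instance, the features that this negative-score rule of thumb would discard can be listed directly from the series above (a minimal sketch):

# Features with a negative ReliefF score -- candidates for removal
# under the negative-score rule of thumb
negative_features = relief_values[relief_values < 0].index.tolist()
print(negative_features)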

# Keep and print features whose ReliefF score exceeds 15
relief_features = []
for feature_name, feature_score in zip(X.columns,fs.feature_scores):
    if feature_score>15:
        relief_features.append(feature_name)
        print(feature_name, '\t', feature_score)
carlength 	 19.0
carheight 	 29.0
curbweight 	 109.0
citympg 	 17.0
# Selected features that satisfy the criterion
relief_features
['carlength', 'carheight', 'curbweight', 'citympg']
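
Because n_features_to_keep=4 was set when the selector was created, the fitted object can also reduce a feature matrix directly. A minimal sketch, assuming the ReliefF package’s transform method applies the learned selection to new data:

# Apply the learned 4-feature selection to the test set
X_test_relief = fs.transform(X_test.values)
print(X_test_relief.shape) # (number of test rows, 4)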

2.4. Correlation-based Feature Selection#

# Correlation matrix of the features and the target variable
cor = data[['wheelbase', 
          'carlength', 
          'carwidth', 
          'carheight', 
          'curbweight', 
          'enginesize', 
          'boreratio', 
          'stroke',
          'compressionratio', 
          'horsepower', 
          'peakrpm', 
          'citympg', 
          'highwaympg',
          'price']].corr()
cor_target = abs(cor['price'])

# Select features whose absolute correlation with price exceeds 0.8
# ([:-1] drops the price-price correlation itself)
relevant_features = cor_target[:-1][cor_target>0.8]
relevant_features
curbweight    0.835305
enginesize    0.874145
horsepower    0.808138
Name: price, dtype: float64
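
A full correlation-based selection would also check that the chosen features are not highly correlated with each other, since strongly inter-correlated features carry redundant information. A minimal sketch of that second check on the three features selected above:

# Pairwise correlations among the selected features; a pair with a very
# high correlation suggests that one of the two features is redundant
data[['curbweight', 'enginesize', 'horsepower']].corr()

If two of these correlate very strongly with each other (say, above 0.9), keeping just one of the pair is usually enough.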

3. Comparing Four Methods#

print('The features selected by chi-square are: \n \n {} \n \n \n '
      'The features selected by f_values are: \n \n {} \n \n \n '
      'The features selected by ReliefF are: \n \n {} \n \n \n '
      'The features selected by Correlation-based feature selection method are: '
      '\n \n {}'.format(chi_features, f_values, relief_features, relevant_features))
The features selected by chi-square are: 
 
 curbweight           0.000000e+00
horsepower           0.000000e+00
peakrpm              0.000000e+00
enginesize          1.202428e-304
citympg              2.472903e-07
highwaympg           4.243871e-05
compressionratio     1.514196e-04
carlength            9.334407e-01
wheelbase            1.000000e+00
carwidth             1.000000e+00
carheight            1.000000e+00
boreratio            1.000000e+00
stroke               1.000000e+00
dtype: float64 
 
 
 The features selected by f_values are: 
 
 wheelbase            0
boreratio            1
stroke               2
enginesize           3
highwaympg           4
horsepower           5
citympg              6
carwidth             7
carlength            8
peakrpm              9
carheight           10
compressionratio    11
curbweight          12
dtype: int64 
 
 
 The features selected by ReliefF are: 
 
 ['carlength', 'carheight', 'curbweight', 'citympg'] 
 
 
 The features selected by Correlation-based feature selection method are: 
 
 curbweight    0.835305
enginesize    0.874145
horsepower    0.808138
Name: price, dtype: float64
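
Since the four methods disagree, a simple way to see the general pattern is to count how many methods pick each feature. A minimal sketch using the selections computed above (taking the top 4 features from the chi-square and Fisher rankings):

# Count how many of the four methods selected each feature
from collections import Counter

selections = [
    chi_features[:4].index.tolist(),       # top 4 by chi-square p-value
    f_values.nlargest(4).index.tolist(),   # top 4 by Fisher score
    relief_features,                       # ReliefF selection
    relevant_features.index.tolist(),      # correlation-based selection
]
votes = Counter(feature for selected in selections for feature in selected)
print(votes.most_common()) # features chosen by more methods come first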