Basis Expansion

  • This is supplementary material for the Machine Learning Simplified book. It focuses on Python implementations of the topics discussed; detailed explanations can be found in the book.

  • I also assume you know Python syntax and how it works. If you don’t, I highly recommend that you take a break and get familiar with the language before going forward with my code.

  • This material can be downloaded as a Jupyter notebook (Download button in the upper-right corner -> .ipynb) to reproduce the code and play around with it.

1. Required Libraries & Data

Before we start, we need to import a few libraries that we will use in this Jupyter notebook.

import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina' # sharper plots

# Defined data
X_train = [30, 46, 60, 65, 77, 95]
y_train = [31, 30, 80, 49, 70, 118]

X_test = [17, 40, 55, 57, 70, 85]
y_test = [19, 50, 60, 32, 90, 110]

Let’s visualize the data on the graph.

plt.figure(figsize=(6, 4))
plt.scatter(X_train, y_train, color='blue', label='Training set')
plt.scatter(X_test, y_test, color='red', label='Test set')
plt.title('The data')
plt.legend(loc='best')
../_images/basis_expansion_3_1.png

2. Building Four Polynomial Models

2.1. First-degree polynomial

# build a model
degrees = 1
p = np.poly1d(np.polyfit(X_train, y_train, degrees))
t = np.linspace(0, 100, 100)

# visualize
# plot the regression curve
plt.plot(X_train, y_train, 'o', t, p(t), '-')
# plot the training set
plt.scatter(X_train, y_train, color='blue', label='Training set')
# plot the test set
plt.scatter(X_test, y_test, color='red', label='Test set')

# plot configuration
plt.legend(loc='best')
plt.xlim((0,100))
plt.ylim((0,130))
plt.show()
../_images/basis_expansion_5_0.png
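
The first-degree model is plotted above but not scored. The following is a minimal sketch, reusing the same data and the SSR definition used for the higher-degree models, of how its coefficients and errors could be inspected. The names p1, SSR_train_1 and SSR_test_1 are illustrative and do not come from the book.

```python
import numpy as np

# Same data as defined at the top of this page
X_train = [30, 46, 60, 65, 77, 95]
y_train = [31, 30, 80, 49, 70, 118]
X_test = [17, 40, 55, 57, 70, 85]
y_test = [19, 50, 60, 32, 90, 110]

# Fit a straight line; coefficients are ordered [slope, intercept]
p1 = np.poly1d(np.polyfit(X_train, y_train, 1))

# Sum of squared residuals on both sets
SSR_train_1 = np.sum((p1(X_train) - np.array(y_train)) ** 2)
SSR_test_1 = np.sum((p1(X_test) - np.array(y_test)) ** 2)

print('slope = {:.3f}, intercept = {:.3f}'.format(p1.coef[0], p1.coef[1]))
print('SSR_train = {:.2f}, SSR_test = {:.2f}'.format(SSR_train_1, SSR_test_1))
```

Because a straight line is a special case of a higher-degree polynomial, its training SSR cannot be lower than that of a least-squares fit of degree 2 or more on the same data.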

2.2. Second-degree polynomial

# build a model
degrees = 2
p = np.poly1d(np.polyfit(X_train, y_train, degrees))
t = np.linspace(0, 100, 100)

# visualize
plt.plot(X_train, y_train, 'o', t, p(t), '-')
plt.scatter(X_train, y_train, color='blue', label='Training set')
plt.scatter(X_test, y_test, color='red', label='Test set')
plt.legend(loc='best')
plt.xlim((0,100))
plt.ylim((0,130))
plt.show()
../_images/basis_expansion_7_0.png

Let’s see the estimated coefficients of the model (np.poly1d stores them from the highest degree to the lowest):

list(p.coef)
[0.014425999538340081, -0.4973416247674718, 31.898294657797386]

Let’s see their absolute sum:

sum(abs(p.coef))
32.4100622821032
#or 
31.9 + 0.5 + 0.014
32.414
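
The absolute sum of the coefficients is the L1 norm of the coefficient vector. As a small sketch, assuming the same training data, the same number can be obtained with np.linalg.norm; the names p2, abs_sum and l1_norm are illustrative.

```python
import numpy as np

X_train = [30, 46, 60, 65, 77, 95]
y_train = [31, 30, 80, 49, 70, 118]

# Second-degree fit, as in this section
p2 = np.poly1d(np.polyfit(X_train, y_train, 2))

# Two equivalent ways to sum the absolute values of the coefficients
abs_sum = np.sum(np.abs(p2.coef))
l1_norm = np.linalg.norm(p2.coef, ord=1)

print(abs_sum, l1_norm)
```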

We can use the fitted model p(t) to predict the price of any apartment, given its area. Let’s predict the price of a 30-square-meter apartment.

p(30) # price in units of 10,000, i.e. about 299,614
29.961445499279304
#alternatively:
def f(x):
    return np.array([(31.9 - 0.5 * i + 0.014 * i**2) for i in x])
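
The hand-written function f loops over the inputs one by one. A possible vectorized variant, sketched here with the same rounded coefficients, uses np.polyval, which expects coefficients from the highest degree to the lowest. The names rounded_coef and f_vectorized are illustrative.

```python
import numpy as np

# Rounded second-degree coefficients from the text, highest degree first
rounded_coef = [0.014, -0.5, 31.9]

def f_vectorized(x):
    """Evaluate 31.9 - 0.5*x + 0.014*x**2 for every element of x."""
    return np.polyval(rounded_coef, np.asarray(x, dtype=float))

print(f_vectorized([30]))
```

For an area of 30 this evaluates to 31.9 - 15 + 12.6 = 29.5, slightly below p(30) because the coefficients are rounded.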

2.2.1. Calculate SSR_training and SSR_test

predict_train = p(X_train)
SSR_train = sum((predict_train-y_train)**2)

predict_test = p(X_test)
SSR_test = sum((predict_test-y_test)**2)

print('SSR_train = {} \n \n SSR_test = {}'.format(SSR_train, SSR_test))
SSR_train = 994.7785614408572 
 
 SSR_test = 1530.3762231241067
# the same calculation with the rounded function f
predict_train = f(X_train)
SSR_train = sum((predict_train-y_train)**2)

predict_test = f(X_test)
SSR_test = sum((predict_test-y_test)**2)

print('SSR_train = {} \n \n SSR_test = {}'.format(SSR_train, SSR_test))
SSR_train = 1027.0004120000003 
 
 SSR_test = 1757.0769119999998
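
The SSR calculation is repeated for every model on this page. One way to avoid the repetition, shown as a sketch with a hypothetical helper named ssr, is to wrap it in a function that accepts any callable model.

```python
import numpy as np

X_train = [30, 46, 60, 65, 77, 95]
y_train = [31, 30, 80, 49, 70, 118]
X_test = [17, 40, 55, 57, 70, 85]
y_test = [19, 50, 60, 32, 90, 110]

def ssr(model, X, y):
    """Sum of squared residuals of a callable model on the points (X, y)."""
    residuals = model(np.asarray(X, dtype=float)) - np.asarray(y, dtype=float)
    return float(np.sum(residuals ** 2))

# Second-degree model, as fitted in this section
p2 = np.poly1d(np.polyfit(X_train, y_train, 2))

print('SSR_train = {}'.format(ssr(p2, X_train, y_train)))
print('SSR_test = {}'.format(ssr(p2, X_test, y_test)))
```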

2.3. Fourth-degree polynomial

# build a model
degrees = 4
p = np.poly1d(np.polyfit(X_train, y_train, degrees))
t = np.linspace(0, 100, 100)

# visualize
plt.plot(X_train, y_train, 'o', t, p(t), '-')
plt.scatter(X_train, y_train, color='blue', label='Training set')
plt.scatter(X_test, y_test, color='red', label='Test set')
plt.legend(loc='best')
plt.ylim((0,120))
plt.show()
../_images/basis_expansion_20_0.png

Let’s see the estimated coefficients of the model (np.poly1d stores them from the highest degree to the lowest):

list(p.coef)
[8.489668977511541e-05,
 -0.020758975169594147,
 1.8214724130889242,
 -66.4626504642182,
 876.8597601245539]

Let’s see their absolute sum:

sum(abs(p.coef))
945.1647268737204
#alternatively:
def f(x):
    return np.array([(876.9-66.46*i+1.821*i**2-0.02076*i**3+0.0000849*i**4) for i in x])
f([30])
array([30.249])

We can use the fitted model p(t) to predict the price of any apartment, given its area. Let’s predict the price of a 30-square-meter apartment.

p(30)
30.579407116841026

Let’s calculate SSR_training and SSR_test:

predict_train = p(X_train)
SSR_train = sum((predict_train-y_train)**2)

predict_test = p(X_test)
SSR_test = sum((predict_test-y_test)**2)

print('SSR_train = {} \n \n SSR_test = {}'.format(SSR_train, SSR_test))
SSR_train = 651.4179373305931 
 
 SSR_test = 29010.616059824526
predict_train
array([ 30.57940712,  33.33905077,  62.72388224,  67.03384222,
        65.96521691, 118.35860073])
f(X_train)
array([ 30.249    ,  32.4166544,  61.044    ,  65.0280625,  63.0571009,
       113.7780625])
# the same calculation with the rounded function f
predict_train = f(X_train)
SSR_train = sum((predict_train-y_train)**2)

predict_test = f(X_test)
SSR_test = sum((predict_test-y_test)**2)

print('SSR_train = {} \n \n SSR_test = {}'.format(SSR_train, SSR_test))
SSR_train = 688.6615471596378 
 
 SSR_test = 29379.046097639017

2.4. Fifth-degree polynomial

# build a model
degrees = 5
p = np.poly1d(np.polyfit(X_train, y_train, degrees))
t = np.linspace(0, 100, 100)

# visualize
plt.plot(X_train, y_train, 'o', t, p(t), '-')
plt.scatter(X_train, y_train, color='blue', label='Training set')
plt.scatter(X_test, y_test, color='red', label='Test set')
plt.legend(loc='best')
plt.ylim((0,120))
plt.show()
../_images/basis_expansion_35_0.png

Let’s see the estimated coefficients of the model (np.poly1d stores them from the highest degree to the lowest):

list(p.coef)
[-3.0177085755377384e-05,
 0.00944944287510749,
 -1.1443256656628589,
 66.75349695585578,
 -1866.2074401186833,
 19915.12337120615]

Let’s see their absolute sum:

sum(abs(p.coef))
21849.238113566313
#or
3.017709e-05+0.009449443+1.144326+66.7535+1866.21+19915.1
21849.217305620088
#alternatively:
def f(x):
    return np.array([(-3.017709e-05*i**5
                      +0.009449443*i**4
                      -1.144326*i**3
                      +66.7535*i**2
                      -1866.21*i
                      +19915.1) for i in x])

We can use the fitted model p(t) to predict the price of any apartment, given its area. Let’s predict the price of a 12-square-meter apartment.

p(12)
5344.177524015313
f([12])
array([5344.12329639])
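
The predicted value above is far larger than any price in the data, whose training prices range from 30 to 118. Note that an area of 12 lies outside the range of training areas (30 to 95), so the model is extrapolating. The sketch below shows one simple way to flag such inputs; the helper outside_training_range is hypothetical and not part of the book's code.

```python
X_train = [30, 46, 60, 65, 77, 95]

def outside_training_range(x, X):
    """Return True if x is smaller or larger than every training input."""
    return x < min(X) or x > max(X)

print(outside_training_range(12, X_train))  # True: 12 is below the smallest area, 30
print(outside_training_range(30, X_train))  # False: 30 is a training area
```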

Let’s calculate SSR_training and SSR_test:

predict_train = p(X_train)
SSR_train = sum((predict_train-y_train)**2)

predict_test = p(X_test)
SSR_test = sum((predict_test-y_test)**2)

print('SSR_train = {} \n \n SSR_test = {}'.format(SSR_train, SSR_test))
SSR_train = 3.163138662402778e-20 
 
 SSR_test = 6719065.318875373
# the same calculation with the rounded function f
predict_train = f(X_train)
SSR_train = sum((predict_train-y_train)**2)

predict_test = f(X_test)
SSR_test = sum((predict_test-y_test)**2)

print('SSR_train = {} \n \n SSR_test = {}'.format(SSR_train, SSR_test))
SSR_train = 0.6025432434314306 
 
 SSR_test = 6718669.713593046
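
To compare the models side by side, the fits can be repeated in a loop. The following is a sketch under the same data and SSR definition as above; the results dictionary and its layout are illustrative choices. A fifth-degree polynomial has six coefficients, as many as there are training points, which is why it can pass through every training point and drive the training SSR to practically zero.

```python
import numpy as np

X_train = np.array([30, 46, 60, 65, 77, 95])
y_train = np.array([31, 30, 80, 49, 70, 118])
X_test = np.array([17, 40, 55, 57, 70, 85])
y_test = np.array([19, 50, 60, 32, 90, 110])

# degree -> (SSR_train, SSR_test, sum of absolute coefficients)
results = {}
for degree in (1, 2, 4, 5):
    model = np.poly1d(np.polyfit(X_train, y_train, degree))
    ssr_train = np.sum((model(X_train) - y_train) ** 2)
    ssr_test = np.sum((model(X_test) - y_test) ** 2)
    results[degree] = (ssr_train, ssr_test, np.sum(np.abs(model.coef)))

print('degree  SSR_train      SSR_test   sum|coef|')
for degree, (tr, te, s) in results.items():
    print('{:>6}  {:>9.2f}  {:>12.2f}  {:>10.2f}'.format(degree, tr, te, s))
```

In the outputs shown on this page, the training SSR shrinks as the degree grows from 2 to 5, while the test SSR and the absolute sum of the coefficients grow sharply.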