IEOR 165 Discussion 6: Support Vector Machines

In this discussion we will look at two examples of using support vector machines (SVMs) to analyze real-world data sets. We will first consider a simple two-dimensional example using NFL field goal data, and then we will analyze a famous data set (the MNIST data set) to see whether we can use SVMs to identify numbers from handwritten characters.

Before we begin, let's review the theoretical foundations of SVMs. Over the majority of this course we have been looking at problems of regression, that is, problems where we are interested in estimating a continuous relationship from our data (such as a likelihood function or linear model). However, SVMs are designed to answer questions of classification (sometimes referred to as questions of detection), in which we are interested in finding a discrete relationship in our data.

In the abstract setting we assume (just as with linear models) that our data comes in pairs $(x,y)$; however, where before we assumed $y \in \mathbb{R}$, we will now assume that $y$ belongs to one of two classes, $y \in \{-1,1\}$. When computing a linear SVM we are interested in finding a model of the form: $$y_i = \text{sign}(\beta^T x_i + \beta_0)\epsilon_i $$

where $\epsilon_i$ is i.i.d. noise with $\epsilon_i \in \{-1,1\}$. In words, we assume that each data point is assigned a value of $-1$ or $1$ depending on which side of a hyperplane it lies on, but with some probability the sign of the label may be flipped, which makes it difficult to infer the true hyperplane. As described in lecture, we want to find the value of $\beta$ which creates the greatest separation between the inequalities: $$\beta^T x + \beta_0 \geq 1 $$ $$\beta^T x + \beta_0 \leq -1 $$

This is equivalent to finding the values of $\beta$ which create the greatest separation between the data and the hyperplane $\beta^T x + \beta_0 = 0$. In the classification community the distance between the closest data point and this hyperplane is called the margin of the classifier, and hence SVMs are referred to as maximum margin estimators; since this margin is proportional to $1/\|\beta\|_2$, maximizing it amounts to minimizing $\|\beta\|_2^2$. In general, the optimization problem we need to solve for fitting a (soft-margin) linear SVM has the form: $$ \min_{\beta,\beta_0,u}\Big\{ \|\beta\|_2^2 + \lambda\sum_{i=1}^n u_i : y_i(\beta^Tx_i + \beta_0) \geq 1 - u_i,\ u_i \geq 0 \quad \forall i = 1,\dots,n\Big\}$$

Fortunately, this optimization problem can be solved using specialized software packages (which in fact often solve the dual form). In this discussion we will focus on how we can implement this sort of classifier to analyze real-world data.
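To make the connection between the formulation above and the code that follows concrete, here is a minimal sketch of the soft-margin primal problem solved directly with a general-purpose convex solver. It assumes the cvxpy package is available and uses synthetic two-dimensional data; in the examples below we instead rely on scikit-learn, which solves the same problem internally.

import cvxpy as cp
import numpy as np

# synthetic toy data: two clusters of 20 points each, labeled +1 and -1
np.random.seed(0)
X_toy = np.vstack([np.random.randn(20, 2) + 2, np.random.randn(20, 2) - 2])
y_toy = np.hstack([np.ones(20), -np.ones(20)])

lam = 1.0                                   # trade-off parameter lambda
beta = cp.Variable(2)                       # hyperplane normal vector
beta0 = cp.Variable()                       # intercept beta_0
u = cp.Variable(40, nonneg=True)            # slack variables u_i >= 0

margins = cp.multiply(y_toy, X_toy @ beta + beta0)
problem = cp.Problem(cp.Minimize(cp.sum_squares(beta) + lam*cp.sum(u)),
                     [margins >= 1 - u])
problem.solve()
print(beta.value, beta0.value)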

Example 1: Predicting Success of Field Goal Kick

Consider the following data set of all field goal attempts in the 2003 season of the NFL.

In [152]:
import pandas as pd
nfl_data = pd.read_excel('nfl_kick_data.xlsx')
nfl_data.iloc[:20,:]
Out[152]:
Yards Success Week
0 30 1 1
1 41 0 1
2 50 0 1
3 22 1 1
4 33 1 1
5 44 0 1
6 40 0 1
7 55 0 1
8 49 0 1
9 51 0 1
10 27 1 1
11 39 0 1
12 26 1 1
13 38 0 1
14 36 0 1
15 50 0 1
16 23 1 1
17 24 1 1
18 35 0 1
19 43 0 1

Suppose we wanted to construct a linear SVM model to analyze this data: what should be our predictors and what should be our labels?

One option is to predict the success of each kick based on the week of the season in which the attempt took place and the yardage from which the kick was attempted. As a linear SVM model this would look like: $$\text{success}_i = \text{sign}(\beta_1 \text{yards}_i + \beta_2\text{week}_i + \beta_0)\epsilon_i $$

Before we begin our analysis we need to process our data so that it fits the SVM paradigm; this means first ensuring that our labels lie in $\{-1,1\}$. We note that Success takes values in $\{0,1\}$, but we can easily transform it by adding a new column $\text{Success\_svm} = 2\cdot \text{Success} - 1$, which turns all 0s into -1s while all 1s remain 1s. By convention we keep successes (or objects of detection) as 1s and failures (or undesirable cases) as -1s. Applying this transformation gives the following augmented data set:

In [153]:
nfl_data['Success_svm'] = 2*nfl_data['Success'] -1
nfl_data.iloc[:20,:]
Out[153]:
Yards Success Week Success_svm
0 30 1 1 1
1 41 0 1 -1
2 50 0 1 -1
3 22 1 1 1
4 33 1 1 1
5 44 0 1 -1
6 40 0 1 -1
7 55 0 1 -1
8 49 0 1 -1
9 51 0 1 -1
10 27 1 1 1
11 39 0 1 -1
12 26 1 1 1
13 38 0 1 -1
14 36 0 1 -1
15 50 0 1 -1
16 23 1 1 1
17 24 1 1 1
18 35 0 1 -1
19 43 0 1 -1

Now we are almost ready to begin our analysis. Recall that models can be sensitive to scaling issues in our data, and the values of Yards are significantly larger than those of Week. We will normalize Yards (dividing by its maximum value) and add it as the column Yards_norm to ensure we don't have scaling issues.

In [154]:
nfl_data['Yards_norm'] = nfl_data['Yards']/nfl_data['Yards'].max()   # scale Yards to (0, 1]
nfl_data.iloc[:20,:]
Out[154]:
Yards Success Week Success_svm Yards_norm
0 30 1 1 1 0.483871
1 41 0 1 -1 0.661290
2 50 0 1 -1 0.806452
3 22 1 1 1 0.354839
4 33 1 1 1 0.532258
5 44 0 1 -1 0.709677
6 40 0 1 -1 0.645161
7 55 0 1 -1 0.887097
8 49 0 1 -1 0.790323
9 51 0 1 -1 0.822581
10 27 1 1 1 0.435484
11 39 0 1 -1 0.629032
12 26 1 1 1 0.419355
13 38 0 1 -1 0.612903
14 36 0 1 -1 0.580645
15 50 0 1 -1 0.806452
16 23 1 1 1 0.370968
17 24 1 1 1 0.387097
18 35 0 1 -1 0.564516
19 43 0 1 -1 0.693548

Now let's construct our SVM model:

In [155]:
from sklearn import svm
import numpy as np

X = nfl_data[['Yards_norm','Week']].to_numpy()   # predictor matrix
Y = nfl_data['Success_svm'].to_numpy()           # labels in {-1, 1}
model = svm.SVC(kernel='linear')
model.fit(X,Y)
Out[155]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

We can inspect the results of our model by plotting a scatter plot of the data and seeing where our linear separator lies in the plane.

In [156]:
%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure()
beta_vals = model.coef_[0]      # hyperplane normal vector (beta_1, beta_2)
beta_null = model.intercept_    # intercept beta_0
print(beta_vals)
print(beta_null)
xx = np.linspace(0.2,.6,1000)
plt.plot(xx, -beta_vals[0]/beta_vals[1]*xx - beta_null/beta_vals[1])
plt.scatter(X[:, 0], X[:, 1], c=Y, alpha=0.8)
plt.autoscale(tight=True)    
[-9.79436688 -0.26329243]
[-0.99795306]
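As a quick illustration (not part of the original notebook), we can also use the fitted model to predict the outcome of hypothetical kicks, say a 25-yard and a 50-yard attempt in week 10. Note that new inputs must be normalized in the same way as the training data, so we divide the yardage by the maximum observed value of Yards.

max_yards = nfl_data['Yards'].max()
new_kicks = np.array([[25.0/max_yards, 10],
                      [50.0/max_yards, 10]])
print(model.predict(new_kicks))   # +1 means predicted success, -1 means predicted miss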

Example 2: Using SVMs to Identify Handwritten Characters from Images (the MNIST data set)

In this example we will identify handwritten numbers using the MNIST (Modified National Institute of Standards and Technology) data set. MNIST is a famous data set used to calibrate various machine learning and statistical methods; the task is to identify numerical characters from handwritten grayscale 28x28 pixel images, which were collected from American Census Bureau employees and high school students. Here is what these images look like:

The data we have is the label of each image (which number it is supposed to be) and the shade value of each pixel in the picture. The full MNIST training set contains 60,000 images, but below we show the first 20.

In [32]:
import pandas as pd
mnist_train_data = pd.read_csv('mnist_train.csv',names=range(785))
In [40]:
mnist_train_data.iloc[:20,:]
Out[40]:
0 1 2 3 4 5 6 7 8 9 ... 775 776 777 778 779 780 781 782 783 784
0 5 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11 5 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
13 6 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
17 8 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18 6 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

20 rows × 785 columns
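To see what these images actually look like, we can reshape one row of the pixel columns into a 28x28 array and display it. The snippet below is a minimal sketch based on the layout described above (column 0 holds the label and columns 1 through 784 hold the pixel shades).

import matplotlib.pyplot as plt

first_image = mnist_train_data.iloc[0, 1:].values.reshape(28, 28)   # pixels of the first image
plt.imshow(first_image, cmap='gray')
plt.title('Label: {}'.format(mnist_train_data.iloc[0, 0]))
plt.show()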

For this example we will construct an SVM model to distinguish the number 3 from the number 7 based on handwritten image data. So first let's take only the rows labeled 3 or 7 from our data set.

In [37]:
# keep only the rows whose label (column 0) is a 3 or a 7
data_37 = mnist_train_data[(mnist_train_data[0] == 7) | (mnist_train_data[0] == 3)].copy()
In [41]:
data_37.iloc[:20,:]
Out[41]:
0 1 2 3 4 5 6 7 8 9 ... 775 776 777 778 779 780 781 782 783 784
7 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
27 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
29 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
30 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
38 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
42 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
44 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
49 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
50 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
52 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
71 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
74 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
79 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
84 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
86 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
91 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
96 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

20 rows × 785 columns

Let us construct an SVM model which will detect whether or not an image corresponds to the number 3. This means we are interested in a model of the form:

$$\text{is 3}_i = \text{sign}(\sum_{j=1}^{784}\text{pixel}_{i,j}\cdot\beta_{j} + \beta_0)\epsilon_i$$

Again let us do some preprocessing to ensure that our SVM labels follow the $\{-1,1\}$ paradigm, so let's add a new column 'is_3' which will be 1 if the label is 3 and -1 if the label is 7.

In [101]:
data_37['is_3'] = 2*(data_37.iloc[:,0] == 3) - 1

data_37.iloc[:20,:]
Out[101]:
0 1 2 3 4 5 6 7 8 9 ... 776 777 778 779 780 781 782 783 784 is_3
7 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
10 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
12 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
15 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
27 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
29 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
30 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
38 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
42 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
44 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
49 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
50 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
52 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
71 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
74 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
79 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
84 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
86 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
91 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1
96 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 -1

20 rows × 786 columns

Now let's train a linear SVM classifier on this data and see how it performs.

In [106]:
var_names = list(data_37)                   # column names: [0, 1, ..., 784, 'is_3']
X = data_37[var_names[1:-1]].to_numpy()     # pixel columns 1 through 784
Y = data_37['is_3'].to_numpy()              # labels in {-1, 1}
model = svm.SVC(kernel='linear')
model.fit(X,Y)

print(model.coef_[0])
print(model.intercept_)
[  0.00000000e+00   0.00000000e+00   0.00000000e+00 ...,   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
[-3.83237478]
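Since each coefficient multiplies one pixel, we can get a rough sense of which pixels drive the 3-versus-7 decision by reshaping the coefficient vector back into a 28x28 image (an illustrative extra step, not part of the original analysis).

import matplotlib.pyplot as plt

plt.imshow(model.coef_[0].reshape(28, 28), cmap='gray')   # weight assigned to each pixel
plt.title('Linear SVM pixel weights (3 vs. 7)')
plt.show()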

To test the predictive power of our model let's see what prediction error it achieves on the testing set.

In [127]:
mnist_test_data = pd.read_csv('mnist_test.csv',names=range(785))
test_37 = mnist_test_data[(mnist_test_data[0] == 7) | (mnist_test_data[0] == 3)].copy()
test_37['is_3'] = np.where(test_37[0] == 3, 1, -1)   # 1 for a 3, -1 for a 7
In [126]:
X_test = test_37[var_names[1:-1]].to_numpy()
Y_test = test_37['is_3'].to_numpy()
model.score(X_test,Y_test)
Out[126]:
0.97644749754661431

So simply using a linear kernel we were able to achieve >97% accuracy in distinguishing between handwritten 3's and 7's. However, prediction accuracy could potentially be improved by considering a different kernel function. Recall that the linear kernel is given by the dot product $K(x_1,x_2) = x_1 \cdot x_2$ and corresponds to using the features already in our data set and separating them with a hyperplane. By using nonlinear kernels we can capture more complicated relationships in our data, since the kernels implicitly extract additional features (such as different powers, trigonometric transforms, etc.) from our data. This corresponds to projecting our data onto a higher-dimensional space and finding the best separating hyperplane in this projected space. When performing this operation we need to watch out for overfitting, since we are potentially adding infinitely many new features to our data set, and we lose some interpretability, since we can no longer directly retrieve the hyperplane parameters ($\beta$) of the lifted classifier.

Much like in the case of kernel density estimation, a commonly used kernel for nonlinear SVMs is the Gaussian (or normal) kernel, given by $K(x_1,x_2) = \exp(-\gamma \|x_1-x_2\|^2)$. This sort of kernel heuristically allows us to incorporate all polynomial information about our data, since: $$\exp(-\gamma \|x_1-x_2\|^2) = \sum_{k=0}^\infty \frac{(-\gamma \|x_1-x_2\|^2)^k}{k!} $$

So essentially, by using this kernel it is as if we are extracting an "infinite" number of features from our data set. Let us train an SVM with this kernel on our data and test its predictive ability.

In [128]:
model_normal = svm.SVC(kernel='rbf')
model_normal.fit(X,Y)
model_normal.score(X_test,Y_test)
Out[128]:
0.50441609421000977

Our prediction score decreased when we switched to a Gaussian kernel. Why might this occur?

One potential reason is overfitting, but note that with this new kernel we have also introduced a parameter $\gamma$. Running the model with the default $\gamma$ value may not be the best course of action, and in general we would want to use some form of cross-validation to choose $\gamma$ in order to have a truly fair comparison of our two models.
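Below is a sketch of how one might tune $\gamma$ by cross-validation with scikit-learn's GridSearchCV (assuming a scikit-learn version that provides sklearn.model_selection). The grid of $\gamma$ values is an illustrative assumption rather than a tuned choice, and running the search on the full training set can be slow; rescaling the pixel values to $[0,1]$, as we did with Yards in Example 1, also typically helps the Gaussian kernel.

from sklearn.model_selection import GridSearchCV

param_grid = {'gamma': [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]}   # illustrative grid of gamma values
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=3)
search.fit(X, Y)                                          # cross-validate on the training data
print(search.best_params_)
print(search.score(X_test, Y_test))                       # evaluate the selected model on the test set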