IEOR 165 Discussion 5: Cross Validation and Density Estimation

In the previous discussions we examined some theoretical properties of estimators and estimation methods. In this discussion we will look at more computational issues related to estimation, namely cross validation and density estimation.

Example 1: Cross Validating Regression Models

In this section we will fit a linear model to a real-world data set using cross validation. Below is data on agricultural output and related resources for 22 countries in 1950.

In [14]:
import pandas as pd

# Load the farming data and index the rows by country name
farm_data = pd.read_csv('farming_data.csv')
farm_data.set_index('Country', inplace=True)
print(farm_data.head(20))
              Agricultural_Output  Population_In_Agriculture  Arable_Land  \
Country                                                                     
U.S.                        17346                       8851       468033   
Canada                       1935                       1272        91266   
U.K.                         2102                       1221        18960   
Norway                        290                        495         2018   
France                       3137                       7490        52796   
W.Germany                    1790                       4247        21399   
Argentina                    1373                       1570        79789   
Denmark                       614                        540         6656   
Netherlands                   610                        746         2760   
South Africa                  438                       2651        25071   
Ireland                       388                        594         3877   
Poland                       2769                       7035        41755   
Chile                         185                        732        14688   
PuertoRico                    160                        246         1023   
Japan                        2346                      18623        14852   
Italy                        3348                       9127        38337   
Mexico                        604                       3803        29640   
Greece                        411                       1507         8255   
Turkey                       1199                       5724        37552   
Egypt                         885                       7558         6039   

              Conversion_Ratio  Productive_Livestock  Work_Stock  \
Country                                                            
U.S.                      0.02                 77474         825   
Canada                    0.02                  7655        1796   
U.K.                      0.02                  9604         625   
Norway                    0.02                  1251         198   
France                    0.02                 26070        2613   
W.Germany                 0.02                 10202        1628   
Argentina                 0.02                 32294         772   
Denmark                   0.02                  2903         532   
Netherlands               0.02                  2122         276   
South Africa              0.05                 13690         158   
Ireland                   0.02                  3661         526   
Poland                    0.02                  6467        2541   
Chile                     0.02                  2699         613   
PuertoRico                0.05                   336          60   
Japan                     0.02                  1094        1989   
Italy                     0.02                  8700        1932   
Mexico                    0.02                 12906        6583   
Greece                    0.02                  1645         793   
Turkey                    0.01                 12664        3915   
Egypt                     0.01                  4416        2406   

              Fertilizer_Consumption  Number_of_Tracktors  
Country                                                    
U.S.                          3952.1              3550000  
Canada                         192.1               367828  
U.K.                           765.9               308540  
Norway                          99.6                 9506  
France                         870.8               122624  
W.Germany                     1283.6               109776  
Argentina                       13.9                25000  
Denmark                        198.6                12257  
Netherlands                    368.5                15950  
South Africa                    90.0                39500  
Ireland                         60.0                 9480  
Poland                         158.4                14500  
Chile                           35.4                 6000  
PuertoRico                      65.4                 2150  
Japan                          628.6                 1810  
Italy                          346.3                50590  
Mexico                          13.8                32000  
Greece                          38.1                 2869  
Turkey                           6.3                 3959  
Egypt                           97.5                 5400  

Since we want to form a linear model, what should be our predictors and what should be our response?

We will try to predict net agricultural output using population in agriculture, arable land, conversion ratio, productive livestock, work stock, fertilizer consumption, and number of tractors. The model we will consider is of the form: $$\text{agricultural output}_i = \sum_{j=1}^{7}\beta_j x_{ij} + \beta_0 + \epsilon_i $$

In [41]:
import statsmodels.api as sm
import numpy as np

# Add an explicit intercept column of ones
farm_data['intercept'] = np.ones((len(farm_data),))
var_names = list(farm_data)

# Predictors: every column except Agricultural_Output (including the intercept column);
# response: Agricultural_Output
X = farm_data[var_names[1:]].values
Y = farm_data['Agricultural_Output'].values
result = sm.OLS(Y, X).fit()
result.summary()
Out[41]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     142.6
Date:                Wed, 22 Feb 2017   Prob (F-statistic):           6.36e-12
Time:                        13:41:47   Log-Likelihood:                -165.49
No. Observations:                  22   AIC:                             347.0
Df Residuals:                      14   BIC:                             355.7
Df Model:                           7
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.1154      0.038      3.007      0.009         0.033     0.198
x2             0.0180      0.011      1.687      0.114        -0.005     0.041
x3         -1.115e+04   1.26e+04     -0.883      0.392     -3.82e+04  1.59e+04
x4         -5.851e-05      0.026     -0.002      0.998        -0.057     0.057
x5            -0.0942      0.048     -1.972      0.069        -0.197     0.008
x6             0.9762      0.545      1.793      0.095        -0.192     2.144
x7             0.0011      0.001      0.815      0.429        -0.002     0.004
const        413.4767    335.271      1.233      0.238      -305.609  1132.562
==============================================================================
Omnibus:                        9.893   Durbin-Watson:                   2.706
Prob(Omnibus):                  0.007   Jarque-Bera (JB):                7.543
Skew:                           1.240   Prob(JB):                       0.0230
Kurtosis:                       4.443   Cond. No.                     8.16e+07
==============================================================================

Looking at these diagnostics, what do you make of this fit?

Clearly we have a very high R-squared, but the values of the coefficients range from very small to very large, which could indicate a model selection issue. In the last discussion we saw that the problem of model selection can be addressed by using regularization. Below is the type of problem we would fit, with $\lambda$ and $\mu$ being the two regularization coefficients:

$$\min_{\beta} \sum_{i=1}^{22} \Big(\text{agricultural output}_i - \sum_{j=1}^{7}\beta_j x_{ij} - \beta_0\Big)^2 + \mu\sum_{j=0}^7\beta_j^2 + \lambda\sum_{j=0}^7|\beta_j| $$

What sort of regularization/regression problem would we be solving if $\lambda > 0$ and $\mu = 0$? What if $\lambda = 0, \mu > 0$? And what if $\lambda > 0, \mu > 0$?
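
With $\mu = 0$ and $\lambda > 0$ only the $\ell_1$ penalty is active, which gives the LASSO; with $\lambda = 0$ and $\mu > 0$ only the $\ell_2$ penalty is active, which gives ridge regression; with both positive we get the elastic net. Below is a minimal sketch (not an original part of this notebook) of how these three cases map onto the statsmodels fit_regularized interface used later: alpha is the overall penalty weight and L1_wt is the fraction of it placed on the $\ell_1$ term, so roughly $\lambda$ corresponds to alpha*L1_wt and $\mu$ to alpha*(1 - L1_wt). The alpha value used here is an arbitrary illustrative choice.

# Sketch: the three special cases of the penalized problem above, fit on the
# X, Y matrices built earlier; alpha=2.0 is an arbitrary illustrative penalty weight.
lasso_fit = sm.OLS(Y, X).fit_regularized(alpha=2.0, L1_wt=1.0)  # lambda > 0, mu = 0 (LASSO)
ridge_fit = sm.OLS(Y, X).fit_regularized(alpha=2.0, L1_wt=0.0)  # lambda = 0, mu > 0 (ridge)
enet_fit = sm.OLS(Y, X).fit_regularized(alpha=2.0, L1_wt=0.5)   # lambda > 0, mu > 0 (elastic net)
print(lasso_fit.params)
print(ridge_fit.params)
print(enet_fit.params)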

In general it is hard to tell a priori which of these methods we should use for our problem and what values the regularization parameters should take. To answer this question we can use cross validation: we pick several configurations of $\lambda$ and $\mu$ values and test them to see which produces the smallest cross-validation error. A popular method is leave-$k$-out cross validation; however, since we have a small data set of only 22 points, we will use leave-one-out cross validation. Below is the code for performing leave-one-out cross validation for our problem:
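
Concretely, for a fixed configuration $(\lambda, \mu)$, leave-one-out cross validation fits the model 22 times, each time holding out one observation, and scores the configuration by the average squared prediction error on the held-out points. In the standard notation, with $\hat{y}_{-i}$ denoting the prediction for observation $i$ from the model fit without it, the cross-validation error is: $$CV(\lambda,\mu) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_{-i}\big)^2 $$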

In [46]:
import numpy.ma as ma

# Candidate values for the overall penalty weight (alpha) and the L1 fraction (L1_wt)
lambda_vals = [0.0, 2.0, 20.0, 200.0, 2000.0]
mu_fracs = [0.0, 0.25, 0.5, 0.75, 1.0]

# Accumulated cross-validation error for each (alpha, L1_wt) configuration
cross_val_errors = {}
for lambda_par in lambda_vals:
    for mu_frac in mu_fracs:
        cross_val_errors[(lambda_par, mu_frac)] = 0.0

# Masked arrays let us "remove" one observation at a time without copying the data
train_set_pred = np.ma.array(X, mask=False)
train_set_vals = np.ma.array(Y, mask=False)

for ind_i in range(len(farm_data)):
    # Hold out observation ind_i as the test point
    test_set_pred = X[ind_i, :]
    test_set_val = Y[ind_i]
    train_set_pred[ind_i, :] = ma.masked
    train_set_vals[ind_i] = ma.masked

    # Fit each regularized model on the remaining observations and accumulate
    # its squared prediction error on the held-out point
    for lambda_par in lambda_vals:
        for mu_frac in mu_fracs:
            result_reg = sm.OLS(train_set_vals, train_set_pred).fit_regularized(alpha=lambda_par, L1_wt=mu_frac)
            coeffs = result_reg.params

            cross_val_errors[(lambda_par, mu_frac)] += ((np.dot(test_set_pred, coeffs) - test_set_val)**2)/(21.0)

    # Put the held-out observation back before the next iteration
    train_set_pred[ind_i, :] = test_set_pred
    train_set_vals[ind_i] = test_set_val


for key in cross_val_errors.keys():
    cross_val_errors[key] = np.round(cross_val_errors[key], 2)

    
    
In [50]:
cross_val_df = pd.DataFrame.from_dict(cross_val_errors, orient='index')
print(cross_val_df)    
                        0
(2.0, 0.25)     228720.76
(0.0, 0.75)     209606.91
(200.0, 0.0)    233496.58
(200.0, 0.75)   233548.97
(200.0, 0.25)   233528.27
(20.0, 1.0)     222243.88
(2.0, 1.0)      221279.95
(200.0, 1.0)    233548.87
(0.0, 0.5)      209606.91
(200.0, 0.5)    233549.08
(2000.0, 0.25)  233637.23
(20.0, 0.0)     233037.30
(0.0, 1.0)      209606.91
(2000.0, 0.5)   233627.15
(2.0, 0.75)     224953.45
(0.0, 0.25)     209606.91
(20.0, 0.75)    232063.80
(0.0, 0.0)      209606.91
(20.0, 0.5)     232694.69
(2000.0, 1.0)   233608.79
(2.0, 0.5)      227351.88
(2000.0, 0.0)   233642.41
(2000.0, 0.75)  233617.66
(2.0, 0.0)      229591.04
(20.0, 0.25)    232920.96
In [59]:
# Find the (alpha, L1_wt) configuration with the smallest cross-validation error
minimum_cross_val_error = 999999.0
arg_min = (-10.0, -10.0)

for key in cross_val_errors.keys():
    if cross_val_errors[key] <= minimum_cross_val_error:
        minimum_cross_val_error = cross_val_errors[key]
        arg_min = key

# Translate (alpha, L1_wt) back into the lambda (L1 weight) and mu (L2 weight) of our problem
print("The minimum cross validation error is: " + str(minimum_cross_val_error) + " with lambda=" + str(arg_min[0]*arg_min[1]) + " and mu = " + str((1.0-arg_min[1])*arg_min[0]))
The minimum cross validation error is: 209606.91 with lambda=0.0 and mu = 0.0

In this particular case our cross validation suggests that we can do no better by adding regularization. However, we only tested a small number of parameter values; we can further refine our search using cross validation to see if we can improve our fit, as sketched below.
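
As a minimal sketch (not an original part of this notebook) of such a refinement, we could wrap the leave-one-out loop in a function and re-run it over a denser grid of penalty values near the current best configuration; the grid below is an illustrative choice, not a prescribed one.

# Sketch: leave-one-out cross-validation error as a reusable function,
# so it can be re-run on finer (illustrative) grids of penalty values.
def loocv_error(alpha, l1_wt):
    total = 0.0
    for i in range(len(Y)):
        train_idx = [j for j in range(len(Y)) if j != i]  # all observations except i
        fit = sm.OLS(Y[train_idx], X[train_idx]).fit_regularized(alpha=alpha, L1_wt=l1_wt)
        total += (np.dot(X[i], fit.params) - Y[i])**2
    return total / len(Y)

# Illustrative finer grid around the unregularized solution found above
fine_grid = [(a, w) for a in [0.0, 0.5, 1.0, 2.0, 5.0] for w in [0.0, 0.25, 0.5, 0.75, 1.0]]
best_config = min(fine_grid, key=lambda conf: loocv_error(*conf))
print(best_config)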

Example 2: Density Estimation for a Set of Data

Consider the following data which we are told is generated from a continuous distribution:

In [2]:
data = [4.45559267,   6.74293422, -20.14864735,   3.49075864,   1.3224875, 
  -0.0674377,    2.03760975,   3.89168089,   4.30622691,   5.24243778, 
   4.47118085,   6.38819759,   2.31717906,   9.23465033,  -3.71355397, 
   6.25079944,  4.45762639,   5.3745804,    1.18938308,   2.22251412, 
   7.76006157,  13.27124009,   3.59884813,   6.45666895,   9.19240938, 
   9.67072606,  -0.31431053,  -2.64892822,   1.76057693,   9.23526739, 
   0.12167676,   4.48484103,  12.4229351,   5.2058985,    6.45232824, 
   3.61880038,   6.40043509,   8.31918962,  -4.92504348,   7.08117382, 
  16.41249085,   7.05750976,   3.2652978,    7.59222402,   0.26693817, 
   4.66984232,  10.09794026,   3.40297516,   3.34271601,   0.95943834]

Suppose we are interested in estimating the distribution of this data. One method we could use is to compute the empirical cdf of the data. What is the method for doing so?

Recall the formula for the empirical cdf: $$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{x_i \leq x \} $$

Using this method we can compute the empirical cdf as follows:

In [5]:
import matplotlib.pyplot as plt
%matplotlib inline

data = np.array(data)
data.sort()  # sort the observations in place (not strictly needed below, but tidy)

# Evaluate the empirical cdf on a grid of points
empirical_cdf_x = np.linspace(-30, 30, num=100)
empirical_cdf = np.zeros(100)

# For each grid point, count the fraction of the 50 observations at or below it
for ind in range(100):
    val = empirical_cdf_x[ind]
    for observation in data:
        if observation <= val:
            empirical_cdf[ind] += 1.0/50.0

plt.figure()
plt.plot(empirical_cdf_x, empirical_cdf)
Out[5]:
[<matplotlib.lines.Line2D at 0x84a7f28>]
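
As a side note (not an original part of this notebook), the same empirical cdf values can be computed without the double loop by broadcasting the comparison against all observations and averaging:

# Vectorized equivalent of the double loop above
empirical_cdf = (data[None, :] <= empirical_cdf_x[:, None]).mean(axis=1)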

While this cdf is useful for various purposes (such as random sampling and simulation), we also want an estimate of the underlying probability density. However, finding the density function is quite difficult since we are estimating a continuous distribution. One technique for estimating the density is to use histograms. To form a histogram we construct buckets, each with an upper and lower bound, and count the number of observations we see in each bucket. One thing to note is that it is very difficult to know how many buckets to use and where their cutoffs should be. In practice we need to examine several different histogram schemes and choose the one which gives the most reasonable results. Below are several histograms generated from our data with various bin counts, plus one "automated" bin count using the heuristics built into Python's plotting libraries:

In [75]:
%matplotlib inline
plt.figure()
plt.hist(data,bins=1)
plt.figure()
plt.hist(data,bins=10)
plt.figure()
plt.hist(data,bins=50)
plt.figure()
plt.hist(data,bins='auto')
Out[75]:
(array([  1.,   0.,   0.,   0.,   0.,   1.,   2.,   4.,   9.,  14.,  11.,
          5.,   2.,   1.]),
 array([-20.14864735, -17.53713748, -14.92562761, -12.31411774,
         -9.70260786,  -7.09109799,  -4.47958812,  -1.86807825,
          0.74343162,   3.35494149,   5.96645136,   8.57796124,
         11.18947111,  13.80098098,  16.41249085]),
 <a list of 14 Patch objects>)

Another useful tool for estimating the density function of our data is a kernel estimator. Recall that a suitable kernel function $\frac{1}{h}K(\frac{u}{h})$ with bandwidth $h$ has the following properties:

  1. $K$ has finite support (the scaled kernel is only positive on $[-h,h]$)
  2. $K$ is symmetric: $\frac{1}{h}K(\frac{u}{h})=\frac{1}{h}K(\frac{-u}{h})$
  3. $K>0$ for $-h \leq u \leq h$
  4. $\int_{-h}^h \frac{1}{h}K(\frac{u}{h}) du = 1$
  5. Its maximum value is $\frac{1}{h} \max_u K(u)$

There are many suitable kernel functions and in general it is up to our discretion as analysts which one to use for forming our density estimate. One popular kernel function is the Gaussian kernel: $ \frac{1}{h\sqrt{2\pi}(\Phi(h) - \Phi(-h))}\exp(\frac{-(x-x_i)^2}{2h^2})$. Here $\Phi(\cdot)$ is the cdf of a standard normal distribution and is included to ensure that the truncated integral satisfies property 4 above. As with the histograms, choosing the correct value for the bandwidth $h$ is key to performing a useful analysis of the data. While there are many heuristics for getting a good value for this parameter, it is often better to calibrate it manually and use best judgement. Using this kernel function we can compute the following kernel density estimate for our data:

In [8]:
import numpy as np
from scipy import stats

# Gaussian kernel density estimate; bw_method=None uses scipy's default
# bandwidth heuristic (Scott's rule)
kernel = stats.gaussian_kde(data, bw_method=None)
kernel_x_axis = np.linspace(-30, 30, num=100)
kernel_vals = []
for ind in range(len(kernel_x_axis)):
    val = kernel_x_axis[ind]
    kernel_vals.append(kernel(val))

plt.figure()
plt.plot(kernel_x_axis, kernel_vals)
Out[8]:
[<matplotlib.lines.Line2D at 0x8abfb70>]
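
To make the connection to the kernel formula explicit, below is a minimal sketch (not an original part of this notebook) of a kernel density estimate computed by hand as an average of Gaussian bumps centered at the observations. For simplicity it uses the untruncated Gaussian kernel and a hand-picked bandwidth $h = 2.0$; both are assumptions made purely for illustration.

# Hand-rolled kernel density estimate: average of Gaussian kernels centered
# at each observation; the bandwidth h = 2.0 is an illustrative choice.
h = 2.0
grid = np.linspace(-30, 30, num=100)

manual_kde = np.zeros_like(grid)
for x_i in data:
    manual_kde += np.exp(-(grid - x_i)**2 / (2.0 * h**2)) / (h * np.sqrt(2.0 * np.pi))
manual_kde /= len(data)

plt.figure()
plt.plot(grid, manual_kde)

Varying $h$ in this sketch illustrates the calibration point above: small bandwidths produce a spiky estimate that follows individual observations, while large bandwidths oversmooth the density.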