In the previous discussions we examined some theoretical properties of estimators and estimation methods. In this discussion we turn to more computational aspects of estimation: cross validation and density estimation.
In this section we will fit a linear model to a real-world data set using cross validation. Below is data on agricultural output and related resources for 22 countries in 1950.
import pandas as pd

# Load the farming data and index the rows by country.
farm_data = pd.read_csv('farming_data.csv')
farm_data.set_index('Country', inplace=True)
print(farm_data.head(20))
Since we want to form a linear model, what should be our predictors and what should be our response?
We will try to predict net agricultural output using agricultural population, arable land, conversion ratio, productive livestock, work stock, fertilizer consumption, and number of tractors. The type of model we will consider is of the form: $$\text{agricultural output}_i = \sum_{j=1}^{7}\beta_j x_{ij} + \beta_0 + \epsilon_i $$
import statsmodels.api as sm
import numpy as np

# Append a constant column so the model includes an intercept term.
farm_data['intercept'] = np.ones(len(farm_data))

# The first column is the response; the remaining columns (including
# the intercept) are the predictors.
var_names = list(farm_data)
X = farm_data[var_names[1:]].to_numpy()
Y = farm_data['Agricultural_Output'].to_numpy()

result = sm.OLS(Y, X).fit()
result.summary()
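As a quick supplement to the summary table, the short sketch below (assuming the result object and design matrix X from the fit above) prints the spread of the coefficient magnitudes and the condition number of the design matrix, which hints at scaling and collinearity issues:
# Inspect how widely the fitted coefficient magnitudes vary.
coef_magnitudes = np.abs(result.params)
print("smallest |beta|:", coef_magnitudes.min())
print("largest |beta|:", coef_magnitudes.max())

# A very large condition number indicates nearly collinear predictors,
# which inflates the variance of the coefficient estimates.
print("condition number of X:", np.linalg.cond(X))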
Looking at the diagnostics, what do you make of this fit?
Clearly we have a very high $R^2$, but the values of the coefficients range from very small to very large; this could indicate a model selection issue. In the last discussion we saw that the problem of model selection can be addressed by using regularization. Below is the type of problem we would fit, with $\lambda$ and $\mu$ being the two regularization coefficients:
$$\min_{\beta} \sum_{i=1}^{22} \Big(\text{agricultural output}_i - \sum_{j=1}^{7}\beta_j x_{ij} - \beta_0\Big)^2 + \mu\sum_{j=0}^7\beta_j^2 + \lambda\sum_{j=0}^7|\beta_j| $$ What sort of regularization/regression problem would we be solving if $\lambda > 0$ and $\mu = 0$? What if $\lambda = 0, \mu > 0$? And what if $\lambda > 0, \mu > 0$?
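These three regimes are the lasso ($\lambda > 0, \mu = 0$), ridge regression ($\lambda = 0, \mu > 0$), and the elastic net ($\lambda > 0, \mu > 0$). As a rough sketch of how they map onto statsmodels, whose fit_regularized method takes a total penalty weight alpha and the fraction L1_wt of that weight placed on the $\ell_1$ term (the penalty value below is an arbitrary illustration):
# Illustrative only: an arbitrary total penalty weight of 1.0.
lasso_fit = sm.OLS(Y, X).fit_regularized(alpha=1.0, L1_wt=1.0)  # lambda > 0, mu = 0
ridge_fit = sm.OLS(Y, X).fit_regularized(alpha=1.0, L1_wt=0.0)  # lambda = 0, mu > 0
enet_fit = sm.OLS(Y, X).fit_regularized(alpha=1.0, L1_wt=0.5)   # lambda > 0, mu > 0
print(lasso_fit.params)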
In general it is hard to tell a priori which of these methods we should use for our problem and what values the parameters should take. To answer this question we can use cross validation: we pick some configurations of $\lambda$ and $\mu$ values and test them to see which produces the smallest cross validation error. A popular method is leave-$k$-out cross validation; however, since we have a small data set of only 22 points, we will use leave-one-out cross validation. Below is the code for performing leave-one-out cross validation for our problem:
import numpy.ma as ma

# Candidate grids: lambda_vals are total penalty weights (statsmodels' alpha)
# and mu_fracs are the fractions of that weight placed on the L1 term (L1_wt).
lambda_vals = [0.0, 2.0, 20.0, 200.0, 2000.0]
mu_fracs = [0.0, 0.25, 0.5, 0.75, 1.0]

cross_val_errors = {}
for lambda_par in lambda_vals:
    for mu_frac in mu_fracs:
        cross_val_errors[(lambda_par, mu_frac)] = 0.0

# Masked arrays let us "remove" one observation at a time without copying.
train_set_pred = np.ma.array(X, mask=False)
train_set_vals = np.ma.array(Y, mask=False)

n = len(farm_data)
for ind_i in range(n):
    # Hold out observation ind_i as the test point.
    test_set_pred = X[ind_i][:]
    test_set_val = Y[ind_i]
    train_set_pred[ind_i] = ma.masked
    train_set_vals[ind_i] = ma.masked
    for lambda_par in lambda_vals:
        for mu_frac in mu_fracs:
            result_reg = sm.OLS(train_set_vals, train_set_pred).fit_regularized(alpha=lambda_par, L1_wt=mu_frac)
            coeffs = result_reg.params
            # Accumulate the squared prediction error, averaged over the n folds.
            cross_val_errors[(lambda_par, mu_frac)] += ((np.dot(test_set_pred, coeffs) - test_set_val)**2) / float(n)
    # Restore the held-out observation before the next fold.
    train_set_pred[ind_i] = test_set_pred
    train_set_vals[ind_i] = test_set_val
# Round the accumulated errors for readability.
for key in cross_val_errors.keys():
    cross_val_errors[key] = np.round(cross_val_errors[key], 2)
cross_val_df = pd.DataFrame.from_dict(cross_val_errors, orient='index')
print(cross_val_df)
# Find the (alpha, L1_wt) configuration with the smallest cross validation error.
minimum_cross_val_error = float('inf')
arg_min = None
for key in cross_val_errors.keys():
    if cross_val_errors[key] <= minimum_cross_val_error:
        minimum_cross_val_error = cross_val_errors[key]
        arg_min = key

# Translate (alpha, L1_wt) back into the lambda and mu of our objective:
# the L1 weight is alpha * L1_wt and the L2 weight is alpha * (1 - L1_wt).
print("The minimum cross validation error is: " + str(minimum_cross_val_error) + " with lambda=" + str(arg_min[0]*arg_min[1]) + " and mu = " + str(arg_min[0]*(1.0-arg_min[1])))
In this particular case it seems that cross validation tells us we can do no better by adding regularization. However, we did not test a very large number of parameter values; we can further refine our search using cross validation to see if we can improve our fit.
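As a sketch of what such a refinement might look like (the grids below are arbitrary choices, not tuned to this data), we could zoom in with a finer, logarithmically spaced set of penalty weights and rerun the same leave-one-out loop:
# A finer, log-spaced grid of total penalty weights and a denser grid of
# L1 fractions; these can replace lambda_vals and mu_fracs in the loop above.
fine_lambda_vals = np.logspace(-2, 1, num=10)  # 0.01 up to 10.0
fine_mu_fracs = np.linspace(0.0, 1.0, num=11)  # 0.0, 0.1, ..., 1.0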
Consider the following data, which we are told is generated from a continuous distribution:
data = [4.45559267, 6.74293422, -20.14864735, 3.49075864, 1.3224875,
-0.0674377, 2.03760975, 3.89168089, 4.30622691, 5.24243778,
4.47118085, 6.38819759, 2.31717906, 9.23465033, -3.71355397,
6.25079944, 4.45762639, 5.3745804, 1.18938308, 2.22251412,
7.76006157, 13.27124009, 3.59884813, 6.45666895, 9.19240938,
9.67072606, -0.31431053, -2.64892822, 1.76057693, 9.23526739,
0.12167676, 4.48484103, 12.4229351, 5.2058985, 6.45232824,
3.61880038, 6.40043509, 8.31918962, -4.92504348, 7.08117382,
16.41249085, 7.05750976, 3.2652978, 7.59222402, 0.26693817,
4.66984232, 10.09794026, 3.40297516, 3.34271601, 0.95943834]
Suppose we are interested in estimating the distribution of this data. One method we could use is to compute the empirical cdf of the data; what is the method for doing so?
Recall the formula for the empirical cdf: $$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{x_i \leq x \} $$
Using this formula we can compute the empirical cdf as follows:
import matplotlib.pyplot as plt
%matplotlib inline

data = np.array(data)
data.sort()  # sort in place (not required for the loop below, but convenient later)

# Evaluate the empirical cdf on a grid: at each grid point, count the
# fraction of observations less than or equal to that point.
empirical_cdf_x = np.linspace(-30, 30, num=100)
empirical_cdf = np.zeros(100)
for ind in range(100):
    val = empirical_cdf_x[ind]
    for observation in data:
        if observation <= val:
            empirical_cdf[ind] += 1.0 / len(data)

plt.figure()
plt.plot(empirical_cdf_x, empirical_cdf)
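The double loop above is easy to read but scales poorly with the number of observations and grid points. A sketch of an equivalent vectorized computation, using np.searchsorted on the sorted data:
# With side='right', searchsorted returns, for each grid point, the number
# of sorted observations <= that point; dividing by n gives the ecdf.
empirical_cdf_vec = np.searchsorted(data, empirical_cdf_x, side='right') / float(len(data))

plt.figure()
plt.plot(empirical_cdf_x, empirical_cdf_vec)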
While this cdf is useful for various purposes (such as random sampling and simulation), we also want an estimate of the underlying probability density. Finding the density function is harder, since we are estimating a continuous distribution. One technique for estimating the density is the histogram: we construct buckets, each with an upper and lower bound, and count the number of observations that fall in each bucket. One thing to note is that it is very difficult to know how many buckets to use and where their cutoffs should be; in practice we examine several different histogram schemes and choose the one that gives the most reasonable results. Below are several histograms generated from our data with various bin counts, plus one "automated" bin count using the heuristics built into matplotlib:
%matplotlib inline
plt.figure()
plt.hist(data, bins=1)        # far too coarse: a single bucket
plt.figure()
plt.hist(data, bins=10)
plt.figure()
plt.hist(data, bins=50)       # likely too fine for only 50 observations
plt.figure()
plt.hist(data, bins='auto')   # bin count chosen by matplotlib's heuristic
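One note on comparing these plots with density estimates: by default plt.hist shows raw counts. Passing density=True rescales the bars so that the histogram integrates to one, making it directly comparable to the kernel density estimate we compute next:
# Normalize the bar heights so the histogram integrates to 1.
plt.figure()
plt.hist(data, bins='auto', density=True)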
Another useful tool for estimating the density function of our data is a kernel estimator. Recall that a suitable kernel function $\frac{1}{h}K(\frac{u}{h})$ with bandwidth $h$ has the following properties:
1. $K(u) \geq 0$ for all $u$ (nonnegativity),
2. $K(u) = K(-u)$ (symmetry about zero),
3. $K$ attains its maximum at $u = 0$,
4. $\int K(u)\,du = 1$, so that the resulting estimate is itself a valid density.
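As a quick numerical sanity check of two of these properties for the standard Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}}e^{-u^2/2}$ (a sketch using scipy's quadrature and normal pdf):
from scipy import integrate, stats

# Property 4: the standard Gaussian kernel integrates to one.
total_mass, _ = integrate.quad(stats.norm.pdf, -np.inf, np.inf)
print(total_mass)  # approximately 1.0

# Property 2: the kernel is symmetric about zero.
print(stats.norm.pdf(1.5), stats.norm.pdf(-1.5))  # equal values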
There are many suitable kernel functions, and in general it is up to our discretion as analysts which one to use for forming our density estimate. One popular kernel function is the Gaussian kernel: $ \frac{1}{h\sqrt{2\pi}(\Phi(h) - \Phi(-h))}\exp\left(\frac{-(x-x_i)^2}{2h^2}\right)$. Here $\Phi(\cdot)$ is the cdf of a standard normal distribution and is included to ensure that the truncated integral satisfies property 4 above. As with the histograms, choosing a good value for the bandwidth $h$ is key to a useful analysis of the data. While there are many heuristics for choosing this parameter, it is often better to calibrate it manually and use our best judgement. Using this kernel function we can compute the following kernel density estimate for our data:
import numpy as np
from scipy import stats

# bw_method=None falls back on scipy's default bandwidth heuristic (Scott's rule).
kernel = stats.gaussian_kde(data, bw_method=None)

# Evaluate the density estimate on a grid and plot it; gaussian_kde objects
# are callable and accept an array of evaluation points.
kernel_x_axis = np.linspace(-30, 30, num=100)
kernel_vals = kernel(kernel_x_axis)

plt.figure()
plt.plot(kernel_x_axis, kernel_vals)
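To see how sensitive the estimate is to the bandwidth, gaussian_kde also accepts a scalar bw_method, which is used directly as the estimator's bandwidth scaling factor. A sketch comparing a few arbitrary values:
# Overlay density estimates for several bandwidth scaling factors
# (the values are illustrative, not tuned to this data).
plt.figure()
for factor in [0.1, 0.3, 1.0]:
    kde = stats.gaussian_kde(data, bw_method=factor)
    plt.plot(kernel_x_axis, kde(kernel_x_axis), label='factor = ' + str(factor))
plt.legend()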