IEOR 165 Discussion 5: Bias Variance Tradeoff and MAP Estimation

In previous discussions we covered linear models and maximum likelihood estimation. In this discussion we will analyze the performance of these techniques through their bias and variance, and generalize them with MAP estimation. We will also see how these notions can be used to perform model selection.

Example 1: Bias and Variance Tradeoff for MLE

Suppose we have a data set $X_1,X_2,...,X_n$ to which we want to fit some statistical model with a parameter $\beta$, and let $\hat{\beta}$ be an estimator of $\beta$. Recall from lecture that to analyze the quality of this estimate we can consider its bias and its variance, defined as: $$\text{bias}(\hat{\beta}) = \mathbb{E}[\hat{\beta}] - \beta $$ $$ \text{Var}(\hat{\beta}) = \mathbb{E}[\hat{\beta}^2] -(\mathbb{E}[\hat{\beta}])^2$$

$\text{bias}(\hat{\beta})$ measures whether we would estimate the correct parameter on average, while $\text{Var}(\hat{\beta})$ measures how sensitive our estimate is to noise in the data. Recall that $\text{bias}(\hat{\beta})^2 + \text{Var}(\hat{\beta}) = \mathbb{E}[(\hat{\beta}-\beta)^2]$, the mean squared error (MSE), which forms the basis for the tradeoff between these two terms. Let's use these notions to analyze some estimates formed from parametric models.
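To see where this decomposition comes from, expand the MSE around $\mathbb{E}[\hat{\beta}]$: $$\mathbb{E}[(\hat{\beta}-\beta)^2] = \mathbb{E}\big[(\hat{\beta}-\mathbb{E}[\hat{\beta}] + \mathbb{E}[\hat{\beta}]-\beta)^2\big] = \underbrace{\mathbb{E}\big[(\hat{\beta}-\mathbb{E}[\hat{\beta}])^2\big]}_{\text{Var}(\hat{\beta})} + \underbrace{\big(\mathbb{E}[\hat{\beta}]-\beta\big)^2}_{\text{bias}(\hat{\beta})^2},$$ since the cross term $2\big(\mathbb{E}[\hat{\beta}]-\beta\big)\,\mathbb{E}\big[\hat{\beta}-\mathbb{E}[\hat{\beta}]\big]$ vanishes.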

Let $X_1,X_2,...,X_n$ be i.i.d. random variables such that $X_i \sim \text{Exp}(\lambda)$. What is the MLE estimate of the parameter $\lambda$?

We note that $\mathcal{L}(\lambda) = \lambda^n \exp(-\lambda \sum_{i=1}^nX_i)$, so taking the negative log-likelihood we obtain that the optimization problem we want to solve is: $$\min\limits_{\lambda>0} \lambda \sum_{i=1}^nX_i - n\ln \lambda $$

Taking the derivative and setting it equal to zero gives $\sum_{i=1}^nX_i - \frac{n}{\hat{\lambda}} = 0$, so $\hat{\lambda}_{mle} = \frac{n}{\sum_{i=1}^nX_i}$. What is the bias of this MLE estimate?

Recall $\text{bias}(\hat{\lambda}_{mle}) = \mathbb{E}[\hat{\lambda}_{mle}] - \lambda$, so we need to compute $\mathbb{E}[\hat{\lambda}_{mle}] = \mathbb{E}[\frac{n}{\sum_{i=1}^nX_i}]$. Note that the joint likelihood we computed previously is the joint density of the $X_i$'s, so we want to evaluate the multiple integral: $$\int_0^\infty \int_0^\infty \cdots \int_0^\infty \frac{n\lambda^n}{\sum_{i=1}^nx_i}\exp\Big(-\lambda \sum_{i=1}^nx_i\Big)\, dx_1\, dx_2 \cdots dx_n$$

Let's define the variable $Y = \sum_{i=1}^nX_i$. The sum of $n$ i.i.d. exponentials follows an Erlang distribution (a special case of the gamma distribution), so $Y \sim \text{Erlang}(n,\lambda)$, which has pdf $f(y) = \frac{\lambda^ny^{n-1}\exp(-\lambda y)}{(n-1)!}$. Using this change of variable we can write the above integral as: $$n \int_0^\infty \frac{y^{-1}y^{n-1}\lambda^n\exp(-\lambda y)}{(n-1)!}\, dy = \frac{n\lambda}{n-1}\int_0^\infty \frac{y^{n-2}\lambda^{n-1}\exp(-\lambda y)}{(n-2)!}\, dy = \frac{n\lambda}{n-1},$$ where the last equality holds because the remaining integrand is the $\text{Erlang}(n-1,\lambda)$ density, which integrates to 1.

So the bias is given by $\frac{n}{(n-1)}\lambda -\lambda = \frac{1}{n-1}\lambda$. Is this estimator unbiased? What is the variance of this estimator?

Using the same change of variable argument we can write $\text{Var}(\hat{\lambda}_{mle}) = n^2\text{Var}(\frac{1}{Y}) = n^2(\mathbb{E}[\frac{1}{Y^2}] - \mathbb{E}[\frac{1}{Y}]^2)$. Since we already calculated $\mathbb{E}[\frac{1}{Y}]$, we simply need to compute the second moment: $$ \mathbb{E}\Big[\frac{1}{Y^2}\Big] = \int_0^\infty \frac{y^{n-3}\lambda^n\exp(-\lambda y)}{(n-1)!}\, dy = \frac{\lambda^{2}}{(n-1)(n-2)} \int_0^\infty \frac{y^{n-3}\lambda^{n-2}\exp(-\lambda y)}{(n-3)!}\, dy = \frac{\lambda^{2}}{(n-1)(n-2)},$$ where again the remaining integral equals 1 since the integrand is the $\text{Erlang}(n-2,\lambda)$ density.

So: $$\text{Var}(\hat{\lambda}_{mle}) = n^2(\frac{\lambda^{2}}{(n-1)(n-2)} - \frac{\lambda^2}{(n-1)^2} )= \frac{n^2\lambda^2}{(n-1)^2(n-2)}$$

What is the mean squared error of this estimator?

Let's use the formula: $\text{MSE}(\hat{\lambda}_{mle}) = \text{bias}(\hat{\lambda}_{mle})^2 + \text{Var}(\hat{\lambda}_{mle}) =\frac{\lambda^2}{(n-1)^2}+ \frac{n^2\lambda^2}{(n-1)^2(n-2)} = \frac{\lambda^2(n^2 + n - 2)}{(n-1)^2(n-2)} = \frac{\lambda^2(n+ 2)}{(n-1)(n-2)}$, where the last step uses $n^2+n-2 = (n+2)(n-1)$.
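As a sanity check, here is a small Monte Carlo sketch (assuming numpy is available; the sample size n and the true lambda below are arbitrary illustrative values, not part of the original discussion) that compares the empirical bias, variance, and MSE of $\hat{\lambda}_{mle}$ against the formulas above.

import numpy as np

rng = np.random.default_rng(0)
n, lam, trials = 10, 2.0, 200_000           # arbitrary illustrative values

# each row is one simulated data set of n i.i.d. Exp(lambda) samples
samples = rng.exponential(scale=1.0 / lam, size=(trials, n))
lam_mle = n / samples.sum(axis=1)            # MLE for each simulated data set

print("empirical bias:", lam_mle.mean() - lam, " theory:", lam / (n - 1))
print("empirical var: ", lam_mle.var(), " theory:", n**2 * lam**2 / ((n - 1)**2 * (n - 2)))
print("empirical MSE: ", ((lam_mle - lam)**2).mean(), " theory:", lam**2 * (n + 2) / ((n - 1) * (n - 2)))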

Suppose instead we want to construct an unbiased estimator of the parameter $\lambda$. How could we do this using $\hat{\lambda}_{mle}$?

One simple way of doing this is by scaling. Recall $\mathbb{E}[\hat{\lambda}_{mle}] = \frac{n\lambda}{(n-1)}$, so if we let $\hat{\lambda}_{unb} = \frac{n-1}{n} \hat{\lambda}_{mle}$ we can clearly see we have an unbiased estimator. What is the variance of this estimator?

Using properties of variance: $\text{Var}(\hat{\lambda}_{unb}) = (\frac{n-1}{n})^2 \text{Var}(\hat{\lambda}_{mle}) = \frac{\lambda^2}{n-2}$. What is the mean squared error of this new estimator? How does it compare to the mean squared error of the previous estimator?

Since this is an unbiased estimator, the MSE is simply the variance. We note that $\frac{\lambda^2(n+ 2)}{(n-1)(n-2)} - \frac{\lambda^2}{n-2} = \frac{3\lambda^2}{(n-1)(n-2)} > 0$, so we have decreased the MSE relative to the MLE. This suggests that we can form a shrinkage estimator of the form $\hat{\lambda}_{mse} = k \hat{\lambda}_{mle}$ which minimizes the MSE. What is this estimator?

Let's plug this shrinkage form into the MSE minimization problem:

$$\min_k \mathbb{E}[(k\hat{\lambda}_{mle} - \lambda)^2] = \min_k k^2 \mathbb{E}[\hat{\lambda}_{mle}^2] - 2k\lambda\mathbb{E}[\hat{\lambda}_{mle}] + \lambda^2 $$

Substituting $\mathbb{E}[\hat{\lambda}_{mle}] = \frac{n\lambda}{n-1}$ and $\mathbb{E}[\hat{\lambda}_{mle}^2] = \frac{n^2\lambda^2}{(n-1)(n-2)}$, the objective becomes $\frac{\lambda^2}{(n-1)(n-2)}\big(k^2n^2 - 2kn(n-2)\big) + \lambda^2$, so this problem has the same minimizer as $\min_k k^2n^2 -2k(n-2)n $. Taking the derivative and setting it to zero shows that $k = \frac{n-2}{n}$. Is this new estimator unbiased? What is the MSE of this estimator?

Clearly this estimator is biased, since $\mathbb{E}[\hat{\lambda}_{mse}] = \frac{n-2}{n} \mathbb{E}[\hat{\lambda}_{mle}] = \frac{n-2}{n-1}\lambda \neq \lambda$, so its bias is $-\frac{\lambda}{n-1}$. The variance is given by $\text{Var}(\hat{\lambda}_{mse}) = (\frac{n-2}{n})^2\text{Var}(\hat{\lambda}_{mle}) = \frac{(n-2)\lambda^2}{(n-1)^2}$. Hence our mean squared error is: $$\text{MSE}(\hat{\lambda}_{mse}) = \frac{\lambda^2}{(n-1)^2} + \frac{(n-2)\lambda^2}{(n-1)^2} = \frac{\lambda^2}{n-1}$$
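To see the tradeoff in action, here is a small simulation sketch (again assuming numpy, with arbitrary illustrative values of n and lambda) comparing the empirical MSE of the three estimators; the MSE-optimal shrinkage estimator should come out lowest, matching $\frac{\lambda^2}{n-1} < \frac{\lambda^2}{n-2} < \frac{\lambda^2(n+2)}{(n-1)(n-2)}$.

import numpy as np

rng = np.random.default_rng(1)
n, lam, trials = 10, 2.0, 200_000            # arbitrary illustrative values

samples = rng.exponential(scale=1.0 / lam, size=(trials, n))
mle = n / samples.sum(axis=1)                # MLE
unb = (n - 1) / n * mle                      # unbiased rescaling
shrunk = (n - 2) / n * mle                   # MSE-optimal shrinkage

for name, est, theory in [("MLE     ", mle, lam**2 * (n + 2) / ((n - 1) * (n - 2))),
                          ("unbiased", unb, lam**2 / (n - 2)),
                          ("shrunk  ", shrunk, lam**2 / (n - 1))]:
    print(name, "empirical MSE:", ((est - lam) ** 2).mean(), " theory:", theory)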

Example 2: Selecting a Sparse Model Using MAP and Regularization

In this example we will explore using regularization and MAP to calibrate temperature sensors in a room. In particular, we will consider how these methods can be used for model selection. Suppose we have placed 7 sensors in an array in a room and used an independent thermometer to record the temperature in the room every hour for 50 hours. Below are the first 20 rows of the data we obtained from these measurements:

In [28]:
import pandas as pd
temp_data = pd.read_csv('temp_data.csv')
print(temp_data.head(20))
    sensor1  sensor2  sensor3  sensor4  sensor5  sensor6  sensor7  temperature
0     56.68    45.54    40.77    53.07    44.37    42.88    48.35        42.13
1     68.81    67.65    67.15    61.08    68.52    78.69    68.93        68.34
2     40.00    45.93    48.47    75.44    58.78    62.72    56.66        51.09
3     52.09    51.06    50.62    54.29    50.92    48.13    51.28        45.83
4     45.87    55.52    59.66    76.34    61.44    50.09    57.99        48.97
5     43.69    42.60    42.13    64.93    54.74    69.75    55.12        53.10
6     47.45    58.31    62.96    40.63    53.59    47.82    49.72        46.30
7     53.82    48.26    45.87    77.18    58.04    63.25    60.03        54.08
8     55.87    61.26    63.57    67.64    68.40    78.80    66.47        65.12
9     61.55    66.06    67.99    79.89    72.44    73.87    70.82        65.97
10    56.77    47.89    44.09    46.89    46.16    49.59    49.34        45.61
11    67.41    59.82    56.56    45.49    54.59    59.75    57.30        55.52
12    48.18    61.90    67.78    77.30    69.42    64.80    64.52        58.35
13    75.12    62.14    56.57    67.87    63.54    73.16    68.18        65.34
14    41.10    41.73    42.00    42.64    43.23    46.27    43.00        39.65
15    66.82    63.05    61.44    70.22    58.46    40.74    59.80        49.28
16    56.69    63.59    66.55    70.16    61.52    42.80    59.05        49.05
17    62.35    61.12    60.60    76.92    64.39    59.45    64.83        57.46
18    45.62    68.13    77.78    68.46    72.07    64.25    64.03        59.04
19    47.92    58.80    63.46    44.97    58.66    62.75    54.78        53.94

Given that this data is continuous, it seems reasonable to assume that we can use a linear model. What should we use as our response and what should be our predictors?

Let's set temperature as our response and the sensor readings as the predictors. So a potential linear model we can fit is of the form: $$\text{temperature}_i = \sum_{j=1}^7 \beta_j \cdot \text{sensor}j_i + \beta_0 + \epsilon_i $$

As before we can use linear regression to fit this model to our data and we obtain the following regression results:

In [7]:
import statsmodels.api as sm
import numpy as np

# add a column of ones so the model includes an intercept term
temp_data['intercept'] = np.ones(len(temp_data))

# use the first 49 rows (as in the regression output below)
X = temp_data[['sensor1','sensor2','sensor3','sensor4','sensor5','sensor6','sensor7','intercept']][:-1]
Y = temp_data['temperature'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()

result = sm.OLS(Y, X).fit()
result.summary()
Out[7]:
OLS Regression Results
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.792e+07
Date: Wed, 15 Feb 2017 Prob (F-statistic): 1.36e-139
Time: 14:03:05 Log-Likelihood: 224.60
No. Observations: 49 AIC: -433.2
Df Residuals: 41 BIC: -418.1
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
x1 0.3801 0.054 7.073 0.000 0.272 0.489
x2 -0.2437 0.140 -1.747 0.088 -0.525 0.038
x3 0.4481 0.129 3.480 0.001 0.188 0.708
x4 0.1423 0.054 2.642 0.012 0.034 0.251
x5 0.0389 0.130 0.299 0.767 -0.224 0.302
x6 0.5424 0.054 10.070 0.000 0.434 0.651
x7 -0.3081 0.131 -2.348 0.024 -0.573 -0.043
const -4.2214 0.004 -1128.829 0.000 -4.229 -4.214
Omnibus: 1.580 Durbin-Watson: 2.883
Prob(Omnibus): 0.454 Jarque-Bera (JB): 1.145
Skew: -0.088 Prob(JB): 0.564
Kurtosis: 2.272 Cond. No. 8.57e+04

Looking at these regression results what do we think about the outcome of this regression?

Seeing as the $R^2$ is 1.000, we should be wary that we are overfitting the model; that is, we may be giving the model too much flexibility and fitting noise rather than a true trend. In fact, some factors in our model have negative coefficients while others have positive ones, as if the linear model is subtracting correlated variables from one another. One way of avoiding overfitting is to perform model selection: instead of using all the sensors, we use only the data from some subset of them. While there are many methods for doing this, one of the preferred methods in applications is regularization.

Suppose we place a Laplacian prior on the coefficients; in lecture we saw that the resulting MAP estimate is equivalent to solving the following optimization problem: $$ \min \sum_{i=1}^{50} (\text{temperature}_i - (\sum_{j=1}^7 \beta_j \cdot \text{sensor}j_i + \beta_0))^2 + \lambda\sum_{j =1}^7|\beta_j|$$
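As a reminder of where this comes from, one standard derivation assumes Gaussian noise $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$ and independent $\text{Laplace}(0, b)$ priors on the sensor coefficients $\beta_1,\dots,\beta_7$. The MAP estimate maximizes the posterior, or equivalently minimizes its negative log (dropping constants): $$\hat{\beta}_{MAP} = \arg\min_\beta \; \frac{1}{2\sigma^2}\sum_{i=1}^{50} \Big(\text{temperature}_i - \big(\sum_{j=1}^7 \beta_j \cdot \text{sensor}j_i + \beta_0\big)\Big)^2 + \frac{1}{b}\sum_{j=1}^7 |\beta_j|,$$ which matches the problem above with $\lambda = \frac{2\sigma^2}{b}$.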

A nice property of this regularization is that increasing $\lambda$ shrinks the coefficients of irrelevant variables toward zero, effectively decreasing the number of factors we include in our model. Using the same Python package as before, we can fit a lasso regression to our model as follows:

In [19]:
result_lasso = sm.OLS(Y,X).fit_regularized(alpha=100)
result_lasso.summary()
Out[19]:
OLS Regression Results
Dep. Variable: y R-squared: 0.940
Model: OLS Adj. R-squared: 0.930
Method: Least Squares F-statistic: 91.52
Date: Wed, 15 Feb 2017 Prob (F-statistic): 5.71e-23
Time: 14:32:25 Log-Likelihood: -96.507
No. Observations: 49 AIC: 209.0
Df Residuals: 41 BIC: 224.1
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
x1 0.1594 37.615 0.004 0.997 -75.806 76.125
x2 0.1197 97.870 0.001 0.999 -197.533 197.772
x3 0.0491 90.344 0.001 1.000 -182.405 182.503
x4 0.0127 37.749 0.000 1.000 -76.224 76.249
x5 0.1326 91.308 0.001 0.999 -184.267 184.533
x6 0.4157 37.756 0.011 0.991 -75.833 76.665
x7 0.0124 91.421 0.000 1.000 -184.615 184.640
const 0 0 nan nan 0 0
Omnibus: 0.545 Durbin-Watson: 0.285
Prob(Omnibus): 0.761 Jarque-Bera (JB): 0.638
Skew: -0.225 Prob(JB): 0.727
Kurtosis: 2.669 Cond. No. 8.57e+04

Looking at these coefficients, how do they compare to the values we obtained before? How would you judge the results of this regression compared to the previous model?
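For comparison, here is a sketch of the same lasso fit using scikit-learn's Lasso (this assumes scikit-learn is installed and is not part of the original analysis; note that sklearn scales its penalty as roughly $\alpha = \lambda/(2n)$ relative to the objective above, so the value below is only illustrative). Its solver sets sufficiently small coefficients exactly to zero, which makes the selected sensors easy to read off.

from sklearn.linear_model import Lasso

sensors = ['sensor1', 'sensor2', 'sensor3', 'sensor4', 'sensor5', 'sensor6', 'sensor7']
X_skl = temp_data[sensors][:-1]              # same 49 observations as before
y_skl = temp_data['temperature'][:-1]

# alpha is illustrative; roughly alpha = lambda / (2 * n) relative to the objective above
lasso_skl = Lasso(alpha=1.0, fit_intercept=True)
lasso_skl.fit(X_skl, y_skl)

print(dict(zip(sensors, lasso_skl.coef_.round(3))))
print('intercept:', round(lasso_skl.intercept_, 3))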

We can cut off attributes from our model by adding some threshold value $\zeta$: we only include predictor $x_j$ in the model if $\hat{\beta}_j \geq \zeta$. Setting $\zeta = 0.05$, we can see how many factors we would include in our model as a function of $\lambda$.

In [26]:
lambdas = np.linspace(0, 800, 700)
zeta = 0.05
num_vars = []
mse = []
for lambd in lambdas:
    result_lasso = sm.OLS(Y, X).fit_regularized(alpha=lambd)
    coeffs = result_lasso.params
    # count the coefficients that survive the threshold zeta
    num_pos = 0
    for beta in coeffs:
        if beta > zeta:
            num_pos += 1
    num_vars.append(num_pos)
    # mean squared error of the regularized fit
    mse.append(np.mean((Y - X @ coeffs) ** 2))
In [24]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
plt.plot(lambdas,num_vars)
Out[24]:
[Figure: number of predictors with coefficient above $\zeta$ versus $\lambda$]

Let's compare this plot to a plot of the MSE of our fit.

In [27]:
plt.figure()
plt.plot(lambdas,mse)
Out[27]:
[Figure: MSE of the regularized fit versus $\lambda$]

What can we say about our data by considering these two plots?

In future discussions we will consider other techniques for model selection that help us avoid overfitting.