In previous discussions we covered linear models and maximum likelihood estimation. In this discussion we will analyze the performance of these techniques using bias and variance, and generalize them with MAP estimation. We will then see how these notions can be used to perform model selection.
Suppose we have a data set $X_1,X_2,...,X_n$ to which we want to fit some statistical model with a parameter $\beta$. Let $\hat{\beta}$ be an estimator of $\beta$. Recall from lecture that to analyze the quality of this estimate we can consider its bias and its variance, defined as: $$\text{bias}(\hat{\beta}) = \mathbb{E}[\hat{\beta}] - \beta $$ $$ \text{Var}(\hat{\beta}) = \mathbb{E}[\hat{\beta}^2] -(\mathbb{E}[\hat{\beta}])^2$$
$\text{bias}(\hat{\beta})$ measures whether we would estimate the correct parameters on average, while $\text{Var}(\hat{\beta})$ measures how sensitive our estimate is to noise in the data. Recall that $\text{bias}(\hat{\beta})^2 + \text{Var}(\hat{\beta}) = \mathbb{E}[(\hat{\beta}-\beta)^2]$, the mean squared error, which forms the basis for the tradeoff between these two terms. Let's use these notions to analyze some estimates formed from parametric models.
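To see where this decomposition comes from, expand the mean squared error around $\mathbb{E}[\hat{\beta}]$ (treating the true $\beta$ as a fixed constant): $$\mathbb{E}[(\hat{\beta}-\beta)^2] = \mathbb{E}[(\hat{\beta} - \mathbb{E}[\hat{\beta}] + \mathbb{E}[\hat{\beta}] - \beta)^2] = \text{Var}(\hat{\beta}) + 2(\mathbb{E}[\hat{\beta}] - \beta)\,\mathbb{E}[\hat{\beta} - \mathbb{E}[\hat{\beta}]] + \text{bias}(\hat{\beta})^2 = \text{Var}(\hat{\beta}) + \text{bias}(\hat{\beta})^2,$$ where the cross term vanishes because $\mathbb{E}[\hat{\beta} - \mathbb{E}[\hat{\beta}]] = 0$.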
Let $X_1,X_2,...,X_n$ be i.i.d. random variables such that $X_i \sim \text{Exp}(\lambda)$. What is the MLE estimate of the parameter $\lambda$?
We note that $\mathcal{L}(\lambda) = \lambda^n \exp(-\lambda \sum_{i=1}^nX_i)$, so taking the negative log-likelihood we obtain that the optimization problem we want to solve is: $$\min\limits_{\lambda>0} \lambda \sum_{i=1}^nX_i - n\ln \lambda $$
Taking the derivative and setting it equal to zero gives $\hat{\lambda}_{mle} = \frac{n}{\sum_{i=1}^nX_i}$. What is the bias of this MLE estimate?
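As a quick aside, we can sanity-check this closed form on simulated data. Below is a minimal sketch; the sample size, true rate, and seed are arbitrary choices for illustration:

import numpy as np
np.random.seed(0)
n, lam = 20, 2.0                                      # hypothetical sample size and true rate
X_sim = np.random.exponential(scale=1/lam, size=n)    # numpy parameterizes the exponential by scale = 1/lambda
lam_mle = n / X_sim.sum()                             # the closed-form MLE derived above
print(lam_mle)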
Recall $\text{bias}(\hat{\lambda}_{mle}) = \mathbb{E}[\hat{\lambda}_{mle}] - \lambda$, so we need to compute $\mathbb{E}[\hat{\lambda}_{mle}] = \mathbb{E}[\frac{n}{\sum_{i=1}^nX_i}]$. Note that the likelihood we computed previously is the joint density of the $X_i$'s, so we want to evaluate the multiple integral: $$\int_0^\infty \int_0^\infty \cdots \int_0^\infty \frac{n\lambda^n}{\sum_{i=1}^nx_i}\exp\Big(-\lambda \sum_{i=1}^nx_i\Big)\, dx_1\, dx_2 \cdots dx_n$$
Let's define $Y = \sum_{i=1}^nX_i$. The sum of $n$ i.i.d. exponentials follows an Erlang distribution (a special case of the gamma distribution), so $Y \sim \text{Erlang}(n,\lambda)$, which has pdf $f(y) = \frac{\lambda^ny^{n-1}\exp(-\lambda y)}{(n-1)!}$. Using this change of variable we can write the above integral as: $$n \int_0^\infty \frac{y^{-1}y^{n-1}\lambda^n\exp(-\lambda y)}{(n-1)!} dy = \frac{n\lambda}{(n-1)}\int_0^\infty \frac{y^{n-2}\lambda^{n-1}\exp(-\lambda y)}{(n-2)!} dy =$$ $$ \frac{n\lambda}{(n-1)} $$
So the bias is given by $\frac{n}{(n-1)}\lambda -\lambda = \frac{1}{n-1}\lambda$. Is this estimator unbiased? What is the variance of this estimator?
Using the same change of variable argument we can write $\text{Var}(\hat{\lambda}_{mle}) = n^2\text{Var}(\frac{1}{Y}) = n^2(\mathbb{E}[\frac{1}{Y^2}] - \mathbb{E}[\frac{1}{Y}]^2)$. Since we already calculated $\mathbb{E}[\frac{1}{Y}]$ we simply need to compute the second moment: $$ \mathbb{E}[\frac{1}{Y^2}] = \int_0^\infty \frac{y^{n-3}\lambda^n\exp(-\lambda y)}{(n-1)!} dy = \frac{\lambda^{2}}{(n-1)(n-2)} \int_0^\infty \frac{y^{n-3}\lambda^{n-2}\exp(-\lambda y)}{(n-3)!} dy = \frac{\lambda^{2}}{(n-1)(n-2)}$$
So: $$\text{Var}(\hat{\lambda}_{mle}) = n^2(\frac{\lambda^{2}}{(n-1)(n-2)} - \frac{\lambda^2}{(n-1)^2} )= \frac{n^2\lambda^2}{(n-1)^2(n-2)}$$
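These formulas are easy to check numerically. The sketch below (the sample size, rate, and number of Monte Carlo replications are arbitrary illustrative choices) repeatedly draws samples, computes $\hat{\lambda}_{mle}$ for each, and compares the empirical mean and variance to $\frac{n\lambda}{n-1}$ and $\frac{n^2\lambda^2}{(n-1)^2(n-2)}$:

import numpy as np
np.random.seed(0)
n, lam, trials = 20, 2.0, 200000                          # hypothetical settings for the check
samples = np.random.exponential(scale=1/lam, size=(trials, n))
lam_hats = n / samples.sum(axis=1)                        # MLE for each simulated data set
print(lam_hats.mean(), n*lam/(n - 1))                     # empirical vs. theoretical mean
print(lam_hats.var(), n**2*lam**2/((n - 1)**2*(n - 2)))   # empirical vs. theoretical variance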
What is the mean squared error of this estimator?
Let's use the formula: $\text{MSE}(\hat{\lambda}_{mle}) = \text{bias}(\hat{\lambda}_{mle})^2 + \text{Var}(\hat{\lambda}_{mle}) =\frac{\lambda^2}{(n-1)^2}+ \frac{n^2\lambda^2}{(n-1)^2(n-2)} = \frac{\lambda^2(n^2 + n - 2)}{(n-1)^2(n-2)} = \frac{\lambda^2(n+ 2)}{(n-1)(n-2)}$
Suppose instead we want to construct an unbiased estimator of the parameter $\lambda$; how could we do this using $\hat{\lambda}_{mle}$?
One simple way of doing this is by scaling. Recall $\mathbb{E}[\hat{\lambda}_{mle}] = \frac{n\lambda}{(n-1)}$, so if we let $\hat{\lambda}_{unb} = \frac{n-1}{n} \hat{\lambda}_{mle}$ we can clearly see we have an unbiased estimator. What is the variance of this estimator?
Using properties of variance: $\text{Var}(\hat{\lambda}_{unb}) = (\frac{n-1}{n})^2 \text{Var}(\hat{\lambda}_{mle}) = \frac{\lambda^2}{n-2}$. What is the mean squared error of this new estimator? How does it compare to the mean squared error of the previous estimator?
Since this is an unbiased estimator the MSE is simply the variance. We note that $\frac{\lambda^2(n+ 2)}{(n-1)(n-2)} - \frac{\lambda^2}{n-2} = \frac{3\lambda^2}{(n-1)(n-2)} > 0$, so we have decreased the MSE. This suggests that we can form a shrinkage estimator of the form $\hat{\lambda}_{mse} = k \hat{\lambda}_{mle}$ which minimizes the MSE. What is this estimator?
Let's plug this shrinkage form into the MSE problem:
$$\min_k \mathbb{E}[(k\hat{\lambda}_{mle} - \lambda)^2] = \min_k k^2 \mathbb{E}[\hat{\lambda}_{mle}^2] - 2k\lambda\mathbb{E}[\hat{\lambda}_{mle}] + \lambda^2 $$Plugging in the moments we computed above and simplifying (the algebra is written out below) shows that this problem has the same minimizer as $\min_k k^2n^2 -2k(n-2)n $. Taking the derivative and setting it to zero shows that $k = \frac{n-2}{n}$. Is this new estimator unbiased? What is the MSE of this estimator?
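To spell out the simplification: substitute $\mathbb{E}[\hat{\lambda}_{mle}^2] = \frac{n^2\lambda^2}{(n-1)(n-2)}$ and $\mathbb{E}[\hat{\lambda}_{mle}] = \frac{n\lambda}{n-1}$, then multiply the objective by the positive constant $\frac{(n-1)(n-2)}{\lambda^2}$, which does not change the minimizer: $$k^2\frac{n^2\lambda^2}{(n-1)(n-2)} - \frac{2kn\lambda^2}{n-1} + \lambda^2 \;\longrightarrow\; k^2n^2 - 2kn(n-2) + (n-1)(n-2).$$ The constant term $(n-1)(n-2)$ can also be dropped without affecting the minimizer, leaving $\min_k k^2n^2 - 2kn(n-2)$.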
Clearly this estimator is biased since $\mathbb{E}[\hat{\lambda}_{mse}] = \frac{n-2}{n} \mathbb{E}[\hat{\lambda}_{mle}] = \frac{n-2}{n-1}\lambda$. The variance is given by $\text{Var}(\hat{\lambda}_{mse}) = (\frac{n-2}{n})^2\text{Var}(\hat{\lambda}_{mle}) = \frac{(n-2)\lambda^2}{(n-1)^2}$. Hence our mean squared error is: $$\text{MSE}(\hat{\lambda}_{mse}) = \frac{\lambda^2}{(n-1)^2} + \frac{(n-2)\lambda^2}{(n-1)^2} = \frac{\lambda^2}{n-1}$$
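So we have the ordering $\frac{\lambda^2}{n-1} < \frac{\lambda^2}{n-2} < \frac{\lambda^2(n+2)}{(n-1)(n-2)}$, i.e. the shrinkage estimator beats the unbiased estimator, which in turn beats the raw MLE. This can also be seen in simulation; a minimal sketch, reusing the hypothetical settings from the earlier check:

import numpy as np
np.random.seed(0)
n, lam, trials = 20, 2.0, 200000                    # hypothetical settings, as above
samples = np.random.exponential(scale=1/lam, size=(trials, n))
lam_mle = n / samples.sum(axis=1)
lam_unb = (n - 1) / n * lam_mle                     # unbiased rescaling
lam_mse = (n - 2) / n * lam_mle                     # MSE-optimal shrinkage
for name, est in [('mle', lam_mle), ('unbiased', lam_unb), ('shrinkage', lam_mse)]:
    print(name, np.mean((est - lam)**2))            # empirical MSE of each estimator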
In this example we will explore using regularization and MAP to calibrate temperature sensors in a room. In particular, we will consider how these methods can be used for model selection. Suppose we have placed 7 sensors in an array in a room and used an independent thermometer to record the temperature in the room every hour for 50 hours. Below is the data we obtain from these measurements:
import pandas as pd

# Load the hourly sensor readings and thermometer temperatures
temp_data = pd.read_csv('temp_data.csv')
print(temp_data.head(20))
Given that this data is continuous, it seems reasonable to assume that we can use a linear model. What should we use as our response and what should be our predictors?
Let's set temperature as our response and the sensor readings as the predictors. A potential linear model we can fit is then of the form: $$\text{temperature}_i = \sum_{j=1}^7 \beta_j \cdot \text{sensor}j_i + \beta_0 + \epsilon_i $$
As before we can use linear regression to fit this model to our data and we obtain the following regression results:
import statsmodels.api as sm
import numpy as np

# Build the design matrix: the 7 sensor readings plus an explicit intercept column
temp_data['intercept'] = np.ones((len(temp_data),))
X = temp_data[['sensor1','sensor2','sensor3','sensor4','sensor5','sensor6','sensor7','intercept']][:-1]
Y = temp_data['temperature'][:-1]
X = X.values   # convert to plain numpy arrays for statsmodels
Y = Y.values
result = sm.OLS(Y, X).fit()
result.summary()
Looking at these regression results what do we think about the outcome of this regression?
Seeing as the $R^2$ is 1.0, we should be wary that we are overfitting the model; that is, we may be giving our model too much flexibility and not identifying a true trend. In fact, we see that some factors in our model have negative coefficients while others have positive ones, as if the linear model is subtracting correlated variables from one another. One way of avoiding overfitting is to perform model selection: instead of using all the sensors we will use only the data from some subset of them. While there are many methods for doing this, one of the preferred methods in practice is regularization.
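Before regularizing, one quick diagnostic is to check how strongly the sensor readings are correlated with one another; a minimal sketch, reusing the temp_data frame loaded above:

sensor_cols = ['sensor1','sensor2','sensor3','sensor4','sensor5','sensor6','sensor7']
print(temp_data[sensor_cols].corr().round(2))   # pairwise correlations between the sensor readings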
Suppose we place a Laplace prior on our coefficients; in lecture we saw that the resulting MAP estimate is equivalent to solving the following optimization problem: $$ \min_{\beta} \sum_{i=1}^{50} (\text{temperature}_i - (\sum_{j=1}^7 \beta_j \cdot \text{sensor}j_i + \beta_0))^2 + \lambda\sum_{j =1}^7|\beta_j|$$
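To see this equivalence explicitly, assume Gaussian noise $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$ and independent Laplace priors $p(\beta_j) \propto \exp(-|\beta_j|/b)$ on the sensor coefficients. Then the negative log-posterior is, up to an additive constant, $$\frac{1}{2\sigma^2}\sum_{i=1}^{50} \Big(\text{temperature}_i - \big(\sum_{j=1}^7 \beta_j \cdot \text{sensor}j_i + \beta_0\big)\Big)^2 + \frac{1}{b}\sum_{j=1}^7|\beta_j|,$$ so minimizing it is exactly the problem above with $\lambda = \frac{2\sigma^2}{b}$.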
A nice property of this regularization is that by increasing the size of $\lambda$ we can effectively decrease the number of factors we include in our model, since it drives the coefficients of irrelevant variables toward zero. Using the same Python package as before, we can fit a lasso regression as follows:
# L1_wt defaults to 1, so fit_regularized applies a pure lasso (L1) penalty of strength alpha
result_lasso = sm.OLS(Y, X).fit_regularized(alpha=100)
print(result_lasso.params)   # the regularized results object exposes the fitted coefficients
Looking at these coefficients, how do they compare to the values we had before? How would you assess the results of this regression compared to the previous model?
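One way to answer this is to line the two coefficient vectors up side by side; a minimal sketch reusing the result and result_lasso objects fit above (the row labels match the column order of the design matrix built earlier):

coef_names = ['sensor1','sensor2','sensor3','sensor4','sensor5','sensor6','sensor7','intercept']
comparison = pd.DataFrame({'ols': result.params, 'lasso': result_lasso.params}, index=coef_names)
print(comparison)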
We can cut attributes from our model by adding some threshold value $\zeta$: we only include predictor $x_j$ in the model if $|\beta_j| \geq \zeta$. Setting $\zeta = 0.05$, we can see how many factors we would include in our model as a function of $\lambda$.
lambdas = np.linspace(0, 800, 700)
zeta = 0.05
num_vars = []
mse = []
for lambd in lambdas:
    result_lasso = sm.OLS(Y, X).fit_regularized(alpha=lambd)
    coeffs = result_lasso.params
    num_pos = 0
    for beta in coeffs:
        if abs(beta) > zeta:        # count coefficients that survive the threshold
            num_pos += 1
    num_vars.append(num_pos)
    mse.append(np.mean((Y - X.dot(coeffs))**2))   # MSE of the fit at this regularization strength
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
plt.plot(lambdas, num_vars)
plt.xlabel('lambda')
plt.ylabel('number of retained coefficients')
Let's compare this plot to a plot of the MSE of our fit.
plt.figure()
plt.plot(lambdas, mse)
plt.xlabel('lambda')
plt.ylabel('MSE of the fit')
What can we say about our data by considering these two plots?
In future discussions we will consider other techniques for model selection that help us avoid overfitting.