IEOR 165 Discussion 3: Maximum Likelihood Estimation

In the past few discussions we have considered linear models and linear regression.

In this discussion we will look at examples of Maximum Likelihood Estimation (MLE) and use this concept to derive what are known as generalized linear models.

Before we begin let's review some technical terminology.

Parametric Model: a mathematical model for the underlying data-generation process. A critical assumption is that this process can be characterized by a small set of parameters.

EX: If we let our data be $X_1,X_2,...,X_n$, then saying $X_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$ would be a parametric model with parameters $\mu$ and $\sigma^2$.

In general we can say our data is i.i.d. and has some density function $f(x|\theta_1,\theta_2,...,\theta_d)$ (alternatively, a mass function for a discrete distribution), where the $\theta_i$ are a set of unknown parameters. Often we construct these models such that estimating these parameters gives us all relevant information about the data.

Likelihood Function: a function of the parameters which outputs the probability (for discrete data) or probability density (for continuous data) of observing the data given that set of parameters. If we have a single data point, we can express this as $P(X = x | \theta_1,\theta_2,...,\theta_d)$ for a discrete random variable or $f(x|\theta_1,\theta_2,...,\theta_d)$ for a continuous random variable. Usually we denote this function as $\mathcal{L}(\theta_1,\theta_2,...,\theta_d)$.

If we have multiple data points which are i.i.d. we can write the likelihood function as: $$\mathcal{L}(\theta_1,\theta_2,...,\theta_d) = \prod_{i=1}^{n} f(x_i|\theta_1,\theta_2,...,\theta_d) $$

As described in lecture, MLE is a method for estimating the parameters of a parametric model by solving an optimization problem. In particular, we use $(\theta^*_1,\theta^*_2,...,\theta^*_d)$, the maximizers of $\mathcal{L}(\theta_1,\theta_2,...,\theta_d)$. In this discussion we will look at some applications of this technique.
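To make this concrete, here is a minimal numerical sketch (with a made-up sample, purely for illustration) showing that for i.i.d. $\mathcal{N}(\mu, 1)$ data, numerically maximizing the likelihood over $\mu$ recovers the sample mean:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# A hypothetical i.i.d. sample, for illustration only.
x = np.array([4.8, 5.1, 5.3, 4.9, 5.2])

# Likelihood of N(mu, 1) data: a product of densities, one factor per observation.
def likelihood(mu):
    return np.prod(norm.pdf(x, loc=mu, scale=1.0))

# MLE: maximize the likelihood, i.e. minimize its negative, over mu.
res = minimize_scalar(lambda mu: -likelihood(mu), bounds=(0.0, 10.0), method='bounded')
print(res.x, np.mean(x))  # the numerical maximizer agrees with the sample mean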

Example 1: Calculating MLE for Arrivals in a Queue

Recall our retail example from the first discussion, in which we were considering three potential locations for opening a new store. Below is the foot-traffic data we collected at each location.

| Location \ Time of Day | 7:30AM - 9:30AM | 9:30AM - 11:30AM | 11:30AM - 1:30PM | 1:30PM - 3:30PM | 3:30PM - 5:30PM | 5:30PM - 7:30PM |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 10 | 6 | 9 | 5 | 17 | 12 |
| 2 | 5 | 6 | 10 | 10 | 4 | 9 |
| 3 | 11 | 8 | 8 | 11 | 11 | 9 |

One reasonable assumption we made about this data was that it follows a Poisson distribution and that the time intervals are independent. Does this assumption constitute a parametric model? If so, what are the parameter and the distribution family we are considering?

Indeed, we have three data sets, each with six i.i.d. observations. Recall that the probability mass function of a Poisson distribution with rate $\lambda$ is given by: $$f(x) = \frac{\lambda^x e^{-\lambda}}{x!} $$

Using this information, can you formulate the maximum likelihood estimation problem for 6 i.i.d. Poisson observations?

$$\max_{\lambda > 0} \mathcal{L}(\lambda) = \max_{\lambda > 0} \prod_{i = 1}^{6} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \max_{\lambda > 0} \frac{\lambda^{\sum_{i=1}^6 x_i}e^{-6\lambda}}{\prod_{i=1}^6 x_i!} $$

Solve this problem to obtain the MLE of $\lambda$ for 6 observations.

Since we are optimizing a smooth function of a single parameter over the open set $\lambda > 0$, we can find the maximizer by taking the derivative and setting it to zero. $$\frac{d \mathcal{L}}{d\lambda} = \frac{1}{\prod_{i=1}^6 x_i!}\left(\Big(\sum_{i=1}^6 x_i\Big)\lambda^{\sum_{i=1}^6 x_i - 1}e^{-6\lambda} - 6\lambda^{\sum_{i=1}^6 x_i}e^{-6\lambda}\right) = 0$$

We can simplify this expression by noting $e^{-6\lambda} > 0$ to obtain: $$\Big(\sum_{i=1}^6 x_i\Big)\lambda^{\sum_{i=1}^6 x_i - 1} = 6\lambda^{\sum_{i=1}^6 x_i}$$ $$\lambda^* = \frac{1}{6} \sum_{i=1}^6 x_i$$
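As a quick sanity check of this closed form, here is a short sketch (using the Location 1 counts from the table above) that minimizes the negative log-likelihood numerically and compares the result with the sample mean:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

# Location 1 observations from the foot-traffic table.
x = np.array([10, 6, 9, 5, 17, 12])

# Negative log-likelihood of 6 i.i.d. Poisson(lam) observations;
# gammaln(x + 1) equals log(x!).
def neg_log_likelihood(lam):
    return -(np.sum(x) * np.log(lam) - 6 * lam - np.sum(gammaln(x + 1)))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method='bounded')
print(res.x, np.mean(x))  # both are approximately 9.8333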

What are our MLE estimates for the mean and variance of each data set? (Recall that a Poisson distribution has mean and variance both equal to $\lambda$.) How do these differ from our method of moments estimates?

In [12]:
import numpy as np
location_1 = np.array([10,6,9,5,17,12])
location_2 = np.array([5,6,10,10,4,9])
location_3 = np.array([11,8,8,11,11,9])

mean_1 = np.average(location_1)
mean_2 = np.average(location_2)
mean_3 = np.average(location_3)

# For a Poisson model the MLE of the rate is the sample mean, which
# estimates both the mean and the variance of the data.
print('The average foot traffic per interval at Location 1 is: ' + str(mean_1) + ' and the variance is: ' + str(mean_1))
print('The average foot traffic per interval at Location 2 is: ' + str(mean_2) + ' and the variance is: ' + str(mean_2))
print('The average foot traffic per interval at Location 3 is: ' + str(mean_3) + ' and the variance is: ' + str(mean_3))
The average foot traffic per interval at Location 1 is: 9.83333333333 and the variance is: 9.83333333333
The average foot traffic per interval at Location 2 is: 7.33333333333 and the variance is: 7.33333333333
The average foot traffic per interval at Location 3 is: 9.66666666667 and the variance is: 9.66666666667

For the Poisson distribution, the method of moments estimator obtained by matching the first moment is also the sample mean, so the two approaches coincide here. In general, however, the two estimates need not agree, as the sketch below shows.
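As a concrete case where the two disagree, consider i.i.d. Uniform$[0,\theta]$ data: the likelihood $\theta^{-n}$ is decreasing on $[\max_i x_i, \infty)$, so the MLE is $\max_i x_i$, while matching the first moment ($\mathbb{E}[X] = \theta/2$) gives $2\bar{x}$. A quick sketch with simulated data (hypothetical, for illustration only):

import numpy as np

# Simulate a hypothetical Uniform[0, 10] sample.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=20)

theta_mle = np.max(x)       # MLE: the likelihood is maximized at the sample maximum
theta_mom = 2 * np.mean(x)  # method of moments: match E[X] = theta / 2
print(theta_mle, theta_mom)  # generally two different numbers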

Example 2: Logistic Regression and Generalized Linear Models

Recall the general form of a linear model from our discussion of linear regression. Given a predictor vector $x_i$ with $d$ entries and a response $y_i$, we have considered models of the form: $$y_i = \sum_{j=1}^d\beta_j x_{i,j} + \beta_0 + \epsilon_i$$

In general we assumed that the $\epsilon_i$ are i.i.d. and $\mathbb{E}[\epsilon_i] = 0 $ and $\mathbb{E}[\epsilon_i^2] < \infty$, and used OLS to solve our estimation problem. Recall that the OLS problem can be written as:

$$ \min_{\beta} \sum_{i=1}^n \Big(y_i - \sum_{j=1}^d\beta_j x_{ij} - \beta_0 \Big)^2$$

Is there an MLE interpretation for this problem? If so what is it and what additional assumption do we have to make?
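To see the connection, suppose additionally that $\epsilon_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. Then $y_i \sim \mathcal{N}(\sum_{j=1}^d\beta_j x_{ij} + \beta_0, \sigma^2)$, and the negative log-likelihood of the data is $$-\ln \mathcal{L}(\beta) = \frac{n}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^d\beta_j x_{ij} - \beta_0\Big)^2,$$ so for any fixed $\sigma^2$, minimizing over $\beta$ is exactly the OLS problem. The additional assumption we need is that the noise is Gaussian.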

Since OLS corresponds to MLE under Gaussian noise, this suggests that these linear models are best used for continuous response data, but perhaps not for categorical or discrete data. For instance, let's consider the following data set showing whether or not students passed the previous IEOR 165 final exam, along with how many hours of sleep and studying they had in the week prior.

In [3]:
import pandas as pd
student_data = pd.read_csv('student_data.csv')
print(student_data.head(20))
    passed_test  sleep_hours  study_hours
0             1           44           68
1             0           19           25
2             0           79            7
3             0           16           26
4             1           12           25
5             0           71           22
6             1           23            9
7             1            8           67
8             1           78           23
9             1           13           27
10            0           32           37
11            1           57           57
12            1           27           38
13            0           25            8
14            1           18           32
15            1           35           34
16            0           36           10
17            1           21           23
18            0           57           15
19            1           75           25

If we wanted to form a model from this data which values should we use as our predictors and which should be our response variable? Does this data seem like it would satisfy our assumptions for linear regression?

The linear model we would construct would be of the form: $$\text{passed\_test}_i = \beta_1\cdot \text{sleep\_hours}_i + \beta_2\cdot \text{study\_hours}_i + \beta_0 + \epsilon_i$$

We can construct a model for this data using notions from MLE. Given the values our response variable takes, what is a reasonable distribution to consider?

We note $y \in \{0,1\}$, so one model we can use is the Bernoulli distribution, which is often parametrized by its success probability $p$. Recall that the p.m.f. of a Bernoulli distribution is given by: $$f(y) = \begin{cases}1-p & \text{if } y = 0 \\ p & \text{if } y = 1 \end{cases} $$

We can quickly see (by maximizing $\prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}$ over $p$) that $\hat{p}_{mle} = \frac{1}{n}\sum_{i=1}^{n} y_i$, which is not very useful since it ignores our predictors. We would like to use something like the linear model from regression, but this is problematic since $0 \leq p \leq 1$ while a linear function can take any value on the real line. It is therefore useful to transform our parameter into one which can take any real value. One useful reparametrization in this case is the log-odds, $\ln (\frac{p}{1-p})$, which ranges over all of $\mathbb{R}$. So we can write a linear model of the form:

$$\ln (\frac{p_i}{1-p_i}) = \sum_{j=1}^d\beta_j x_{ij} + \beta_0$$

Solving for $p_i$ we obtain: $p_i = \frac{1}{1 + \exp(-\sum_{j=1}^d\beta_j x_{ij} - \beta_0)} $. This is known as a logistic function, and hence this technique is known as logistic regression. We can formulate our regression problem using MLE. Note that the p.m.f. can be written as $f(y) = p^y(1-p)^{(1-y)} = \exp(y \ln(\frac{p}{1-p}) + \ln(1-p))$.

Substituting the linear model for the log-odds and taking the negative log transform, we obtain the following MLE problem: $$\min_{\beta} \sum_{i=1}^n\ln\Big(1+ \exp\Big(\sum_{j=1}^d\beta_j x_{ij} + \beta_0\Big)\Big) - \sum_{i=1}^n y_i\Big(\sum_{j=1}^d\beta_j x_{ij} + \beta_0\Big)$$
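Even a general-purpose optimizer can handle this problem. Below is a minimal sketch that minimizes the negative log-likelihood directly; the design matrix and labels are toy placeholders (not the student data), and the identity $\ln(1+e^{\eta}) = \operatorname{logaddexp}(0, \eta)$ is used for numerical stability:

import numpy as np
from scipy.optimize import minimize

# Negative log-likelihood for logistic regression. X is n x (d+1) with a
# final column of ones for the intercept; y is a 0/1 response vector.
def neg_log_likelihood(beta, X, y):
    eta = X @ beta  # linear predictor for each observation
    return np.sum(np.logaddexp(0.0, eta)) - y @ eta

# Toy data, purely for illustration.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 80, 50), rng.uniform(0, 80, 50), np.ones(50)])
y = (rng.uniform(size=50) < 0.5).astype(float)

res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y))
print(res.x)  # estimated (beta_1, beta_2, beta_0)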

Since this is a very commonly used regression technique there are many specialized software packages that can solve this problem for us.

In [8]:
import statsmodels.api as sm
import numpy as np
# Add an explicit column of ones so the model includes an intercept.
student_data['intercept'] = np.ones(len(student_data))
X = student_data[['sleep_hours','study_hours','intercept']][:-1]
Y = student_data['passed_test'][:-1]
# Convert the pandas objects to plain NumPy arrays.
X = X.to_numpy()
Y = Y.to_numpy()

success_prob_model = sm.GLM(Y, X, family=sm.families.Binomial())
result = success_prob_model.fit()
print(result.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                   49
Model:                            GLM   Df Residuals:                       46
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -28.841
Date:                Wed, 08 Feb 2017   Deviance:                       57.682
Time:                        13:02:55   Pearson chi2:                     47.9
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.0116      0.013      0.880      0.379        -0.014     0.037
x2             0.0523      0.020      2.615      0.009         0.013     0.091
const         -1.6639      0.820     -2.029      0.042        -3.271    -0.057
==============================================================================

Using these results, what would we predict for the probability of a student passing the final, given that they had 70 hours of sleep and spent 14 hours studying in the week prior to the exam?

In [9]:
# Plug the fitted coefficients into the logistic function:
# p = 1 / (1 + exp(-(beta_1 * sleep + beta_2 * study + beta_0)))
success_probability = 1.0/(1.0 + np.exp(-0.0116*70 - 0.0523*14 + 1.6639))
print(success_probability)
0.470110679553
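Alternatively, statsmodels can evaluate the fitted model for us without retyping rounded coefficients. Assuming the same column order used when fitting (sleep_hours, study_hours, intercept), something like the following should print essentially the same probability:

# Predicted success probability at sleep = 70, study = 14 (intercept column = 1).
print(result.predict(np.array([[70.0, 14.0, 1.0]])))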

As we can see in the code, logistic regression falls under the class of models known as generalized linear models (GLMs). If we have a parametric model for our response variable $y$ with parameter $\theta$, a generalized linear model can be written as: $$g(\theta(x_i)) = \sum_{j=1}^d\beta_j x_{i,j} + \beta_0 $$

where $g$, known as the link function, is (usually) invertible. This class of models allows us to consider more complicated relationships in our data and is a very powerful data analytics tool.
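For instance, Poisson regression is another generalized linear model, in which $g = \ln$ so that $\ln \lambda_i = \sum_{j=1}^d \beta_j x_{ij} + \beta_0$. Below is a minimal sketch using simulated counts (hypothetical data, for illustration only):

import numpy as np
import statsmodels.api as sm

# Simulate count data whose log-rate is linear in two predictors.
rng = np.random.default_rng(1)
X = sm.add_constant(rng.uniform(0, 10, size=(30, 2)))  # prepend a column of ones
y = rng.poisson(lam=np.exp(0.5 + 0.2 * X[:, 1] - 0.1 * X[:, 2]))

# Fit the GLM with a Poisson family (log link by default).
poisson_model = sm.GLM(y, X, family=sm.families.Poisson())
print(poisson_model.fit().summary())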