IEOR 165 Discussion 1: Method of Moments and Linear Regression

In this discussion session we will go over an example of the method of moments with simulated data, as well as an example of linear regression on a real-world data set.

First, some housekeeping:

  • Name: Yonatan Mintz
  • Discussion Sessions (choose either one):
    • Section 1: F 10-11A, 310 Hearst Memorial Mining Building
    • Section 2: W 3-4P, 240 Bechtel
  • Office Hours: T/Th 3-4PM Etcheverry 4176B

Method of Moments Example: Analyzing Data from Arrivals in a Queue

Suppose we are managing a retail chain and want to open a new store at one of three potential locations. Before committing to a specific location, we want to analyze the average amount of traffic our storefront would receive were we to build it at that location. Based on an initial investigation, our analysts gather the following measurements from similar storefronts in our candidate locations:

Location \ Time of Day   7:30AM-9:30AM   9:30AM-11:30AM   11:30AM-1:30PM   1:30PM-3:30PM   3:30PM-5:30PM   5:30PM-7:30PM
                     1              10                6                9               5              17              12
                     2               5                6               10              10               4               9
                     3              11                8                8              11              11               9

We can make a preliminary estimate from our data using the naive mean and variance estimates. Recall that for the mean $\mu = \mathbb{E}[X]$ we can construct the estimate $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i$, and for the variance $\sigma^2 = \text{Var}(X)$ we have the estimate $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n X^2_i - \hat{\mu}^2$. Using these estimates, what are the mean and variance of the traffic at each location?

In [9]:
import numpy as np
location_1 = np.array([10,6,9,5,17,12])
location_2 = np.array([5,6,10,10,4,9])
location_3 = np.array([11,8,8,11,11,9])

# sample means
mean_1 = np.average(location_1)
mean_2 = np.average(location_2)
mean_3 = np.average(location_3)

# naive variance estimates: (1/n) * sum(X_i^2) - mean^2
variance_1 = np.average(location_1**2) - mean_1**2
variance_2 = np.average(location_2**2) - mean_2**2
variance_3 = np.average(location_3**2) - mean_3**2

print('The average daily traffic in Location 1 is: ' + str(mean_1) + ' and the variance is: ' + str(variance_1))
print('The average daily traffic in Location 2 is: ' + str(mean_2) + ' and the variance is: ' + str(variance_2))
print('The average daily traffic in Location 3 is: ' + str(mean_3) + ' and the variance is: ' + str(variance_3))
The average daily traffic in Location 1 is: 9.83333333333 and the variance is: 15.8055555556
The average daily traffic in Location 2 is: 7.33333333333 and the variance is: 5.88888888889
The average daily traffic in Location 3 is: 9.66666666667 and the variance is: 1.88888888889

However, we can make more "reasonable" estimates of these quantities using our prior knowledge of the application. Recall the method of moments from class, where we estimated parameters using estimates of their moments. Since the $j^{th}$ moment of our data is given by $\mu_j = \mathbb{E}[X^j]$, we can easily compute an estimate for it as $\hat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$, which converges to the true moment by the Law of Large Numbers (LLN).
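As an illustration, here is a minimal sketch that computes the first two sample moments of each location's data (the helper `sample_moment` is introduced here just for this example):

In [ ]:
def sample_moment(x, j):
    # j-th raw sample moment: (1/n) * sum_i x_i^j
    return np.average(x**j)

for name, data in [('Location 1', location_1),
                   ('Location 2', location_2),
                   ('Location 3', location_3)]:
    print(name + ': 1st moment = ' + str(sample_moment(data, 1)) +
          ', 2nd moment = ' + str(sample_moment(data, 2)))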

Since we are concerned with arrivals at the store, one useful distribution to consider is the Poisson distribution $\text{Poiss}(\lambda)$. Many applications in queuing (such as call center traffic, bus/train arrivals at a particular station, etc.) can be accurately described by a Poisson distribution. How can we apply this distribution to construct a method of moments estimate for the average traffic and variance at each location?

One nice property of the Poisson distribution is that if $X \sim \text{Poiss}(\lambda)$ then $\mathbb{E}[X] = \text{Var}(X) = \lambda$, so our method of moments estimate is simply the sample mean: $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^n X_i$. This gives us the following estimates:

In [10]:
# Poisson method of moments estimate: lambda-hat is the sample mean
lambda_1 = np.average(location_1)
lambda_2 = np.average(location_2)
lambda_3 = np.average(location_3)

print('The average daily traffic in Location 1 is: ' + str(lambda_1) + ' and the variance is: ' + str(lambda_1))
print('The average daily traffic in Location 2 is: ' + str(lambda_2) + ' and the variance is: ' + str(lambda_2))
print('The average daily traffic in Location 3 is: ' + str(lambda_3) + ' and the variance is: ' + str(lambda_3))
The average daily traffic in Location 1 is: 9.83333333333 and the variance is: 9.83333333333
The average daily traffic in Location 2 is: 7.33333333333 and the variance is: 7.33333333333
The average daily traffic in Location 3 is: 9.66666666667 and the variance is: 9.66666666667

How do these values compare to our previous estimates? Also, is there a method of moments interpretation of our initial estimates? If so, what distribution are we assuming the data follows?
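As a quick sanity check of the Poisson assumption, we can simulate a large Poisson sample and verify that its sample mean and naive sample variance are nearly equal. Below is a minimal sketch; the choice of $\lambda = 10$ and the sample size are arbitrary:

In [ ]:
np.random.seed(0)  # fix the seed so the simulation is reproducible
simulated = np.random.poisson(lam=10, size=100000)
sim_mean = np.average(simulated)
sim_variance = np.average(simulated**2) - sim_mean**2
print('Simulated mean: ' + str(sim_mean) + ', simulated variance: ' + str(sim_variance))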

Linear Regression Example: Analyzing a Data Set of Hotel Energy Consumption

In this example we will analyze a real-world data set of hotel energy consumption. First, let's look at the data available to us. What sorts of questions can we answer by applying linear regression analysis to this data set (i.e., which variables should be predictors and which should be dependent variables)?

In [6]:
import pandas as pd
hotel_data = pd.read_csv('hotel_energy.csv')
hotel_data.set_index('hotel', inplace=True)  # index rows by hotel number
print(hotel_data)
         enrgcons      area  age  numrooms  occrate  effrooms
hotel                                                        
1       1953916.0   43000.0    6       420     0.33    136.92
2       1045555.0   19979.0   16       215     0.63    135.45
3       4245313.0   46529.0    7       273     0.65    177.59
4       2126199.0   20962.0    6       222     0.71    156.51
5       2785958.0   24212.0    5       474     0.70    330.38
6      13833968.0  112200.0    4       787     0.49    385.39
7       5558123.0   45000.0    3       325     0.49    159.25
8       4001213.0   28548.0    6       199     0.52    104.02
9       4669758.0   32865.0    8       359     0.50    179.50
10      8924035.0   59406.0    5       503     0.58    290.03
11      6865534.0   45000.0   10       416     0.94    391.37
12      6014590.0   37435.0   13       418     0.69    287.17
13      8185738.0   50828.0    4       347     0.49    170.03
14     11736136.0   68000.0   13       455     0.64    292.11
15     14837426.0   78868.0    8       511     0.64    324.54
16      5366490.0   28454.0   13       219     0.77    167.62
17     13516215.0   70000.0    4       501     0.68    338.53
18      3884425.0   20000.0    5       197     0.66    130.02
19     10573409.0   50000.0   12       318     0.62    195.57

One question of interest may be to predict energy consumption using the independent variables in our data set (area, age, number of rooms, occupancy rate, and effective number of guest rooms). Our analysis will produce a model of the form:

$$ \text{enrgcons}_i = a_1 \cdot \text{area}_i + a_2 \cdot \text{age}_i + a_3 \cdot \text{numrooms}_i + a_4 \cdot \text{occrate}_i + a_5 \cdot \text{effrooms}_i + a_0 + \epsilon_i$$

where the index $i$ signifies the hotel number, and $\epsilon_i$ is our measurement noise with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] < \infty$.

To obtain an estimate of the intercept, we first add a column of ones to our initial data; then, using least squares, we can compute our parameter values.
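In matrix form, if we stack the predictor rows (including the column of ones) into a matrix $X$ and the energy consumption values into a vector $Y$, the least squares estimate solves

$$ \hat{a} = \arg\min_{a} \lVert Y - Xa \rVert_2^2 = (X^\top X)^{-1} X^\top Y, $$

provided that $X^\top X$ is invertible.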

In [8]:
import statsmodels.api as sm
import numpy as np
# add an explicit intercept column of ones to the design matrix
hotel_data['intercept'] = np.ones((len(hotel_data),))
# note: [:-1] drops the last hotel from the fit, leaving n = 18 observations
X = hotel_data[['area','age','numrooms','occrate','effrooms','intercept']][:-1]
Y = hotel_data[['enrgcons']][:-1]
X = X.values
Y = Y.values

result = sm.OLS(Y, X).fit()
result.summary()
C:\Users\ymintz\Anaconda3\lib\site-packages\scipy\stats\stats.py:1535: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=18
  "anyway, n=%i" % int(n))
Out[8]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.888
Model:                            OLS   Adj. R-squared:                  0.841
Method:                 Least Squares   F-statistic:                     18.93
Date:                Wed, 25 Jan 2017   Prob (F-statistic):           2.55e-05
Time:                        18:13:15   Log-Likelihood:                -280.34
No. Observations:                  18   AIC:                             572.7
Df Residuals:                      12   BIC:                             578.0
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1           205.2864     34.559      5.940      0.000       129.989   280.583
x2         -1.151e+04   1.23e+05     -0.094      0.927     -2.79e+05  2.56e+05
x3         -3.763e+04   1.84e+04     -2.046      0.063     -7.77e+04  2433.196
x4         -1.323e+07    1.2e+07     -1.098      0.294     -3.95e+07   1.3e+07
x5          5.335e+04   2.92e+04      1.829      0.092     -1.02e+04  1.17e+05
const       7.397e+06   7.13e+06      1.038      0.320     -8.13e+06  2.29e+07
==============================================================================
Omnibus:                        1.770   Durbin-Watson:                   1.074
Prob(Omnibus):                  0.413   Jarque-Bera (JB):                1.295
Skew:                          -0.631   Prob(JB):                        0.523
Kurtosis:                       2.635   Cond. No.                     1.78e+06
==============================================================================
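As a cross-check, we can solve the same least squares problem directly with `numpy`. Below is a minimal sketch reusing the `X` and `Y` arrays built above; the resulting coefficients should match the statsmodels output up to numerical precision:

In [ ]:
# solve min_a ||Y - X a||_2^2 directly; coefficients are ordered
# (area, age, numrooms, occrate, effrooms, intercept) as in X
a_hat, residuals, rank, sing_vals = np.linalg.lstsq(X, Y)
print(a_hat)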

Analyzing the coefficients of our model, we note that most of them are extremely large, spanning several orders of magnitude. Why is this the case? In general, is this sort of result ideal? If so, why, and if not, what should we do to change the data?
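One common remedy is to rescale the predictors so all columns are on comparable scales before refitting. Below is a sketch under the assumption that we z-score each predictor column; this is one option among several (another would be to convert the variables to different units):

In [ ]:
predictors = hotel_data[['area','age','numrooms','occrate','effrooms']][:-1]
# z-score each predictor column: subtract its mean, divide by its standard deviation
X_std = (predictors - predictors.mean()) / predictors.std()
X_std['intercept'] = 1.0
result_std = sm.OLS(Y, X_std.values).fit()
print(result_std.params)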