In this discussion session we will go over an example of the Method of Moments with simulated data, as well as an example of Linear Regression with a real-world data set.
First, some housekeeping:
Suppose we are managing a retail chain and we want to open a new location at one of 3 potential sites. Before committing to a specific location we want to analyze the average amount of traffic our store front would receive were we to build it at that location. Based on initial investigation, our analysts gather the following measurements from similar store fronts in our candidate locations:
Location\Time of Day | 7:30AM - 9:30AM | 9:30AM - 11:30AM | 11:30AM-1:30PM | 1:30PM - 3:30PM | 3:30PM - 5:30PM | 5:30PM - 7:30PM |
---|---|---|---|---|---|---|
1 | 10 | 6 | 9 | 5 | 17 | 12 |
2 | 5 | 6 | 10 | 10 | 4 | 9 |
3 | 11 | 8 | 8 | 11 | 11 | 9 |
We can make a preliminary estimate from our data by using the naive mean and variance estimators. Recall that for the mean $\mu = \mathbb{E}[X]$ we can construct the estimate $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i$, and for the variance $\sigma^2 = \text{Var}(X)$ we have the estimate $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n X^2_i - \hat{\mu}^2$. Using these estimates, what are the mean and variance of the traffic at each location?
import numpy as np

# Traffic counts for each location, one entry per time-of-day window
location_1 = np.array([10, 6, 9, 5, 17, 12])
location_2 = np.array([5, 6, 10, 10, 4, 9])
location_3 = np.array([11, 8, 8, 11, 11, 9])

# Sample means
mean_1 = np.average(location_1)
mean_2 = np.average(location_2)
mean_3 = np.average(location_3)

# Naive variance estimates: mean of the squares minus the square of the mean
variance_1 = np.average(location_1**2) - mean_1**2
variance_2 = np.average(location_2**2) - mean_2**2
variance_3 = np.average(location_3**2) - mean_3**2

print('The average traffic in Location 1 is: ' + str(mean_1) + ' and the variance is: ' + str(variance_1))
print('The average traffic in Location 2 is: ' + str(mean_2) + ' and the variance is: ' + str(variance_2))
print('The average traffic in Location 3 is: ' + str(mean_3) + ' and the variance is: ' + str(variance_3))
However, we can make more "reasonable" estimates of these quantities using our prior knowledge of the application. Recall the method of moments from class, where we estimated parameters using estimates of their moments. Since the $j^{th}$ moment of our data is given by $\mu_j = \mathbb{E}[X^j]$, we can easily compute an estimate for it as $\hat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$, which converges to the true moment by the Law of Large Numbers (LLN).
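As a concrete illustration (a minimal sketch, reusing the `location_1` array and the NumPy import from the cell above), the $j^{th}$ sample moment can be computed directly:

# j-th sample moment: the average of each observation raised to the j-th power
def sample_moment(data, j):
    return np.average(data**j)

print('First sample moment of Location 1 (the sample mean): ' + str(sample_moment(location_1, 1)))
print('Second sample moment of Location 1: ' + str(sample_moment(location_1, 2)))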
Since we are concerned with arrivals at the store, one useful distribution to consider is the Poisson distribution $\text{Poiss}(\lambda)$. Many applications in queuing (such as call center traffic, bus/train arrivals at a particular station, etc.) can be accurately described by a Poisson distribution. How can we apply this distribution to construct a method of moments estimate for the average traffic and variance in each location?
One nice property of the Poisson distribution is that if $X \sim \text{Poiss}(\lambda)$ then $\mathbb{E}[X] = \text{Var}(X) = \lambda$, so our method of moments estimate is simply the sample mean: $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^n X_i$. This gives us the following estimates:
# Poisson method of moments: lambda-hat is the sample mean,
# and it estimates both the mean and the variance
lambda_1 = np.average(location_1)
lambda_2 = np.average(location_2)
lambda_3 = np.average(location_3)
print('The average traffic in Location 1 is: ' + str(lambda_1) + ' and the variance is: ' + str(lambda_1))
print('The average traffic in Location 2 is: ' + str(lambda_2) + ' and the variance is: ' + str(lambda_2))
print('The average traffic in Location 3 is: ' + str(lambda_3) + ' and the variance is: ' + str(lambda_3))
How do these values compare to our previous estimates? Also, is there a method of moments interpretation of our initial estimates, and if so, what distribution are we assuming the data follows?
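To make the comparison concrete, here is a short sketch (reusing the quantities computed in the cells above) that prints the naive variance estimate next to the Poisson method of moments variance for each location:

# Compare the naive variance estimates with the Poisson-based ones
naive_variances = [variance_1, variance_2, variance_3]
poisson_variances = [lambda_1, lambda_2, lambda_3]
for i in range(3):
    print('Location ' + str(i + 1) + ': naive variance = ' + str(naive_variances[i]) + ', Poisson variance = ' + str(poisson_variances[i]))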
In this example we will analyze a real-world data set of hotel energy consumption. First let's look at the data available to us. What sort of questions can we answer using linear regression analysis on this data set (which variables should be predictors and which should be dependent variables)?
import pandas as pd

# Load the hotel energy data set and index it by hotel number
hotel_data = pd.read_csv('hotel_energy.csv')
hotel_data.set_index('hotel', inplace=True)
print(hotel_data)
One question of interest may be to predict energy consumption using the independent variables in our data set (area, age, number of rooms, occupancy rate, effective number of guest rooms). The type of model our analysis will produce has the form:
$$ \text{enrgcons}_i = a_1 \cdot \text{area}_i + a_2 \cdot \text{age}_i + a_3 \cdot \text{numrooms}_i + a_4 \cdot \text{occrate}_i + a_5 \cdot \text{effrooms}_i + a_0 + \epsilon_i$$ where the index $i$ denotes the hotel number, and $\epsilon_i$ is our measurement noise with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] < \infty$.
To obtain an estimate of the intercept we first add a column of ones to our data; then, using least squares, we can compute our parameter values.
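Concretely, writing $X$ for the matrix of predictors (including the column of ones) and $Y$ for the vector of energy consumptions, the least squares estimate is, assuming $X^\top X$ is invertible,
$$\hat{a} = \arg\min_{a} \|Y - Xa\|^2 = (X^\top X)^{-1} X^\top Y.$$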
import statsmodels.api as sm
import numpy as np

# Add a column of ones so the fitted model includes an intercept term
hotel_data['intercept'] = np.ones((len(hotel_data),))

# Predictor columns and response (all rows except the last)
X = hotel_data[['area', 'age', 'numrooms', 'occrate', 'effrooms', 'intercept']][:-1]
Y = hotel_data[['enrgcons']][:-1]

# Convert the DataFrames to plain NumPy arrays for the least squares fit
X = X.to_numpy()
Y = Y.to_numpy()

result = sm.OLS(Y, X).fit()
result.summary()
Analyzing the coefficients of our model, we note that they are all extremely large. Why is this the case? In general, is this sort of result ideal? If so, why, and if not, what should we do to change the data?
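One common adjustment to explore here (shown below as a sketch, under the assumption that the issue is the very different scales of the predictors; the column names are the ones used above) is to standardize each predictor before refitting:

# Standardize each predictor to zero mean and unit variance, then refit
X_raw = hotel_data[['area', 'age', 'numrooms', 'occrate', 'effrooms']][:-1]
X_std = (X_raw - X_raw.mean()) / X_raw.std()
X_std['intercept'] = 1.0
result_std = sm.OLS(Y, X_std.to_numpy()).fit()
result_std.summary()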