IEOR 165 Discussion 2: Linear Regression Diagnostics

Last discussion we went over the method of moments and an example of multiple regression with a real-world data set. In this discussion we will analyze a data set using diagnostic tools from linear regression.

Before we begin, just a reminder:

  • Name: Yonatan Mintz
  • Discussion Sessions (choose either one):
    • Section 1: F 10-11A, 310 Hearst Memorial Mining Building
    • Section 2: W 3-4P, 240 Bechtel
  • Office Hours: T/Th 3-4PM Etcheverry 4176B

Recall that in linear regression, we are given data of the form $(x_i,y_i)$ and are interested in finding a linear relationship of the form: $$y_i = \beta_1 \cdot x_i + \beta_0 + \epsilon_i $$

where $\epsilon_i$ are i.i.d. random variables with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] < \infty$. These assumptions are reasonable for many models, especially since many quantities of interest in real-world applications are bounded. When $x_i$ and $\beta_1$ are vectors, the model is known as multiple linear regression. In this discussion we will conduct a multiple linear regression analysis on a data set.
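
As a reminder, once we stack the observations into a design matrix $X$ (one row per observation, including a column of ones for the intercept) and a response vector $Y$, the OLS estimate that we will be computing throughout this discussion has the closed form: $$\hat{\beta} = (X^\top X)^{-1}X^\top Y $$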

Linear Regression Example 1: Movement of an Object

Consider the following data: measurements of the location of an object taken every second. We can think of these as measurements of a car's location from GPS or other location services, or of a drone's location as reported by its sensors. In these cases we know there is some measurement noise in what we observe, so we need to analyze the data to estimate the true location of our target. Here are the observations below:

In [6]:
import pandas as pd
location_data = pd.read_csv('object_movement_data.csv')
print(location_data.head(20))
    time  location_of_object
0      0           26.243454
1      1           13.632436
2      2           23.718282
3      3           27.020314
4      4           54.654076
5      5           30.734613
6      6           78.448118
7      7           60.137931
8      8           77.190391
9      9           77.256296
10    10           99.621079
11    11           69.148593
12    12           90.775828
13    13           93.909456
14    14          112.337694
15    15           92.751087
16    16          104.275718
17    17           98.971416
18    18          109.422138
19    19          115.578152

Looking at this data, what should we use as our predictor and what should be our response variable?

Using linear regression, the sort of model we would be interested in is of the form: $$\text{location_of_object}_i = \text{time}_i\cdot\beta_1 + \beta_0 + \epsilon_i $$ So using our tools from before, let's calculate these coefficients using OLS:

In [7]:
import statsmodels.api as sm
import numpy as np
# add a column of ones so the fitted model includes an intercept term
location_data['intercept'] = np.ones(len(location_data))
X = location_data[['time','intercept']][:-1]
Y = location_data['location_of_object'][:-1]
X = X.values
Y = Y.values
result = sm.OLS(Y,X).fit()
result.summary()
Out[7]:
OLS Regression Results
Dep. Variable: y R-squared: 0.274
Model: OLS Adj. R-squared: 0.259
Method: Least Squares F-statistic: 17.75
Date: Fri, 03 Feb 2017 Prob (F-statistic): 0.000113
Time: 10:22:33 Log-Likelihood: -255.37
No. Observations: 49 AIC: 514.7
Df Residuals: 47 BIC: 518.5
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
x1 -1.9282 0.458 -4.213 0.000 -2.849 -1.007
const 101.8905 12.751 7.991 0.000 76.239 127.542
Omnibus: 6.587 Durbin-Watson: 0.134
Prob(Omnibus): 0.037 Jarque-Bera (JB): 4.236
Skew: -0.553 Prob(JB): 0.120
Kurtosis: 2.076 Cond. No. 54.9

Let us evaluate our regression. Does this model seem like a good fit based on this diagnostic?
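
Rather than reading everything off the summary table, we can pull the key quantities directly from the fitted result object. A quick sketch using standard statsmodels result attributes (the printed values should match the table above):

In [ ]:
# pull the headline diagnostics straight from the fitted result object
print(result.rsquared)      # coefficient of determination (R-squared)
print(result.params)        # estimated coefficients (slope, intercept)
print(result.bse)           # standard errors of the coefficients
print(result.conf_int())    # 95% confidence intervals for the coefficients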

One way to get more insight into our model is to graph the estimated model against the data. If we do so, we obtain the following:

In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
Y_hat = -1.9282*X[:,0] + 101.8905  # fitted line using the coefficients estimated above
plt.plot(X[:,0],Y,'o',X[:,0],Y_hat,'-r')
Out[8]: [figure: observed locations over time with the fitted regression line]

Another useful diagnostic of our model is to plot the residuals, which should give us an estimate of our noise $\epsilon$. Let us look at the residual plot:

In [9]:
%matplotlib inline
plt.figure()
plt.plot(X[:,0],Y-Y_hat,'o')
Out[9]: [figure: residuals versus time]

Does this residual plot satisfy our assumptions on the model? How can we improve our analysis?
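
One quantitative check of the independent-noise assumption is the Durbin-Watson statistic already reported in the summary above (0.134 for this fit): values near 2 indicate roughly uncorrelated residuals, while values near 0 indicate strong positive autocorrelation, a sign that the model is missing structure. As a quick sketch, we could recompute it ourselves from the residuals of the cell above:

In [ ]:
from statsmodels.stats.stattools import durbin_watson
# Durbin-Watson statistic of the residuals: near 2 suggests little autocorrelation,
# near 0 suggests the residuals trend together
print(durbin_watson(Y - Y_hat))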

One powerful aspect of linear regression is that models need to be linear only in the coefficients, not in the data. This gives us a great deal of modeling flexibility, since it allows us to express a wide variety of functions. For instance, we can consider a polynomial of the form: $$y_i = \sum\limits_{k=1}^N \beta_k\cdot x_i^k + \beta_0 + \epsilon_i $$
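
For example, a degree-$N$ polynomial fit is still just OLS once we build the right design matrix. A minimal sketch (the degree N and the name poly_X below are purely illustrative) using numpy's Vandermonde helper:

In [ ]:
# sketch: build a polynomial design matrix of degree N from the time column
N = 3  # illustrative degree
t = location_data['time'].values
# columns are [1, t, t^2, ..., t^N]; the leading column of ones acts as the intercept
poly_X = np.vander(t, N + 1, increasing=True)
poly_result = sm.OLS(location_data['location_of_object'].values, poly_X).fit()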

Can we use this to improve our own analysis? How should we transform our data?

Based on our plots, the data seems to be parabolic, so a model we could consider is of the form: $$\text{location_of_object}_i = \text{time}_i\cdot\beta_1 + \text{time}^2_i\cdot\beta_2 + \beta_0 + \epsilon_i $$

We can solve for these coefficients using the same OLS solver we used previously. This can be done by adding one more column to our data set, this time corresponding to $\text{time}^2$.

In [10]:
%matplotlib inline
location_data['time_squared'] = location_data['time']**2
location_data['intercept'] = np.ones(len(location_data))

X = location_data[['time','time_squared','intercept']][:-1]
Y = location_data['location_of_object'][:-1]
X = X.values
Y = Y.values
result = sm.OLS(Y,X).fit()
result.summary()
Out[10]:
OLS Regression Results
Dep. Variable: y R-squared: 0.966
Model: OLS Adj. R-squared: 0.965
Method: Least Squares F-statistic: 654.0
Date: Fri, 03 Feb 2017 Prob (F-statistic): 1.65e-34
Time: 10:27:23 Log-Likelihood: -180.36
No. Observations: 49 AIC: 366.7
Df Residuals: 46 BIC: 372.4
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
x1 9.7049 0.393 24.693 0.000 8.914 10.496
x2 -0.2424 0.008 -30.608 0.000 -0.258 -0.226
const 10.7642 4.079 2.639 0.011 2.554 18.975
Omnibus: 0.850 Durbin-Watson: 2.456
Prob(Omnibus): 0.654 Jarque-Bera (JB): 0.896
Skew: 0.182 Prob(JB): 0.639
Kurtosis: 2.447 Cond. No. 3.03e+03

How do the results of this regression compare to our previous regression?

Now let's consider the fit plot and residual plot of our new regression model. How do these compare to our previous analysis?

In [11]:
%matplotlib inline
Y_hat = 9.7049*X[:,0] - 0.2424*X[:,1] + 10.7642  # fitted values using the coefficients estimated above
plt.plot(X[:,0],Y,'o',X[:,0],Y_hat,'-r')
plt.figure()
plt.plot(X[:,0],Y-Y_hat,'o')
Out[11]: [figures: data with the fitted quadratic curve, and residuals versus time]
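
As an aside, instead of typing the fitted coefficients in by hand as we did above, we can read the fitted values and residuals directly off the result object, which avoids rounding and copy-paste errors. A small sketch equivalent to the cell above:

In [ ]:
# fitted values and residuals taken directly from the fitted model
Y_hat = result.fittedvalues   # same as X @ result.params
residuals = result.resid      # same as Y - Y_hat
plt.plot(X[:,0], Y, 'o', X[:,0], Y_hat, '-r')
plt.figure()
plt.plot(X[:,0], residuals, 'o')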

Linear Regression Example 2: Using Multiple Regression to Estimate Demand for a Product

Suppose we are interested in forecasting the demand for a certain product, given the following marketing data from previous, similar products on the price level and the amount of money spent on the advertising campaign.

In [12]:
marketing_data = pd.read_csv('marketing_data.csv')
print(marketing_data.head(20))
    ad_buy  price       sales
0       58   53.7  445.927030
1      191   73.5  530.612059
2      165   89.6  536.116398
3      171   57.2  503.308019
4       80   75.5  485.009633
5      121   89.3  511.427117
6      181   70.3  525.762721
7      199   63.3  495.033582
8       99   83.5  482.302370
9      107   94.8  525.598761
10      53   64.4  465.249682
11      74   62.9  464.065529
12      93   96.0  509.740927
13     126   57.1  463.002748
14      76   73.7  457.823505
15     102   89.0  516.192738
16     130   78.1  496.953603
17     159   67.8  516.716043
18     165   77.6  525.624006
19      91   75.4  484.073486

Looking at this data, which variables should we use as predictors and which variable should be our response?

As we saw in our previous example, visualizing the data with a scatter plot can be invaluable in determining what kind of model we want to fit. Let's consider the following two scatter plots of our data.

In [13]:
%matplotlib inline
ad_buy_plot = plt.scatter(marketing_data['ad_buy'],marketing_data['sales'])
plt.xlabel('ad_buy')
plt.ylabel('sales')

plt.figure()
plt.scatter(marketing_data['price'],marketing_data['sales'])
plt.xlabel('price')
plt.ylabel('sales')
Out[13]: [figures: scatter plots of sales versus ad_buy and sales versus price]

Both scatter plots show that the response is somewhat correlated with each predictor, so it seems like both predictors would be useful for estimating our response. Therefore, the model we want to estimate with OLS is of the form: $$\text{sales}_i = \beta_2 \cdot \text{ad_buy}_i + \beta_1 \cdot \text{price}_i + \beta_0 + \epsilon_i $$

Running our OLS, we obtain the following regression results:

In [14]:
marketing_data['intercept'] = np.ones(len(marketing_data))

X = marketing_data[['ad_buy','price','intercept']][:-1]
Y = marketing_data['sales'][:-1]
X = X.values
Y = Y.values
result = sm.OLS(Y,X).fit()
result.summary()
Out[14]:
OLS Regression Results
Dep. Variable: y R-squared: 0.915
Model: OLS Adj. R-squared: 0.911
Method: Least Squares F-statistic: 247.7
Date: Fri, 03 Feb 2017 Prob (F-statistic): 2.36e-25
Time: 10:37:51 Log-Likelihood: -176.56
No. Observations: 49 AIC: 359.1
Df Residuals: 46 BIC: 364.8
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
x1 0.4345 0.029 14.787 0.000 0.375 0.494
x2 1.3639 0.089 15.319 0.000 1.185 1.543
const 338.8541 7.607 44.547 0.000 323.543 354.166
Omnibus: 1.139 Durbin-Watson: 1.812
Prob(Omnibus): 0.566 Jarque-Bera (JB): 0.970
Skew: -0.096 Prob(JB): 0.616
Kurtosis: 2.338 Cond. No. 871.

Looking at the regression results, what do you think is the quality of our fit? Is this surprising?

Interestingly enough, the actual model used to generate this data is of the following form: $$\text{sales}_i = \beta_2 \cdot \text{ad_buy}_i + \beta_1 \cdot \ln(\text{price}_i) + \beta_0 + \epsilon_i $$

Like we did with the polynomial, we can again solve for these coefficients using OLS:

In [15]:
marketing_data['log_price'] = np.log(marketing_data['price'])

X = marketing_data[['ad_buy','log_price','intercept']][:-1]
Y = marketing_data['sales'][:-1]
X = X.values
Y = Y.values
result = sm.OLS(Y,X).fit()
result.summary()
Out[15]:
OLS Regression Results
Dep. Variable: y R-squared: 0.916
Model: OLS Adj. R-squared: 0.913
Method: Least Squares F-statistic: 251.9
Date: Fri, 03 Feb 2017 Prob (F-statistic): 1.66e-25
Time: 10:39:13 Log-Likelihood: -176.18
No. Observations: 49 AIC: 358.4
Df Residuals: 46 BIC: 364.0
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
x1 0.4265 0.029 14.605 0.000 0.368 0.485
x2 100.2894 6.487 15.460 0.000 87.232 113.347
const 11.1953 27.973 0.400 0.691 -45.112 67.503
Omnibus: 0.754 Durbin-Watson: 1.920
Prob(Omnibus): 0.686 Jarque-Bera (JB): 0.840
Skew: -0.189 Prob(JB): 0.657
Kurtosis: 2.482 Cond. No. 2.88e+03

Compare these regression results to the previous model. Do we have a significantly better fit? What conclusions can we then draw about linear regression?
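
One way to make this comparison concrete is to put the adjusted R-squared and AIC of the two fits side by side. A quick sketch, assuming we had kept the linear-price fit around in a separate variable (the name result_linear below is purely illustrative, since we overwrote result when refitting):

In [ ]:
# sketch: compare the linear-price fit and the log-price fit numerically
# (assumes the earlier fit was saved as result_linear before refitting)
print(result_linear.rsquared_adj, result_linear.aic)
print(result.rsquared_adj, result.aic)

Both fits explain this data about equally well, which is worth keeping in mind when thinking about what a good linear regression fit does and does not tell us about the true model.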