In our last discussion we went over the method of moments and an example of multiple regression with a real-world data set. In this discussion we will analyze a data set using diagnostic tools from linear regression.
Before we begin, just a quick reminder:
Recall that in linear regression, we are given data of the form $(x_i,y_i)$ and are interested in finding a linear relationship of the form: $$y_i = \beta_1 \cdot x_i + \beta_0 + \epsilon_i $$
Where the $\epsilon_i$ are i.i.d. random variables with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] < \infty$. These assumptions are reasonable for many models, especially since many quantities of interest in real-world applications are bounded. When $x_i$ and $\beta_1$ are vectors, the model is known as multiple linear regression. In this discussion we will conduct a multiple linear regression analysis on a data set.
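To make these assumptions concrete, here is a minimal sketch of generating data that satisfies the model. It is purely illustrative: the coefficients $\beta_0 = 2$, $\beta_1 = 0.5$ and the Gaussian noise are made up for this example.
import numpy as np

rng = np.random.default_rng(0)
beta_0, beta_1 = 2.0, 0.5                      # hypothetical coefficients, for illustration only
x = np.linspace(0, 10, 50)
epsilon = rng.normal(0.0, 1.0, size=x.shape)   # i.i.d. noise with mean 0 and finite variance
y = beta_1 * x + beta_0 + epsilon
Any noise distribution with mean zero and finite variance would satisfy the assumptions; the Gaussian here is just a convenient choice.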
Consider the following data: measurements of the location of an object taken every second. We can think of these as measurements of, for instance, a car's location from GPS or other location services, or a drone's location as reported by its sensors. In these cases we know there is some measurement noise in what we observe, so we need to analyze the data to find the true location of our target. Here are the observations below:
import pandas as pd

# Load the object location measurements (one observation per second)
location_data = pd.read_csv('object_movement_data.csv')
# Optionally index the data by time:
# location_data.set_index('time', inplace=True)
print(location_data.head(20))
Looking at this data what should we use as our predictor and what should be our response variable?
Using linear regression, the sort of model we would be interested in is of the form: $$\text{location_of_object}_i = \text{time}_i\cdot\beta_1 + \beta_0 + \epsilon_i $$ So using our tools from before, let's calculate these coefficients using OLS:
import statsmodels.api as sm
import numpy as np

# Add an explicit intercept column for OLS
location_data['intercept'] = np.ones(len(location_data))
# Use all but the last observation
X = location_data[['time', 'intercept']][:-1]
Y = location_data['location_of_object'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()
result = sm.OLS(Y, X).fit()
result.summary()
Let us evaluate our regression. Does this model seem like a good fit based on the summary output above?
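Besides reading the full summary table, we can pull out individual diagnostics from the fitted result object; a minimal sketch using standard statsmodels result attributes:
print(result.params)     # estimated coefficients (slope on time, then the intercept)
print(result.pvalues)    # p-values for each coefficient
print(result.rsquared)   # R^2 of the fit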
One way we can get more insights into our model is to graph our estimated model against the data. If we do so we obtain the following:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
# Fitted line using the coefficient estimates from the summary above
Y_hat = -1.9282*X[:,0] + 101.8905
plt.plot(X[:,0], Y, 'o', X[:,0], Y_hat, '-r')
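As an aside, rather than typing the estimated coefficients in by hand from the summary table, we could equivalently compute the fitted values directly from the fitted result with Y_hat = result.predict(X).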
Another useful diagnostic of our model is to plot the residuals, which should give us an estimate of our noise $\epsilon_i$. Let us look at the residual plot:
%matplotlib inline
plt.figure()
# Residuals: observed values minus fitted values
plt.plot(X[:,0], Y - Y_hat, 'o')
Does this residual plot satisfy our assumptions on the model? How can we improve our analysis?
One powerful aspect of linear regression is that models don't need to be linear in the data, only in the coefficients. This is a very flexible modeling tool since it allows us to express a wide variety of functions. For instance, we can consider a polynomial of the form: $$y_i = \sum\limits_{k=1}^N \beta_k\cdot x_i^k + \beta_0 + \epsilon_i $$
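As a quick illustration of what "linear in the coefficients" buys us, here is a minimal sketch of fitting such a polynomial with the same OLS machinery, reusing the simulated x and y from the sketch above (the degree N = 3 is arbitrary and chosen only for illustration):
N = 3
# Design matrix with columns x, x^2, ..., x^N and an intercept column
X_poly = np.column_stack([x**k for k in range(1, N + 1)] + [np.ones(len(x))])
# Still an ordinary least squares problem, because the model is linear in the beta_k
poly_fit = sm.OLS(y, X_poly).fit()
print(poly_fit.params)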
Can we use this to improve our own analysis? How should we transform our data?
Based on our plots, the data seems to be parabolic, so a model we could consider is of the form: $$\text{location_of_object}_i = \text{time}_i\cdot\beta_1 + \text{time}^2_i\cdot\beta_2 + \beta_0 + \epsilon_i $$
We can solve for these coefficients using the same OLS solver we used previously. This can be done by adding one more column to our data set, this time corresponding to $\text{time}^2$.
%matplotlib inline
# Add the squared-time feature and an intercept column
location_data['time_squared'] = location_data['time']**2
location_data['intercept'] = np.ones(len(location_data))
X = location_data[['time', 'time_squared', 'intercept']][:-1]
Y = location_data['location_of_object'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()
result = sm.OLS(Y, X).fit()
result.summary()
How do the results of this regression compare to our previous regression?
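One concrete way to answer this is to keep both fitted result objects around and compare their goodness-of-fit measures side by side. A minimal sketch, assuming we had saved the first fit as result_linear and this one as result_quadratic (hypothetical names, since above we simply reused the variable result):
# Hypothetical names: result_linear is the straight-line fit, result_quadratic the quadratic fit
for name, res in [('linear', result_linear), ('quadratic', result_quadratic)]:
    print(name, res.rsquared, res.rsquared_adj, res.aic)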
Now let's consider the residual plot and fit plot of our new regression model. How do these compare to our previous analysis?
%matplotlib inline
plt.figure()
# Fitted curve using the coefficient estimates from the quadratic model's summary
Y_hat = 9.7049*X[:,0] - 0.2424*X[:,1] + 10.7642
plt.plot(X[:,0], Y, 'o', X[:,0], Y_hat, '-r')
plt.figure()
# Residual plot for the quadratic model
plt.plot(X[:,0], Y - Y_hat, 'o')
Suppose we are interested in forecasting the demand for a certain product, given the following marketing data from previous, similar products: the price level and the amount of money spent on an advertising campaign.
marketing_data = pd.read_csv('marketing_data.csv')
print(marketing_data.head(20))
Looking at this data, which variables should we use as predictors and which variable should be our response?
As we saw in our previous example, visualizing the data with a scatter plot can be invaluable for determining what kind of model we want to fit. Let's consider the following two scatter plots of our data.
%matplotlib inline
# Sales vs. advertising spend
plt.figure()
plt.scatter(marketing_data['ad_buy'], marketing_data['sales'])
plt.xlabel('ad_buy')
plt.ylabel('sales')
# Sales vs. price
plt.figure()
plt.scatter(marketing_data['price'], marketing_data['sales'])
plt.xlabel('price')
plt.ylabel('sales')
The response looks somewhat correlated with each of these variables, so it seems like both predictors would be useful for estimating our response. Therefore, the model we want to estimate with OLS is of the form: $$\text{sales}_i = \beta_2 \cdot \text{ad_buy}_i + \beta_1 \cdot \text{price}_i + \beta_0 + \epsilon_i $$
Running our OLS we obtain the following regression results:
# Add an explicit intercept column for OLS
marketing_data['intercept'] = np.ones(len(marketing_data))
X = marketing_data[['ad_buy', 'price', 'intercept']][:-1]
Y = marketing_data['sales'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()
result = sm.OLS(Y, X).fit()
result.summary()
Looking at the regression results, what do you think is the quality of our fit? Is this surprising?
Interestingly enough the actual model used to generate this data is of the following form: $$\text{sales}_i = \beta_2 \cdot \text{ad_buy}_i + \beta_1 \cdot \ln(\text{price}_i) + \beta_0 + \epsilon_i $$
Like we did with the polynomial, we can again solve for these coefficients using OLS:
# Transform price to log-price and refit
marketing_data['log_price'] = np.log(marketing_data['price'])
X = marketing_data[['ad_buy', 'log_price', 'intercept']][:-1]
Y = marketing_data['sales'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()
result = sm.OLS(Y, X).fit()
result.summary()
Compare these regression results to the previous model; do we have a significantly better fit? What conclusions can we then draw about linear regression?