In our last discussion we went over the method of moments and an example of multiple regression with a real-world data set. In this discussion we will analyze a data set using diagnostic tools from linear regression.
Before we begin, just a quick reminder:
Recall that in linear regression, we are given data of the form $(x_i,y_i)$ and are interested in finding a linear relationship of the form: $$y_i = \beta_1 \cdot x_i + \beta_0 + \epsilon_i $$
Where the $\epsilon_i$ are i.i.d. random variables with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] < \infty$. These assumptions are reasonable for many models, especially since many quantities of interest in real-world applications are bounded. When $x_i$ and $\beta_1$ are vectors, the model is known as multiple linear regression. In this discussion we will conduct a multiple linear regression analysis on a data set.
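To make these assumptions concrete, here is a minimal sketch of generating data that satisfies the model. It is purely illustrative: the coefficients $\beta_0 = 2$, $\beta_1 = 0.5$ and the Gaussian noise are made up for this example.
import numpy as np

rng = np.random.default_rng(0)
beta_0, beta_1 = 2.0, 0.5                      # hypothetical coefficients, for illustration only
x = np.linspace(0, 10, 50)
epsilon = rng.normal(0.0, 1.0, size=x.shape)   # i.i.d. noise with mean 0 and finite variance
y = beta_1 * x + beta_0 + epsilon
Any noise distribution with mean zero and finite variance would satisfy the assumptions; the Gaussian here is just a convenient choice.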
Consider the following data: measurements of the location of an object taken every second. We can think of these as measurements of, for instance, a car's location from GPS or other location services, or a drone's location as reported by its sensors. In these cases we know there is some measurement noise in what we observe, so we need to analyze the data to find the true location of our target. Here are the observations below:
import pandas as pd

# Load the object location measurements (one observation per second)
location_data = pd.read_csv('object_movement_data.csv')
# Optionally index the data by time:
# location_data.set_index('time', inplace=True)
print(location_data.head(20))
Looking at this data what should we use as our predictor and what should be our response variable?
Using linear regression, the sort of model we would be interested in is of the form: $$\text{location_of_object}_i = \text{time}_i\cdot\beta_1 + \beta_0 + \epsilon_i $$ So using our tools from before, let's calculate these coefficients using OLS:
import statsmodels.api as sm
import numpy as np

# Add an explicit intercept column for OLS
location_data['intercept'] = np.ones(len(location_data))
# Use all but the last observation
X = location_data[['time', 'intercept']][:-1]
Y = location_data['location_of_object'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()
result = sm.OLS(Y, X).fit()
result.summary()
Let us evaluate our regression. Does this model seem like a good fit based on the summary output above?
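Besides reading the full summary table, we can pull out individual diagnostics from the fitted result object; a minimal sketch using standard statsmodels result attributes:
print(result.params)     # estimated coefficients (slope on time, then the intercept)
print(result.pvalues)    # p-values for each coefficient
print(result.rsquared)   # R^2 of the fit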
One way we can get more insights into our model is to graph our estimated model against the data. If we do so we obtain the following:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
# Fitted line using the coefficient estimates from the summary above
Y_hat = -1.9282*X[:,0] + 101.8905
plt.plot(X[:,0], Y, 'o', X[:,0], Y_hat, '-r')
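As an aside, rather than typing the estimated coefficients in by hand from the summary table, we could equivalently compute the fitted values directly from the fitted result with Y_hat = result.predict(X).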
Another useful diagnostic of our model is to plot the residuals, which should give us an estimate of our noise $\epsilon_i$. Let us look at the residual plot:
%matplotlib inline
plt.figure()
# Residuals: observed values minus fitted values
plt.plot(X[:,0], Y - Y_hat, 'o')
Does this residual plot satisfy our assumptions on the model? How can we improve our analysis?
One powerful aspect of linear regression is that models don't need to be linear in the data, only in the coefficients. This is a very flexible modeling tool since it allows us to express a wide variety of functions. For instance, we can consider a polynomial of the form: $$y_i = \sum\limits_{k=1}^N \beta_k\cdot x_i^k + \beta_0 + \epsilon_i $$
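As a quick illustration of what "linear in the coefficients" buys us, here is a minimal sketch of fitting such a polynomial with the same OLS machinery, reusing the simulated x and y from the sketch above (the degree N = 3 is arbitrary and chosen only for illustration):
N = 3
# Design matrix with columns x, x^2, ..., x^N and an intercept column
X_poly = np.column_stack([x**k for k in range(1, N + 1)] + [np.ones(len(x))])
# Still an ordinary least squares problem, because the model is linear in the beta_k
poly_fit = sm.OLS(y, X_poly).fit()
print(poly_fit.params)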
Can we use this to improve our own analysis? How should we transform our data?
Based on our plots, the data seems to be parabolic, so a model we could consider is of the form: $$\text{location_of_object}_i = \text{time}_i\cdot\beta_1 + \text{time}^2_i\cdot\beta_2 + \beta_0 + \epsilon_i $$
We can solve for these coefficients using the same OLS solver we used previously. This can be done by adding one more column to our data set, this time corresponding to $\text{time}^2$.
%matplotlib inline
# Add the squared-time feature and an intercept column
location_data['time_squared'] = location_data['time']**2
location_data['intercept'] = np.ones(len(location_data))
X = location_data[['time', 'time_squared', 'intercept']][:-1]
Y = location_data['location_of_object'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()
result = sm.OLS(Y, X).fit()
result.summary()
How do the results of this regression compare to our previous regression?
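One concrete way to answer this is to keep both fitted result objects around and compare their goodness-of-fit measures side by side. A minimal sketch, assuming we had saved the first fit as result_linear and this one as result_quadratic (hypothetical names, since above we simply reused the variable result):
# Hypothetical names: result_linear is the straight-line fit, result_quadratic the quadratic fit
for name, res in [('linear', result_linear), ('quadratic', result_quadratic)]:
    print(name, res.rsquared, res.rsquared_adj, res.aic)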
Now let's consider the residual plot and fit plot of our new regression model. How do these compare to our previous analysis?
%matplotlib inline
plt.figure()
# Fitted curve using the coefficient estimates from the quadratic model's summary
Y_hat = 9.7049*X[:,0] - 0.2424*X[:,1] + 10.7642
plt.plot(X[:,0], Y, 'o', X[:,0], Y_hat, '-r')
plt.figure()
# Residual plot for the quadratic model
plt.plot(X[:,0], Y - Y_hat, 'o')
Suppose we are interested in forecasting the demand for a certain product, given the following marketing data from previous, similar products: the price level and the amount of money spent on an advertising campaign.
marketing_data = pd.read_csv('marketing_data.csv')
print(marketing_data.head(20))
Looking at this data, which variables should we use as predictors and which variable should be our response?
As we saw in our previous example, visualizing the data with a scatter plot can be invaluable for determining what kind of model we want to fit. Let's consider the following two scatter plots of our data.
%matplotlib inline
# Sales vs. advertising spend
plt.figure()
plt.scatter(marketing_data['ad_buy'], marketing_data['sales'])
plt.xlabel('ad_buy')
plt.ylabel('sales')
# Sales vs. price
plt.figure()
plt.scatter(marketing_data['price'], marketing_data['sales'])
plt.xlabel('price')
plt.ylabel('sales')
The response looks somewhat correlated with each of these variables, so it seems like both predictors would be useful for estimating our response. Therefore, the model we want to estimate with OLS is of the form: $$\text{sales}_i = \beta_2 \cdot \text{ad_buy}_i + \beta_1 \cdot \text{price}_i + \beta_0 + \epsilon_i $$
Running our OLS we obtain the following regression results:
# Add an explicit intercept column for OLS
marketing_data['intercept'] = np.ones(len(marketing_data))
X = marketing_data[['ad_buy', 'price', 'intercept']][:-1]
Y = marketing_data['sales'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()
result = sm.OLS(Y, X).fit()
result.summary()
Looking at the regression results, what do you think is the quality of our fit? Is this surprising?
Interestingly enough the actual model used to generate this data is of the following form: $$\text{sales}_i = \beta_2 \cdot \text{ad_buy}_i + \beta_1 \cdot \ln(\text{price}_i) + \beta_0 + \epsilon_i $$
Like we did with the polynomial, we can again solve for these coefficients using OLS:
# Transform price to log-price and refit
marketing_data['log_price'] = np.log(marketing_data['price'])
X = marketing_data[['ad_buy', 'log_price', 'intercept']][:-1]
Y = marketing_data['sales'][:-1]
X = X.to_numpy()
Y = Y.to_numpy()
result = sm.OLS(Y, X).fit()
result.summary()
Compare these regression results to the previous model; do we have a significantly better fit? What conclusions can we then draw about linear regression?