IEOR Discussion 12: Null Hypothesis Testing for Distribution Assumptions

In this course we have considered various forms of point estimates, confidence intervals, and null hypothesis tests. However, whenever we performed our analysis we often assumed some form of underlying distribution (usually normal), and that samples were i.i.d. Just like assuming that the mean is $\mu_0$, these assumptions can also be viewed as null hypotheses, and in this discussion we will look at some examples of how we can test these sorts of hypotheses.

Ex1: Testing If Features Are Independent

Suppose we are forming a model which will attempt to predict the political preferences of prospective voters in a given city. For our model we collect information about which neighborhood each person lives in and which income bracket they fall into, from a sample of 100 individuals living in this city. Suppose the city has 4 major neighborhoods (labeled 0-3), and we can split the town's inhabitants into 6 income brackets (labeled 0-5). The data we collect from this survey is presented below:

In [1]:
import pandas as pd
election_data = pd.read_csv('election_data.csv')

election_data.iloc[:20]
Out[1]:
Income_Bracket Neighborhood
0 3 0
1 4 0
2 0 2
3 2 3
4 0 2
5 0 0
6 1 0
7 2 3
8 2 0
9 4 1
10 2 2
11 5 1
12 1 2
13 4 0
14 0 1
15 3 3
16 3 1
17 4 1
18 1 1
19 1 1

If we want to perform some sort of regression/classification analysis using these features, one key assumption we need is that they are in fact independent. So let's test this using a null hypothesis test. In general, the null hypothesis which we desire to analyze in this case is given by:

$$H_0 = \{\text{The features which we observe have independent distributions} \} $$

Note that in this case we have categorical data. In particular, the income bracket feature can take one of 6 values while the neighborhood feature can take one of 4 different values. Let us therefore define two discrete random variables $X_B,X_N$, which stand for the income bracket and neighborhood respectively. Then the pmf of each of these random variables is given by:

$$p_i(x) = \mathbb{P}(X_i = x) = p_{i,x} $$

For $i \in \{B,N\}$. We know that if $X_B$ is independent of $X_N$ then their joint distribution is simply the product of their pmfs. That is:

$$\mathbb{P}(X_B =x, X_N = y) = \mathbb{P}(X_B = x)\mathbb{P}(X_N = y) = p_{B,x}p_{N,y} $$

So using this fact allows us to restate our null hypothesis as:

$$H_0 = \{\mathbb{P}(X_B =x, X_N = y) = \mathbb{P}(X_B = x)\mathbb{P}(X_N = y) \} $$

Now that we have phrased our null hypothesis in terms of a probabilistic formulation we can construct a test statistic to evaluate it. First we note that for every realization of $X_B,X_N$ we can estimate $p_{i,x}$ using the following MLE/method of moments estimators:

$$\hat{p}_{B,x} = \sum_{j = 0}^3 \frac{n_{x,j}}{n} \qquad \hat{p}_{N,x} = \sum_{j = 0}^5 \frac{n_{j,x}}{n} $$

Here $n_{x,j}$ denotes the number of observations with income bracket $x$ and neighborhood $j$. Essentially, we can think of this part of the procedure as computing an estimate of the marginal distribution of each feature, since we are looking at the relative fraction of the time a particular value is observed while averaging out the effect of the other feature. Using these estimates we define the following test statistic:

$$T = \sum_{i = 0}^{5}\sum_{j = 0}^{3} \frac{n_{i,j}^2}{n\,\hat{p}_{B,i}\, \hat{p}_{N,j}} - n$$

This statistic compares the number of times we'd expect to see a particular pair of values under independence, $n\hat{p}_{B,i}\hat{p}_{N,j}$, with what we truly observed in our data; in fact $T$ is just an algebraic simplification of $\sum_{i,j}(n_{i,j} - n\hat{p}_{B,i}\hat{p}_{N,j})^2/(n\hat{p}_{B,i}\hat{p}_{N,j})$. One important property of this statistic is that it has a known (asymptotic) distribution: a $\chi^2$ with $(4-1)(6-1) = 15$ degrees of freedom. In general, if we are analyzing two categorical variables with $c_1$ and $c_2$ categories respectively, then the statistic which tests for their independence has $(c_1 - 1)(c_2 -1)$ degrees of freedom. Intuitively, the lost degrees of freedom come from having to estimate the intermediate marginal probabilities $\hat{p}$ from the data. Using this null hypothesis test let's see if neighborhood is independent of income bracket. First let's compute $\hat{p}_{B,x},\hat{p}_{N,x}$ and $n_{i,j}$

In [2]:
import numpy as np
data_array = election_data.values  # .as_matrix() is deprecated in newer pandas

# n_counts[i,j] = number of respondents in income bracket i and neighborhood j
n_counts = np.zeros((6,4))

for ind in range(100):
    current_vals = data_array[ind,:]
    n_counts[current_vals[0],current_vals[1]] += 1

# marginal (relative-frequency) estimates of p_hat_{N,j} and p_hat_{B,i}
marginal_neighborhood = np.sum(n_counts,0)/100.0
marginal_income = np.sum(n_counts,1)/100.0

print('the marginal probability estimates for the neighborhood:')
print(marginal_neighborhood)


print('the marginal probability estimates for the income bracket:')
print(marginal_income)
the marginal probability estimates for the neighborhood:
[ 0.3   0.21  0.3   0.19]
the marginal probability estimates for the income bracket:
[ 0.12  0.18  0.25  0.14  0.21  0.1 ]
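
As an aside, the same count table $n_{i,j}$ can be built directly with pandas. The following is a minimal sketch (not part of the original analysis) using pd.crosstab and the column names shown above:

# cross-tabulate income bracket (rows 0-5) against neighborhood (columns 0-3)
count_table = pd.crosstab(election_data['Income_Bracket'], election_data['Neighborhood'])
print(count_table.values)  # should match n_counts computed above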

Now let's compute our test statistic $T$:

In [3]:
t_stat = 0

# add n_{i,j}^2 / (n * p_hat_{B,i} * p_hat_{N,j}) for each of the 24 cells
for ind_i in range(6):
    for ind_j in range(4):
        t_stat += (float(n_counts[ind_i,ind_j])**2)/(100.0*marginal_neighborhood[ind_j]*marginal_income[ind_i])
        
t_stat = t_stat - 100

print('our test statistic is: '+str(t_stat))
our test statistic is: 37.9416000318

All that is left to do is compute the $p$-value of our $\chi^2$ test. This is achieved by using the formula:

$$p = \mathbb{P}(\chi^2(15) \geq T) $$

So, using a $\chi^2$ table lookup (or scipy's $\chi^2$ distribution functions), we can compute the following:

In [4]:
from scipy.stats import chi2

# the p-value is the upper-tail probability P(chi2(15) >= T), i.e. 1 - cdf (not the pdf)
p = 1 - chi2.cdf(t_stat,15)

print('the p-value for our test is given by: '+str(round(p,5)))
the p-value for our test is given by: 0.00092
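
As a cross-check, scipy provides a built-in independence test that computes the same $\sum(\text{observed}-\text{expected})^2/\text{expected}$ statistic. The sketch below (not part of the original analysis) applies scipy.stats.chi2_contingency to the count table; it should reproduce our statistic, degrees of freedom, and p-value up to floating-point error:

from scipy.stats import chi2_contingency

# returns the chi^2 statistic, p-value, degrees of freedom, and expected counts under independence
chi2_val, p_val, dof, expected = chi2_contingency(n_counts)
print('statistic: '+str(chi2_val)+', p-value: '+str(p_val)+', dof: '+str(dof))  # expect roughly 37.94, 0.0009, 15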

Based on these results, is it reasonable to assume that these two features are independent of each other?

Ex2: Kolmogorov-Smirnov Tests for Distributions

When we are considering non-categorical data we need a different approach for testing whether or not our data fits a certain probabilistic model. Usually, we will be concerned with checking whether our data is actually normally distributed. We can pose this question as a different type of hypothesis test. Let's consider our example from nonparametric distribution estimation. Recall the following set of measurements with unknown continuous distribution:

In [5]:
data = [4.45559267,   6.74293422, -20.14864735,   3.49075864,   1.3224875, 
  -0.0674377,    2.03760975,   3.89168089,   4.30622691,   5.24243778, 
   4.47118085,   6.38819759,   2.31717906,   9.23465033,  -3.71355397, 
   6.25079944,  4.45762639,   5.3745804,    1.18938308,   2.22251412, 
   7.76006157,  13.27124009,   3.59884813,   6.45666895,   9.19240938, 
   9.67072606,  -0.31431053,  -2.64892822,   1.76057693,   9.23526739, 
   0.12167676,   4.48484103,  12.4229351,   5.2058985,    6.45232824, 
   3.61880038,   6.40043509,   8.31918962,  -4.92504348,   7.08117382, 
  16.41249085,   7.05750976,   3.2652978,    7.59222402,   0.26693817, 
   4.66984232,  10.09794026,   3.40297516,   3.34271601,   0.95943834]

Before starting to analyze this data we may want to check if it is truly normally distributed. This will form the basis of our null hypothesis, which we can state as:

$$ H_0 = \{ \text{the observed data follows a normal distribution} \}$$

In particular, let's suppose that the normal distribution which generated our data has mean equal to our sample average and variance equal to our sample variance. That is, the data we observed is i.i.d. with distribution $\mathcal{N}(\bar{X},s^2)$. Since the question is about the distribution itself, we cannot use our parametric location tests to test this hypothesis; instead we need to consider a different feature of the data. As we observed in distribution estimation, one useful statistic of the data is the empirical cdf, which we can calculate as:

$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{x_i \leq x \} $$

Using this definition we previously computed the empirical cdf as follows:

In [6]:
import matplotlib.pyplot as plt
%matplotlib inline
data = np.array(data)

# evaluate the empirical cdf on a grid of 100 points between -30 and 30
emperical_cdf_x = np.linspace(-30,30,num=100)
emperical_cdf = np.zeros(100)

for ind in range(100):
    val = emperical_cdf_x[ind]
    # fraction of the 50 observations that are <= val
    for observation in data:
        if observation <= val:
            emperical_cdf[ind] += 1.0/50.0

plt.figure()
plt.plot(emperical_cdf_x,emperical_cdf)
Out[6]:
[<matplotlib.lines.Line2D at 0xb1e4b00>]
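
As an aside, the same empirical cdf values can be computed without the double loop. Here is a minimal vectorized sketch (not part of the original notebook) using np.searchsorted:

# for each grid point, count how many observations are <= that point, then normalize by n = 50
sorted_data = np.sort(data)
emp_cdf_vectorized = np.searchsorted(sorted_data, emperical_cdf_x, side='right')/50.0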

If our data were normally distributed with our stated mean and variance, then this curve should look similar to the cdf curve of our supposed model. We can plot both of these on the same set of axes to see if this is the case:

In [7]:
from scipy.stats import norm

sample_average = np.mean(data)
sample_variance = np.var(data)

# cdf of the hypothesized N(sample mean, sample variance) evaluated on the same grid
null_cdf = []
for ind in range(100):
    null_cdf.append(norm.cdf(emperical_cdf_x[ind],loc=sample_average,scale=np.sqrt(sample_variance)))

plt.figure()
plt.plot(emperical_cdf_x,emperical_cdf,'-.',emperical_cdf_x,null_cdf,'r--')
Out[7]:
[<matplotlib.lines.Line2D at 0xb28c588>,
 <matplotlib.lines.Line2D at 0xb28c748>]

Looking at these curves it seems that our data could be normally distributed, but it is unclear. Somehow we need to address the fact that we aren't comparing single points, but entire functions. We can, however, use this as a starting point for forming our test statistic. Let's define our test statistic $T$ as the largest distance between these two curves. In mathematical notation:

$$T = \sup_x |F(x) - \hat{F}(x)| $$

Here $F$ corresponds to the normal cdf and $\hat{F}$ is the empirical cdf. Intuitively, this is a useful measurement of how close the distributions are: if the largest separation between the two curves is small, then it is plausible that they are actually the same distribution. It has been shown that, for large $n$, the scaled statistic $\sqrt{n}T$ has a known distribution, called the Kolmogorov distribution, and this forms the basis of the Kolmogorov-Smirnov hypothesis test. Much like the pivots we formed for parametric tests, the Kolmogorov distribution does not depend on the underlying distribution of the null hypothesis. Using this fact we can compute the following $p$-value for this test:

$$p = \mathbb{P}(K \geq \sqrt{n}\,T) $$

Here $K$ follows the Kolmogorov distribution. In code, we can approximate this $p$-value using scipy's ksone distribution, which gives the finite-sample distribution of the (one-sided) KS statistic for a given sample size $n$:

In [8]:
from scipy.stats import ksone
null_cdf = np.array(null_cdf)

# largest absolute gap between the hypothesized cdf and the empirical cdf on the grid
differences = np.abs(null_cdf  - emperical_cdf)
kol_stat = np.amax(differences)

# the p-value is the upper-tail probability, i.e. 1 - cdf (not the cdf itself)
p = 1 - ksone.cdf(kol_stat,50.0)
print('the p-value of the kolmogorov test is: '+str(round(p,4)))
the p-value of the kolmogorov test is: 0.2754
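
For comparison, scipy also offers a built-in two-sided KS test. The sketch below (not part of the original analysis) uses scipy.stats.kstest, which evaluates the supremum over the data points exactly rather than on a grid, so its statistic and p-value will differ somewhat from the grid-based values above:

from scipy.stats import kstest

# two-sided KS test of the data against N(sample mean, sample standard deviation)
ks_stat, ks_p = kstest(data, 'norm', args=(sample_average, np.sqrt(sample_variance)))
print('KS statistic: '+str(ks_stat)+', p-value: '+str(ks_p))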

Evaluating these results, can we say our data is normally distributed?