So far in the course we have discussed problems of estimation and detection, in which we try to identify a model using a set of data. However, outside of cross-validation heuristics, we have yet to describe how to evaluate these models and how to use them for decision making. The area of statistics which will help us with these tasks is known as hypothesis testing. In this discussion we will consider null hypothesis testing, which is a testing paradigm that can be summarized as:
1. State a null hypothesis $H_0$, a base assumption about the system that we will try to invalidate.
2. Construct a test statistic, a function of the data whose distribution under $H_0$ we can analyze.
3. Compute the $p$-value of the test: the probability, under $H_0$, of observing data as or more extreme than what we measured.
4. Make a decision: if the $p$-value is small enough we reject $H_0$; otherwise we fail to reject it.
It is important to remember that while step 4 implies we are in practice making some decision about what is true and what isn't, the framework of hypothesis testing cannot mathematically show us what is true, or the likelihood of $H_0$ being true. Instead, this procedure attempts to invalidate some base assumption about the system, given how likely the observed data would be under $H_0$. In this discussion we will consider a few examples of this procedure.
Suppose we are running a clinical trial to test the effectiveness of some new drug. At the beginning of the trial we give 100 sick participants the drug, and after a few weeks of administering the treatment we check up on the patients. Patients who still show symptoms of the disease are marked '0' (for failure to treat), and patients who are asymptomatic at the check-up are assigned the value '1' (for successful treatment). Let's use null hypothesis testing to analyze this experiment.
Before looking at our data we need to state which assumption about the system we want to invalidate (our null hypothesis). What is a reasonable null hypothesis for this setting?
Since we are considering the efficacy of the treatment, a reasonable null hypothesis to consider would be: $$ H_0 := \{\text{The treatment is no more effective than recovery by random chance}\}$$
Note that to test this hypothesis we need some information about the chance an untreated patient generally has of recovering from the disease. In the real world this information is obtained by including a control group of untreated patients in the experiment, a more complex experimental design which we will consider later in the course. For now, let us suppose that for this particular disease it is known that patients generally have a 30% chance of becoming asymptomatic without any treatment.
Using these ideas, our model for the system can be viewed as 100 i.i.d. random variables $X_i \sim \text{Bernoulli}(p)$, where our null hypothesis corresponds to: $$H_0: p = 0.3$$
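Before looking at the real trial data, a minimal simulation sketch can show the kind of outcome counts $H_0$ produces (the seed and variable names here are illustrative, not part of the trial):
import numpy as np

np.random.seed(0)  # illustrative seed, not part of the trial
# simulate one hypothetical trial of 100 patients under H_0 (p = 0.3)
simulated_outcomes = np.random.binomial(n=1, p=0.3, size=100)
print('Simulated number of successes under H_0: ' + str(simulated_outcomes.sum()))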
Now let us examine the data collected from the trial:
import pandas as pd

# load the recorded trial outcomes and preview the first 20 patients
clinical_trial_data = pd.read_csv('clinical_trial_data.csv')
clinical_trial_data.iloc[:20, :]
Using this data we will now compute how likely it was to be generated by a system which follows our null hypothesis. Note we have 100 i.i.d. Bernoulli trials, and our null hypothesis states that these should have a success probability of 30%. Under $H_0$ the total number of successfully treated patients $S_{100}$ therefore has a binomial distribution, and the pmf of $S_{100}$ is given by: $$\Pr(S_{100} = s) = \binom{100}{s}(0.3)^s(0.7)^{100-s} $$
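To get a feel for this distribution, here is a quick sketch using scipy.stats.binom, which implements this pmf (the values of $s$ checked below are arbitrary):
from scipy.stats import binom

# evaluate the pmf of S_100 under H_0 (Binomial with n = 100, p = 0.3)
for s in [20, 30, 40, 60]:
    print('Pr(S_100 = ' + str(s) + ') = ' + str(binom.pmf(s, 100, 0.3)))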
Since $S_{100}$ is a function of our data which we are using to determine the outcome of the test, we call it the test statistic. In general we will construct these statistics so that we can analyze them reliably. First let's compute the value of the test statistic:
# total number of '1' outcomes, i.e. the value of our test statistic S_100
total_success = clinical_trial_data['Trial_Results'].sum()
print('The total number of successfully treated patients is ' + str(total_success))
Now we will compute the $p$-value of the test: the likelihood that we would observe an outcome as or more extreme than 60 successes under $H_0$. Using our distribution for $S_{100}$ we see that the $p$-value will be: $$p = \sum_{s=60}^{100} \binom{100}{s}(0.3)^s(0.7)^{100-s} $$
Below we have the code to compute the $p$-value:
from scipy.special import comb  # scipy.misc.comb was removed in newer scipy versions

# sum the binomial pmf over the upper tail s = 60, ..., 100
p = 0.0
for s in range(60, 101):
    p += float(comb(100, s)) * (0.3**s) * (0.7**(100 - s))
print('The p-value is: ' + str(p))
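As a sanity check, the same tail probability can be read off directly from scipy's binomial survival function; since the distribution is discrete, sf(59) gives $\Pr(S_{100} \geq 60)$:
from scipy.stats import binom

# Pr(S_100 >= 60) = 1 - Pr(S_100 <= 59), computed directly by the survival function
p_check = binom.sf(59, 100, 0.3)
print('The p-value via binom.sf is: ' + str(p_check))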
We observe that the $p$-value is extremely low for this test. What sort of decision should we take regarding the data given this information? (accept $H_0$, reject $H_0$, or collect more data?)
Suppose we are interested in intercepting a radio transmission being sent by an enemy spy, using some sort of listening device that looks for changes in the amplitude of the waves. From previous calibration we know that the measurement error in our instrument, together with background radiation activity in our vicinity, is roughly distributed $\mathcal{N}(0,1)$, and we know that if someone is transmitting something there should be some deviation from this. Let's construct a null hypothesis test to determine if a message has been transmitted. First, what should our null hypothesis be?
A potentially good candidate for the null hypothesis would be: $$H_0 := \{\text{There is no signal being transmitted}\} $$
Since from our previous measurements we've ascertained that the background noise should be $\mathcal{N}(0,1)$, this is the same as saying our readings are $X_i \sim \mathcal{N}(\mu,1)$, and we can write $H_0$ as: $$H_0 := \{\mu = 0\}$$
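As before, a minimal simulation sketch (seed and names illustrative) shows what a pure-noise listening session under $H_0$ would look like:
import numpy as np

np.random.seed(0)  # illustrative seed
# simulate one hypothetical listening session of 30 readings under H_0 (pure noise)
noise_readings = np.random.normal(loc=0.0, scale=1.0, size=30)
print('Simulated sample average under H_0: ' + str(noise_readings.mean()))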
Suppose we've taken $30$ readings over the course of our listening session and obtained the following data:
import pandas as pd

# load the 30 listening-session readings and preview the first 20
radio_data = pd.read_csv('radio_readings.csv')
radio_data.iloc[:20, :]
Our next task is to construct a test statistic and calculate a $p$-value for our test. Since our model has i.i.d. normally distributed variables, we know that under $H_0$ the sample average is also normal, $\bar{X} \sim \mathcal{N}(0,\frac{1}{30})$, so we will use it as our test statistic. Now let's compute the sample average:
# sample average of the 30 readings, our test statistic
x_bar = radio_data['Readings'].mean()
print('The sample average is: ' + str(x_bar))
Using this average we will now compute the $p$-value of our test. Note that since $\sqrt{30}\,\bar{X}$ is a standard normal random variable under $H_0$, the $p$-value will be given by: $$ p = 2\left(1 - \Phi\left(\sqrt{30}\,|\bar{X}|\right)\right)$$
where $\Phi(\cdot)$ is the cdf of the standard normal distribution. The reason we multiply by 2 is that the normal distribution is symmetric, and we are interested in checking for deviations both above and below zero. This is what is known as a two-tailed test; checking for deviations in only one direction, as we did in the previous example, is known as a one-tailed test. So let's compute our $p$-value:
from scipy.stats import norm
import numpy as np

# two-tailed p-value; the absolute value handles deviations in either direction
p = 2.0 * (1.0 - norm.cdf(np.abs(x_bar) * np.sqrt(30.0)))
print('the p-value of the test is ' + str(p))
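Equivalently, a sketch using scipy's survival function norm.sf(z) $= 1 - \Phi(z)$, which is numerically more reliable for very small tail probabilities:
from scipy.stats import norm
import numpy as np

# same two-tailed p-value, computed directly from the upper tail
p_check = 2.0 * norm.sf(np.abs(x_bar) * np.sqrt(30.0))
print('the p-value via norm.sf is ' + str(p_check))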
Based on the $p$-value of this test, what sort of decision should we take regarding the data given this information? (accept $H_0$, reject $H_0$, or collect more data?)
Suppose we are measuring the service times of customers using our online store (in minutes), and we are interested in seeing if adding a new feature to our website changes user behavior. Prior to adding this feature we know that on average a user took 1 minute to complete a purchase, and we want to make sure our new design does not cause users to spend significantly more time on our site. We will use null hypothesis testing to examine this problem. In this particular case, what should our null hypothesis be?
A relevant choice of null hypothesis is one of the form: $$H_0 := \{\text{The mean service time is 1 minute} \} $$
Since we are measuring service times, one common model for this type of data is i.i.d. Exponential random variables $X_i \sim \text{Exp}(\lambda)$, since this implies a Poisson arrival process, which has many other useful properties. Because an $\text{Exp}(\lambda)$ variable has mean $\frac{1}{\lambda}$, we can state $H_0$ as: $$H_0 := \{\lambda = 1 \} $$
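Again, a small simulation sketch (seed and names illustrative) of what 25 service times under $H_0$ would look like:
import numpy as np

np.random.seed(0)  # illustrative seed
# simulate 25 hypothetical service times under H_0 (Exp(1), i.e. mean 1 minute)
simulated_times = np.random.exponential(scale=1.0, size=25)
print('Simulated total service time under H_0: ' + str(simulated_times.sum()))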
Suppose we have the following readings from 25 visitors to our website:
import pandas as pd

# load the recorded service times and preview the first 20 visitors
site_arrivals = pd.read_csv('visit_data.csv')
site_arrivals.iloc[:20, :]
Next we will calculate a test statistic and our $p$-value for the test. Since under our null hypothesis we have i.i.d. exponential random variables, we know that the sum of these values $S_{25}$ has an Erlang$(1,25)$ distribution (rate $1$, shape $25$) with the following pdf: $$f(s) = \frac{s^{24}e^{-s}}{24!} $$
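Since an Erlang with integer shape is a gamma distribution, scipy.stats.gamma with shape 25 (and the default rate 1) implements this pdf; here is a quick check of the formula above at the arbitrary point $s = 25$:
from scipy.stats import gamma
from math import exp, factorial

# compare the pdf formula above to scipy's gamma pdf with shape 25 (rate 1)
s0 = 25.0
print('formula: ' + str(s0**24 * exp(-s0) / factorial(24)))
print('scipy:   ' + str(gamma.pdf(s0, 25)))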
So using this fact we let $S_{25}$ be our test statistic, and we compute it as follows:
# total service time across the 25 visitors, our test statistic S_25
s = site_arrivals['Service_Times'].sum()
print("The sum of the service times is: " + str(s))
Since we are interested in checking for values higher than this, we can perform a one-tailed test and calculate our $p$-value as:
$$p = 1- \frac{\gamma(25,S_{25})}{24!}$$where $\gamma(\cdot,\cdot)$ is the lower incomplete gamma function, so the second term corresponds to the cdf of an Erlang$(1,25)$ random variable evaluated at $S_{25}$. Using python code we obtain:
from scipy.stats import gamma

# cdf of a gamma/Erlang with shape 25 (and rate 1) evaluated at s;
# note the argument order is gamma.cdf(x, a), i.e. the point first, then the shape
p = 1 - gamma.cdf(s, 25)
print('The p-value for our test is: ' + str(p))
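As a cross-check, the same quantity can be computed from the regularized lower incomplete gamma function scipy.special.gammainc(a, x) $= \gamma(a,x)/\Gamma(a)$, matching the formula above:
from scipy.special import gammainc

# gammainc(a, x) is the regularized lower incomplete gamma, gamma(a, x) / (a-1)!
p_check = 1.0 - gammainc(25, s)
print('The p-value via gammainc is: ' + str(p_check))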
Based on the $p$-value of this test, what sort of decision should we take regarding the data given this information? (accept $H_0$, reject $H_0$, or collect more data?)