IEOR Discussion 9: One Sample Z and T Tests

Last time in discussion we considered the basics of null hypothesis testing and how we can use these methods to gain insights about our models. In this discussion we will look at a few more examples of hypothesis testing, specifically Z and T tests. Recall that when we do hypothesis testing we generally follow these steps:

  1. Make a base assumption about the system (this is known as the null hypothesis, sometimes written as $H_0$)
  2. Collect and measure data from the system
  3. Compute the probability (likelihood) of observing the measurements, or more extreme values, under the null hypothesis (this is known as the $p$-value)
  4. Make a decision: either accept the null hypothesis, reject the null hypothesis, or collect additional data (based on some significance level $\alpha$; see the sketch below)

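As a concrete illustration of step 4, here is a minimal sketch of the decision rule, assuming the $p$-value has already been computed in steps 2-3 (the names p and alpha below are hypothetical):

In [ ]:
# Minimal sketch of the decision step; p and alpha are hypothetical names
alpha = 0.05  # significance level chosen before looking at the data
p = 0.03      # a hypothetical p-value computed in steps 2-3

if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis (accept H0 or collect more data)')
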
In the examples we saw previously and in lecture we noted that the way we state our hypothesis depends on the distribution model of our data as well as the question we want to analyze. In this discussion we will consider what are known as location tests, or hypothesis tests where the question of interest is not about the variance of the data but rather about the mean or median.

Ex1: One Sample Z Test

Suppose we are interested in analyzing the effectiveness of an advertising campaign in terms of attracting users to our web service. Over the course of two weeks we collected the following data on user traffic to our site:

In [1]:
import pandas as pd

# Load the daily visitor counts collected over the two-week campaign
web_traffic_data = pd.read_csv('traffic_data.csv')

web_traffic_data
Out[1]:
Number_of_Visitors
0 1520
1 1700
2 1102
3 1777
4 1333
5 866
6 1241
7 1739
8 1633
9 1041
10 1747
11 1465
12 1670
13 1633

What we are interested in knowing is whether our advertising campaign has increased the number of users coming to our website. Suppose we know that prior to our advertising campaign we had on average 2000 visitors use our service each day, and that the standard deviation was around 400 visitors. What kind of test should we perform and what should our null hypothesis be?

We note that since we are only interested in finding out whether the number of visitors has increased, we want to conduct a one-sided test. Moreover, we can write the null hypothesis as:

$$H_0 := \{\text{The average number of visitors coming to our website is no more than 2000}\} $$

Rejecting this null hypothesis would indicate that our campaign had a positive effect on user traffic. Note that failing to reject this null would not mean that the advertising campaign had adverse effects on traffic; to determine that we would need to conduct a separate analysis. Now we need to form a test statistic which we can use to accept or reject the null. Since we are interested in analyzing the average number of visitors, the arithmetic mean of our data seems like a natural choice.

In [2]:
x_bar = web_traffic_data['Number_of_Visitors'].mean()
print('The sample average is: ' + str(round(x_bar,2)))
The sample average is: 1461.93

Now using the sample average we want to analyze how likely it is that we observe this data. To do this we first need to stipulate some additional assumptions about how our data was obtained. One assumption that seems reasonable is that the number of visitors on each day is independent of the number of visitors on any other day and that the distribution remains the same. This essentially means that we have 14 i.i.d. observations with some unknown mean and known variance. Note that by adding this assumption and applying the Central Limit Theorem we know that our sample average $\bar{X}_{14}$ is roughly normally distributed. Moreover, since the variance is known, we can approximate this distribution well with: $$\bar{X}_{14} \sim \mathcal{N}(\mu, 11428.57) $$

Here we computed the variance as $\sigma_{14}^2 = \frac{400^2}{14} \approx 11428.57$. Using this model we can restate our null hypothesis as: $$H_0 = \{\mu \leq 2000 \} $$
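As a quick sanity check of this arithmetic (not part of the original analysis), we can compute the variance of the sample mean directly:

In [ ]:
n = len(web_traffic_data)  # 14 daily observations
sigma = 400.0              # known population standard deviation
print(sigma**2 / n)        # variance of the sample mean, approximately 11428.57
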

Since we have a normal distribution with a known variance and unknown mean we can use what is known as a Z-test. The name comes from the common notation for a standard normal random variable (i.e. $\mathcal{N}(0,1)$), the letter $Z$. If our sample average $\bar{X} \sim \mathcal{N}(\mu,\sigma^2)$ for some unknown $\mu$, we can always transform it into a standard normal variable using the transformation: $$ Z = \frac{\bar{X} - \mu}{\sigma} $$
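To build some intuition for why this works, here is a quick simulation (not part of the original discussion, using numpy's default_rng, available in NumPy 1.17+) checking that the pivot is standard normal no matter what the true mean is; the value of mu below is arbitrary:

In [ ]:
import numpy as np
rng = np.random.default_rng(0)
mu = 1234.0                            # arbitrary "true" mean
sigma = np.sqrt(11428.57)              # standard deviation of the sample mean
xbars = rng.normal(mu, sigma, 100000)  # simulated sample means
z = (xbars - mu) / sigma               # apply the pivot transformation
print(z.mean(), z.std())               # approximately 0 and 1
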

Standardizing in this way is an extremely powerful technique since it allows us to take a statistic with an unknown distribution and turn it into a value with a known distribution. Since $Z$ is a function of both the data and the unknown parameter $\mu$ we call it a pivot; you will sometimes hear it referred to as a statistic, though this is not technically correct. Now we can use this to compute our $p$-value. Since we are interested in a one-sided test we want to compute:

$$p = \mathbb{P}\big(Z \geq \frac{\bar{X}_{14} - 2000}{\sqrt{11428.57}}\big)$$

In code we can do this as follows:

In [3]:
from scipy.stats import norm
import numpy as np

# Standardize the observed sample mean and compute P(Z >= z) under H0
p = 1.0 - norm.cdf((x_bar - 2000) / np.sqrt(11428.57))

print('the p-value of the test is ' + str(p))
the p-value of the test is 0.999999758817
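As a side note, scipy also exposes the normal survival function norm.sf, which computes $1 - \Phi(x)$ directly and is more numerically stable in the tails; the following should produce the same p-value:

In [ ]:
# Equivalent computation using the survival function sf(x) = 1 - cdf(x)
p = norm.sf((x_bar - 2000) / np.sqrt(11428.57))
print('the p-value of the test is ' + str(p))
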

Based on these results, what decision should we take with respect to our data? (accept $H_0$, reject $H_0$, collect more data?)

Ex2: One Sample T Test

Suppose we are conducting a medical trial to analyze the efficacy of a new fitness tracker which is supposed to increase physical activity. For our experiment we take 30 participants who wear the tracker, and we measure each participant's average daily active time, in minutes, over the last 2 weeks of the trial. We have this data below:

In [4]:
# Load each participant's average daily active time (in minutes)
activity_data = pd.read_csv('activity_data.csv')

# Display the first 20 of the 30 observations
activity_data.iloc[:20]
Out[4]:
Active_Time_min
0 13.75
1 19.16
2 12.04
3 44.60
4 6.90
5 7.37
6 27.54
7 1.32
8 4.13
9 6.36
10 28.27
11 54.38
12 20.62
13 3.23
14 28.09
15 11.06
16 19.71
17 37.63
18 8.78
19 20.14

We know that a person in the general population typically averages 10 minutes of active time a day. Using this knowledge, what is a reasonable null hypothesis to use for analyzing this data set?

Since we want to show that our fitness tracker increases the amount of daily active time of our users, one reasonable null hypothesis to consider is: $$H_0 = \{\text{the average daily active time of users is no more than 10 minutes}\} $$

Rejecting this null would mean that our treatment is an improvement over the average, while failing to reject this null may mean that we have no added benefit or perhaps even an adverse effect. As before, since we are interested in the average time, a useful test statistic is the sample average. We can compute it again using this code:

In [5]:
x_bar = activity_data['Active_Time_min'].mean()
print('The sample average is: ' + str(round(x_bar,2)))
The sample average is: 16.39

As before, we can use our knowledge of the Central Limit Theorem to note that this sample average should be roughly normally distributed, that is $\bar{X} \sim \mathcal{N}(\mu,\sigma^2)$. So, assuming all the individual participants' exercise patterns are independent of each other, we can restate our null hypothesis as:

$$ H_0 = \{\mu \leq 10\} $$

However, unlike before, we do not know a priori what the variance of our data should be. We can estimate this quantity using the sample variance $s^2$ as follows:

In [7]:
s_squared = activity_data['Active_Time_min'].var()
print('The sample variance is: ' + str(round(s_squared,2)))
The sample variance is: 166.23
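One implementation detail worth noting: pandas' .var() divides by $n-1$ by default (ddof=1), which is exactly the sample variance $s^2$ we want here, while numpy's np.var defaults to dividing by $n$ (ddof=0), so the two must be matched explicitly:

In [ ]:
import numpy as np
x = activity_data['Active_Time_min']
print(x.var())            # pandas default: ddof=1, divides by n-1
print(np.var(x, ddof=1))  # matches pandas
print(np.var(x))          # numpy default: ddof=0, divides by n
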

Estimating the variance, however, introduces a problem. Before, we knew our variance, which made $\sigma^2$ a known constant, so we could be confident that the pivot $Z = \frac{\bar{X}-\mu}{\sigma}$ was a standard normal variable. This holds because the only random variable on the right-hand side is $\bar{X}$, which we know has a normal distribution, and normality is preserved under linear transformations. Now, though, we would need to consider $\frac{\bar{X}-\mu}{s}$, where one random variable divides another. The result is not normally distributed, since division by a random variable is not a linear transformation. However, this sort of transformation has a known distribution, called the Student's t-distribution. In general, if we have the sample mean $\bar{X}$ and sample variance $s^2$ of $n$ data points, the pivot given by: $$ T = \frac{\bar{X}-\mu}{s/\sqrt{n}}$$

follows this distribution. Note again that we have transformed two random variables with unknown distributions into a random variable with a well known distribution that we can use to analyze our data. Following the naming convention of hypothesis tests, a test where we are interested in analyzing the mean of our data and do not know the variance is known as a T-test. Much like in the case of the Z-test, many people refer to the analysis object as the T-statistic, but it is technically a pivot and not a statistic.

In general the t-distribution has only one parameter, denoted $\nu$. This parameter is called the degrees of freedom of the distribution, which we can think of as the number of values that can vary freely when we calculate the test statistic. For instance, consider only the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$: since the mean can be anything, there are no constraints on the $X_i$, so we have $n$ freely moving values and thus $n$ degrees of freedom. However, when we calculate the sample variance $s^2$ we are considering the deviations in our data, which we can write as $Y_i = X_i - \bar{X}$. These values are constrained, since $\sum_{i=1}^n Y_i = 0$, so if someone told us the values of $n-1$ of these variables we would automatically know the value of the remaining $Y_i$. Hence only $n-1$ values are allowed to vary freely during the variance computation, so we only have $n-1$ degrees of freedom. A generally useful rule of thumb is that the degrees of freedom of a statistic equal the number of independent data points minus the number of intermediate estimated parameters. Since the T pivot requires us to use the variance estimate, with $n$ data points the resulting t-distribution will have $n-1$ degrees of freedom.
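We can also see the constraint on the deviations directly in our data (a quick check, not part of the original analysis): the deviations from the sample mean sum to zero up to floating point error, so the last deviation is always determined by the other $n-1$:

In [ ]:
deviations = activity_data['Active_Time_min'] - x_bar  # Y_i = X_i - X_bar
print(deviations.sum())  # approximately 0, up to floating point error
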

So now we can proceed to compute our $p$-value. Since we are conducting a one-sided test this will be given by:

$$p = \mathbb{P}\big(T \geq \sqrt{30}\frac{16.39-10}{\sqrt{166.23}} \big)$$

where $T$ has a Student's t-distribution with 29 degrees of freedom. We can compute this value using the following code:

In [8]:
from scipy.stats import t
import numpy as np

# Compute the T pivot under H0 (mu = 10) and the one-sided p-value with 29 degrees of freedom
p = 1.0 - t.cdf(np.sqrt(30) * ((x_bar - 10) / np.sqrt(166.23)), 29)

print('the p-value of the test is ' + str(p))
the p-value of the test is 0.00554523565435
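In practice we rarely assemble the T pivot by hand; scipy wraps the whole calculation in scipy.stats.ttest_1samp. A sketch of the equivalent call (assuming SciPy 1.6 or later, where the alternative keyword for one-sided tests is available):

In [ ]:
from scipy.stats import ttest_1samp
# One-sided test of H0: mu <= 10 against the alternative mu > 10
result = ttest_1samp(activity_data['Active_Time_min'], popmean=10, alternative='greater')
print(result.statistic, result.pvalue)  # should match the values computed above
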

Using a significance level of 0.05, what sort of decision should we take with respect to our test? (accept $H_0$, reject $H_0$, collect more data?)