IEOR 165 Discussion 10: 2-Sample Tests

Last discussion we considered conducting 1-sample Z and T tests (depending on our model assumptions). Recall that in a 1-sample test we collect samples from a single population and use a null hypothesis testing framework to discern whether our assumptions about this population are plausible. Recall that our experimental framework is:

  1. Make a base assumption about the system (this is known as the null hypothesis, sometimes written as $H_0$)
  2. Collect and measure data from the system
  3. Compute the probability (likelihood) of observing the measurements, or more extreme values, under the null hypothesis (this is known as the $p$-value)
  4. Make a decision to either accept the null hypothesis, reject the null hypothesis, or collect additional data (based on some significance level $\alpha$)
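
As a quick refresher, here is a minimal sketch of this four-step loop in Python, using a 1-sample Z-test on simulated data; the null mean, known standard deviation, and sample below are purely illustrative assumptions, not values from any data set in this discussion.

In [ ]:
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Step 1: H0 = {the population mean is at most mu_0}; mu_0 and sigma are illustrative
mu_0, sigma, alpha = 5.0, 2.0, 0.05

# Step 2: collect data (simulated here, since this is only a sketch)
data = rng.normal(loc=5.5, scale=sigma, size=50)

# Step 3: standardize the sample mean and compute the one-sided p-value under H0
z = np.sqrt(len(data)) * (data.mean() - mu_0) / sigma
p = 1 - norm.cdf(z)

# Step 4: compare against the significance level alpha
print('reject H0' if p < alpha else 'accept H0 (or collect additional data)')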

In this discussion we will look at how to apply this framework to comparisons between two populations, known as 2-sample tests.

Ex1: 2-Sample Testing

2-Sample testing is a very powerful tool which we can use to compare either two different populations or the same population at different points in time. Let's consider when we would use these analyses by examining a clinical trial. Suppose we have 60 participants in a clinical trial to improve physical activity using a new fitness tracker. 30 of them were in a treatment group which received the new tracker, and the other 30 were part of a control group given the previous model. Each group was first measured for two weeks at the beginning of the trial in a "run-in" period to obtain a baseline, and then measured again six months later for two weeks to see the effect of the different treatments. In this scenario we are interested in two different effects: can the new tracker significantly increase the level of exercise, and is it more effective than the previous model? To answer both of these questions we consider the following data set:

In [2]:
import pandas as pd
trial_data = pd.read_csv('trial_data.csv')
trial_data.set_index('Patient_Number', inplace=True)
trial_data.iloc[:20]
Out[2]:
Control_Base Control_End Treatment_Base Treatment_End
Patient_Number
0 15.96 23.73 13.75 18.71
1 53.47 26.49 19.16 23.02
2 16.52 21.06 12.04 24.41
3 21.69 39.10 44.60 29.89
4 25.56 8.07 6.90 5.60
5 40.39 33.61 7.37 25.69
6 27.53 23.71 27.54 14.83
7 7.34 14.74 1.32 3.41
8 20.00 35.60 4.13 32.86
9 28.14 17.15 6.36 36.03
10 15.30 35.42 28.27 15.20
11 31.57 23.79 54.38 37.64
12 8.02 1.00 20.62 19.28
13 45.97 24.50 3.23 26.00
14 42.02 13.44 28.09 8.52
15 14.96 56.78 11.06 48.77
16 29.17 45.53 19.71 14.89
17 20.72 24.61 37.63 23.63
18 7.56 7.21 8.78 35.43
19 21.32 9.25 20.14 5.50

This data set has our measurements for the average active time of each patient. Note that the patient number is only unique within the respective study group, so Patient 0 in the control group is different from Patient 0 in the treatment group. Let's start by analyzing the question of whether or not the new tracker can actually increase the base activity level. To answer this question we need to compare the performance of the experimental group at the beginning of the trial (i.e. during the run-in period) vs. the performance of the group at the end of the treatment. What is a null hypothesis which we could use to analyze this question?

One potential null hypothesis we could use is the following: $$ H_0 = \{\text{The mean active time of experimental patients has not improved} \}$$

Since we are measuring the same population at two different time points, one reasonable model to assume is that the variance remains the same from the first time period to the next and only the mean shifts. Since we are looking at average activity times we can reasonably assume a normal distribution for the underlying variables. In abstract terms we can model the average activity time of the experimental population at the baseline period as $\bar{X}_b \sim \mathcal{N}(\mu_b,\frac{\sigma^2}{n})$, and the sample average of the experimental population at the end of the trial as $\bar{X}_f \sim \mathcal{N}(\mu_f,\frac{\sigma^2}{n})$. Using this model we can now rewrite our null hypothesis as: $$ H_0 = \{\mu_f \leq \mu_b\}$$

Note that we do not know the true value of $\sigma^2$ a priori, so we will need to conduct some sort of $T$-test. Since we know that the observation of a patient during the first time period is related to the observation of the same patient at the next time period, we must conduct what is known as a paired 2-sample T-test. Essentially this means we must perform some sort of transformation on our data so that we have independent observations. One way to do this, instead of treating these measurements as two different populations, is to use the differences between the related observations. So let's define a new column for our data set which we will call "Delta_Experimental" and have it take the values: $$ \Delta_i = X_{i,b} - X_{i,f}$$

We compute this column in the following way:

In [4]:
trial_data['Delta_Experimental'] = trial_data['Treatment_Base'] - trial_data['Treatment_End']
trial_data.iloc[:20]
Out[4]:
Control_Base Control_End Treatment_Base Treatment_End Delta_Experimental
Patient_Number
0 15.96 23.73 13.75 18.71 -4.96
1 53.47 26.49 19.16 23.02 -3.86
2 16.52 21.06 12.04 24.41 -12.37
3 21.69 39.10 44.60 29.89 14.71
4 25.56 8.07 6.90 5.60 1.30
5 40.39 33.61 7.37 25.69 -18.32
6 27.53 23.71 27.54 14.83 12.71
7 7.34 14.74 1.32 3.41 -2.09
8 20.00 35.60 4.13 32.86 -28.73
9 28.14 17.15 6.36 36.03 -29.67
10 15.30 35.42 28.27 15.20 13.07
11 31.57 23.79 54.38 37.64 16.74
12 8.02 1.00 20.62 19.28 1.34
13 45.97 24.50 3.23 26.00 -22.77
14 42.02 13.44 28.09 8.52 19.57
15 14.96 56.78 11.06 48.77 -37.71
16 29.17 45.53 19.71 14.89 4.82
17 20.72 24.61 37.63 23.63 14.00
18 7.56 7.21 8.78 35.43 -26.65
19 21.32 9.25 20.14 5.50 14.64

Note that each element of this new column is independent by our assumptions. Moreover, by the Central Limit Theorem, we know that $\bar{\Delta} \sim \mathcal{N}(\mu_b-\mu_f, \frac{\sigma_\Delta^2}{n})$, where $\sigma_\Delta^2$ denotes the variance of a single difference $\Delta_i$. Observe that this test statistic satisfies the same assumptions as the test statistics we used for 1-sample $T$-tests, and so we can proceed in the following way. Let's first compute the sample mean and variance of this modified data:

In [9]:
import numpy as np
delta_experimental_bar = trial_data['Delta_Experimental'].mean()
# note: pandas .var() uses the unbiased (n-1) denominator by default
delta_experimental_s = trial_data['Delta_Experimental'].var()
print('the sample mean of Delta_Experimental is: ' + str(np.around(delta_experimental_bar,2)) 
      + ' the sample variance is: '+ str(np.around(delta_experimental_s,2)))
the sample mean of Delta_Experimental is: -8.17 the sample variance is: 344.82

We are still calculating the sample variance as we did in the previous tests, since we do not know a priori the variance of our data. Since $\mathbb{E}[\bar{\Delta}] = \mu_b - \mu_f$, we know that if $H_0$ is true then $\mathbb{E}[\bar{\Delta}] \geq 0$, so we can conduct a one-sided test. Our $p$-value will be given by:

$$ p = \mathbb{P}(t \leq \sqrt{n}\frac{\bar{\Delta}}{s}) = \mathbb{P}(t \leq \sqrt{30}\frac{-8.17}{\sqrt{344.82}})$$

Here we observe that since we are using the differences to estimate the sample variance of $\bar{\Delta}$ our pivot has $n-1$ degrees of freedom. Using code we can calculate the $p$-value as follows:

In [11]:
from scipy.stats import t
p = t.cdf(np.sqrt(30)*((delta_experimental_bar)/np.sqrt(delta_experimental_s)),29)

print('the p-value of the test is ' + str(p))
the p-value of the test is 0.0112728936071
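
As a sanity check, we can reproduce this $p$-value with scipy's built-in paired test, scipy.stats.ttest_rel. Note that ttest_rel returns a two-sided $p$-value by default, so we convert it to the one-sided value ourselves; this snippet assumes trial_data is still loaded from above.

In [ ]:
from scipy.stats import ttest_rel

# paired test on the raw columns; the statistic matches our delta-based pivot
t_stat, p_two_sided = ttest_rel(trial_data['Treatment_Base'], trial_data['Treatment_End'])

# our alternative puts mass on negative deltas, so halve the two-sided
# p-value when the statistic is negative
p_one_sided = p_two_sided/2 if t_stat < 0 else 1 - p_two_sided/2
print('the one-sided p-value is ' + str(p_one_sided))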

Using a 95% confidence level, what decision should we take with respect to this experiment? (accept $H_0$, reject $H_0$, or collect additional data).

Next we want to examine whether the experimental treatment is better than the control treatment (in terms of average active time). Unlike before, where we only needed data from one of these groups, here we clearly need the data of both groups from the experiment. What is a possible null hypothesis we can analyze to answer this question?

One potential null hypothesis we could consider is of the following form: $$H_0 = \{\text{the treatment on average does not increase activity more than the control}\} $$

Clearly our Delta_Experimental column from before is directly involved in this analysis, but we also need to consider the net change in the control group's physical activity. So, much like before, let's add a new column to our data set which we will call "Delta_Control" and compute it in the same manner as we did for the experimental group, but using the control group's data.

In [12]:
trial_data['Delta_Control'] = trial_data['Control_Base'] - trial_data['Control_End']
trial_data.iloc[:20]
Out[12]:
Control_Base Control_End Treatment_Base Treatment_End Delta_Experimental Delta_Control
Patient_Number
0 15.96 23.73 13.75 18.71 -4.96 -7.77
1 53.47 26.49 19.16 23.02 -3.86 26.98
2 16.52 21.06 12.04 24.41 -12.37 -4.54
3 21.69 39.10 44.60 29.89 14.71 -17.41
4 25.56 8.07 6.90 5.60 1.30 17.49
5 40.39 33.61 7.37 25.69 -18.32 6.78
6 27.53 23.71 27.54 14.83 12.71 3.82
7 7.34 14.74 1.32 3.41 -2.09 -7.40
8 20.00 35.60 4.13 32.86 -28.73 -15.60
9 28.14 17.15 6.36 36.03 -29.67 10.99
10 15.30 35.42 28.27 15.20 13.07 -20.12
11 31.57 23.79 54.38 37.64 16.74 7.78
12 8.02 1.00 20.62 19.28 1.34 7.02
13 45.97 24.50 3.23 26.00 -22.77 21.47
14 42.02 13.44 28.09 8.52 19.57 28.58
15 14.96 56.78 11.06 48.77 -37.71 -41.82
16 29.17 45.53 19.71 14.89 4.82 -16.36
17 20.72 24.61 37.63 23.63 14.00 -3.89
18 7.56 7.21 8.78 35.43 -26.65 0.35
19 21.32 9.25 20.14 5.50 14.64 12.07

Since each data point in each of these new columns corresponds to a separate patient, it would not make sense to assume that there is dependence. Instead, let's assume both populations are independent of each other and that the observations are i.i.d. within each population. Again, we are considering averages, so by the Central Limit Theorem we know that the mean of the experimental deviations $\bar{\Delta}_e \sim \mathcal{N}(\mu_e,\frac{\sigma_e^2}{n_e})$ and the mean of the control deviations $\bar{\Delta}_c \sim \mathcal{N}(\mu_c,\frac{\sigma_c^2}{n_c})$. We write the two variances with separate symbols for now, since a priori the two populations could have different variances. This allows us to rewrite $H_0$ in the following way: $$H_0 = \{\mu_c \leq \mu_e \} $$

Next we need to consider the test statistic and pivot which we can use for the analysis. Given the properties of normal distributions, we know that the difference $\bar{\Delta}_c - \bar{\Delta}_e \sim \mathcal{N}(\mu_c-\mu_e,\frac{\sigma_e^2}{n_e} + \frac{\sigma_c^2}{n_c})$. Again, since the variances are unknown a priori, we need to conduct some sort of $T$-test. Since we have two independent populations, the test we will perform is known as a 2-sample unpaired T-test. Our test statistic will be the difference of the means, but note that to compute a suitable t-distributed pivot we need some notion of the sample variance of the difference between the two populations. For this test we additionally assume the two populations share a common variance $\sigma_e^2 = \sigma_c^2 = \sigma^2$ (if the variances differed, a Welch test would be more appropriate), and we can use the following pooled, unbiased estimate of this variance: $$s^2 = \frac{1}{n_e + n_c -2}\big( \sum_{i=1}^{30}(\Delta_{c,i} - \bar{\Delta}_c )^2 + \sum_{i=1}^{30}(\Delta_{e,i} - \bar{\Delta}_e )^2 \,\big) $$

Observe that to calculate $s^2$ we had to use 60 data points, and that we had to calculate two intermediate statistics (namely the sample means of the two populations). This indicates the statistic $s^2$ has $60-2 = 58$ degrees of freedom. In general, in unpaired $T$-tests, if one group has $n_1$ data points and the other group has $n_2$ data points, then our pivot will have $n_1 + n_2 - 2$ degrees of freedom. So let's compute the relevant statistics:

In [16]:
delta_control_bar = trial_data['Delta_Control'].mean()
unpaired_variance = 0

# accumulate squared deviations from each group's own sample mean
for val in trial_data['Delta_Experimental']:
    unpaired_variance += (val-delta_experimental_bar)**2

for val in trial_data['Delta_Control']:
    unpaired_variance += (val-delta_control_bar)**2

# divide by n_e + n_c - 2 = 58 degrees of freedom to get the pooled variance
unpaired_variance = unpaired_variance/58.0

print('The mean of the control deltas is: ' + str(np.around(delta_control_bar,2))+
      ' the unpaired variance is: ' + str(np.around(unpaired_variance,2)))
The mean of the control deltas is: 3.12 the unpaired variance is: 326.46
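
As an aside, the loop above can be cross-checked with a more idiomatic pandas computation; recall that .var() divides by $n-1$ by default, so we multiply each group's variance back by $n-1$ to recover the sums of squared deviations before pooling.

In [ ]:
# pooled variance via pandas: undo each .var()'s (n-1) denominator, then re-pool
n_e = trial_data['Delta_Experimental'].count()
n_c = trial_data['Delta_Control'].count()
pooled = ((n_e - 1)*trial_data['Delta_Experimental'].var()
          + (n_c - 1)*trial_data['Delta_Control'].var())/(n_e + n_c - 2)
print('the pooled variance is ' + str(np.around(pooled, 2)))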

Now that we have our statistics computed we can calculate the $p$-value. This can be done in the following way: $$p = \mathbb{P}\big(t > \frac{\bar{\Delta}_c - \bar{\Delta}_e}{s\sqrt{\frac{2}{30}}} \big) = \mathbb{P}\big(t >\sqrt{15}\frac{3.12 + 8.17}{\sqrt{326.46}} \big)$$

where $t$ has a Student's $t$-distribution with 58 degrees of freedom. Using code, we can compute this value as:

In [17]:
p = 1- t.cdf(np.sqrt(15)*((delta_control_bar - delta_experimental_bar)/np.sqrt(unpaired_variance)),58)

print('the p-value of the test is ' + str(p))
the p-value of the test is 0.0093627495894
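
As before, we can sanity-check this result against scipy's built-in unpaired test, scipy.stats.ttest_ind, which performs the pooled (equal-variance) test when equal_var=True; it also returns a two-sided $p$-value, so we again convert to the one-sided value ourselves.

In [ ]:
from scipy.stats import ttest_ind

# pooled two-sample test on the delta columns
t_stat, p_two_sided = ttest_ind(trial_data['Delta_Control'], trial_data['Delta_Experimental'],
                                equal_var=True)

# our alternative is mu_c > mu_e, so halve the two-sided p-value when the statistic is positive
p_one_sided = p_two_sided/2 if t_stat > 0 else 1 - p_two_sided/2
print('the one-sided p-value is ' + str(p_one_sided))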

Using a 95% confidence level, what decision should we make with respect to the data? (accept $H_0$, reject $H_0$, collect additional data).