In the last few discussions we considered various forms of null hypothesis testing to try to validate our assumptions about observed data. In this discussion we will look at a notion analogous to null hypothesis testing, namely confidence intervals and confidence bounds. We will look at a few examples of computing these sorts of statistics.
But first let's go over a few definitions. Let's suppose we are interested in estimating some unknown quantity $\theta$, and we can only get noisy observations of this quantity. A $1-\alpha$ upper confidence bound (UCB) on $\theta$ is a random variable $\hat{\theta}_{UCB}$ which satisfies:
$$\mathbb{P}(\theta \leq \hat{\theta}_{UCB}) \geq 1-\alpha $$A $1-\alpha$ lower confidence bound (LCB) on $\theta$ is a random variable $\hat{\theta}_{LCB}$ which satisfies:
$$\mathbb{P}(\theta \geq \hat{\theta}_{LCB}) \geq 1-\alpha$$Finally, a $1-\alpha$ confidence interval is a random interval $[\hat{\theta}_L,\hat{\theta}_U]$ such that:
$$\mathbb{P}(\hat{\theta}_{L}\leq \theta \leq \hat{\theta}_{U}) \geq 1-\alpha$$Confidence bounds and intervals are among the most commonly used statistics, and also the most frequently misinterpreted. The definition of a confidence interval does NOT mean that there is a $1-\alpha$ probability of $\theta$ lying in a particular computed interval; rather, if we were to repeat our analysis procedure many times, at least $(1-\alpha)\cdot100 \%$ of the resulting intervals would contain $\theta$. To see why this is the case, note that under our assumptions $\theta$ is not a random variable: it is either in the computed interval or not, and there is no random chance of it jumping into the interval once the interval is computed. On the other hand, the confidence interval itself is a random variable (since it is computed from random data), and hence depends on how the data is collected and processed. In this sense, confidence intervals tell us how sound our data collection and analysis methods are. In this discussion we will take a look at how to compute confidence bounds and intervals.
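Since the guarantee attaches to the procedure rather than to any single computed interval, a quick simulation makes this interpretation concrete. The ground-truth values below (true mean, standard deviation, sample size) are hypothetical, chosen only for illustration: we repeat the sample-and-compute-interval procedure many times and count how often the resulting interval contains the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth, for illustration only
theta, sigma, n = 2000.0, 400.0, 14
z = 1.96  # two-sided 95% critical value
trials = 10_000

# Repeat the procedure: draw a sample, form the interval, check coverage
samples = rng.normal(theta, sigma, size=(trials, n))
x_bar = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)
covered = (x_bar - half_width <= theta) & (theta <= x_bar + half_width)
print(covered.mean())  # close to 0.95
```

Across many repetitions roughly 95% of the intervals cover the true value, even though any single interval either contains it or it does not.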
Recall our one sample hypothesis testing example where we were interested in analyzing the effectiveness of a marketing campaign for our website. Over the course of two weeks we made the following observations in terms of traffic to our site:
import pandas as pd
web_traffic_data = pd.read_csv('traffic_data.csv')
web_traffic_data
We noted that, based on prior knowledge, our data's standard deviation is known to be 400 visitors. Originally we wanted to infer whether the average number of visitors had increased over a previously stated value of 2000 visitors per day. Let's now analyze this data from a different point of view: instead of performing a null hypothesis test, we will construct upper and lower confidence bounds for our mean, as well as a confidence interval. Our assumption is that we have 14 i.i.d. samples of the population, so by the central limit theorem our sample mean is approximately distributed as $\bar{X}_{14} \sim \mathcal{N}(\mu,\frac{400^2}{14})$, where the true mean $\mu$ is the unknown quantity we want to bound. Let's begin by asking: for a general confidence level of $1-\alpha$, how could we calculate an upper and lower confidence bound on $\mu$?
Recall from one sample hypothesis testing that our first step should be to convert our test statistic into a pivot with a known distribution. Since we are assuming known variance we can utilize the $Z$ pivot, which we construct in the following manner:
$$Z = \frac{\bar{X}_{14}-\mu}{400}\cdot\sqrt{14} \sim \mathcal{N}(0,1)$$So how can we construct a bound using this pivot? Note that since the distribution of $Z$ is known and is a standard normal, we can easily find a value $z_\alpha$ such that:
$$1 - \Phi(z_\alpha) = \mathbb{P}(Z \geq z_\alpha) = 1-\alpha $$Here the function $\Phi(z)$ is the CDF of the standard normal distribution. Essentially we can think of $z_\alpha$ as the $\alpha$ percentile of the standard normal distribution. So, substituting our pivot calculation for $Z$, we obtain the following: $$ \mathbb{P}(Z \geq z_\alpha) = \mathbb{P}\left(\frac{\bar{X}_{14}-\mu}{400}\cdot\sqrt{14} \geq z_\alpha\right) = \mathbb{P}\left(\mu \leq \bar{X}_{14} - z_\alpha \frac{400}{\sqrt{14}}\right)$$
Since the standard normal distribution is symmetric we know that $z_\alpha = -z_{1-\alpha}$, so we can write our upper confidence bound as:
$$ \hat{\mu}_{UCB} = \bar{X}_{14} + z_{1-\alpha} \frac{400}{\sqrt{14}} $$And likewise our lower confidence bound is given by:
$$\hat{\mu}_{LCB} = \bar{X}_{14} - z_{1-\alpha} \frac{400}{\sqrt{14}}$$Observe that the more variance we have in our measurements, the wider these bounds get, while the more samples we collect, the more the bounds shrink. Using a 95% confidence level let's construct these bounds. First we need to compute the sample mean:
x_bar = web_traffic_data['Number_of_Visitors'].mean()
print('The sample average is: ' + str(round(x_bar,2)))
Now let's compute the bounds using the $z_{1-\alpha}$ quantile. This can be done using a normal lookup table or by referencing a statistical software package.
from scipy.stats import norm
import numpy as np
alpha = 0.05
# z_{1-alpha}: the (1-alpha) quantile of the standard normal distribution
z_val = norm.ppf(1-alpha, loc=0, scale=1)
mu_ucb = x_bar + z_val*(400.0/np.sqrt(14))
mu_lcb = x_bar - z_val*(400.0/np.sqrt(14))
print('the upper confidence bound on the mean is: '+str(np.around(mu_ucb,2))+' and the lower confidence bound on the mean is: ' + str(np.around(mu_lcb,2)))
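As a side note, the half-width of these bounds, $z_{1-\alpha}\frac{\sigma}{\sqrt{n}}$, shrinks like $1/\sqrt{n}$. The short sketch below reuses the known $\sigma = 400$ from our example with hypothetical larger sample sizes, and shows that quadrupling the number of samples halves the half-width.

```python
from scipy.stats import norm
import numpy as np

sigma = 400.0
z = norm.ppf(0.95)  # one-sided 95% critical value
for n in [14, 56, 224]:  # hypothetical sample sizes; each is 4x the last
    # half-width halves each time n quadruples
    print(n, round(z * sigma / np.sqrt(n), 2))
```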
How should we interpret these numbers? Can we say with 95% confidence that the interval $[1286.09,1637.77]$ contains the true mean?
Clearly not. The reason is that each bound only controls deviations toward a single tail, while a confidence interval must bound deviations in both directions. In this sense we can think of upper and lower confidence bounds as analogous to one-sided tests, while confidence intervals correspond to two-sided tests. We can use the definitions to see this mathematically. Suppose we consider the probability that the true mean is not contained in this interval; then:
$$ \mathbb{P}(\mu \notin [\hat{\mu}_{LCB},\hat{\mu}_{UCB}]) \leq \mathbb{P}(\mu > \hat{\mu}_{UCB}) + \mathbb{P}(\mu < \hat{\mu}_{LCB}) = 2\alpha$$Here the last equality follows because for our example we were able to construct these bounds with equality. This means that if we use this interval then:
$$ \mathbb{P}(\mu \in [\hat{\mu}_{LCB},\hat{\mu}_{UCB}]) \geq 1-2\alpha$$Clearly $1-2\alpha < 1-\alpha$, so this would not be a $1-\alpha$ confidence interval. What this suggests, however, is that we could construct a $1-\alpha$ confidence interval by combining two $1-\frac{\alpha}{2}$ confidence bounds. Hence a $1-\alpha$ confidence interval $[\hat{\mu}_L,\hat{\mu}_U]$ for our example is computed as:
$$\hat{\mu}_L = \bar{X}_{14} - z_{1-\alpha/2} \frac{400}{\sqrt{14}}$$$$\hat{\mu}_U = \bar{X}_{14} + z_{1-\alpha/2} \frac{400}{\sqrt{14}}$$Using these formulas we can now compute the confidence interval using the following code block:
z_val = norm.ppf(1-alpha/2.0, loc=0, scale=1)
mu_u = x_bar + z_val*(400.0/np.sqrt(14))
mu_l = x_bar - z_val*(400.0/np.sqrt(14))
print('the 95% confidence interval for the mean is: ['+str(np.around(mu_l,2))+','+str(np.around(mu_u,2))+']')
Note that the lower and upper endpoints of the interval are clearly more conservative than the one-sided bounds. How should we interpret these results?
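One way to see where the extra conservatism comes from is to compare the critical values directly: the one-sided bounds use the $0.95$ quantile of the standard normal, while the 95% interval uses the $0.975$ quantile.

```python
from scipy.stats import norm

print(norm.ppf(0.95))   # ~1.645, used by the one-sided bounds
print(norm.ppf(0.975))  # ~1.960, used by the two-sided interval
```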
We will now apply confidence interval analysis to our second one sample hypothesis testing example. Recall that in this scenario we collected data from 30 participants in an activity tracker experiment. We measured each of the 30 participants' average active time over the last two weeks of the experiment to see if there had been any effect from wearing the tracker. The data from this experiment is below:
activity_data = pd.read_csv('activity_data.csv')
activity_data.iloc[:20]
Our assumptions about this data sample were that these are i.i.d. observations with unknown mean and unknown variance. Using these assumptions, the central limit theorem gives that the sample mean is approximately distributed as $\bar{X}_{30} \sim \mathcal{N}(\mu,\frac{\sigma^2}{30})$, where both $\mu$ and $\sigma^2$ are unknown. Let us use these notions to construct confidence bounds and a confidence interval for the mean of our data. Let's start in the general case and consider $1-\alpha$ confidence bounds and intervals. Note that again we need to convert a statistic with an unknown distribution into a pivot with a known distribution, from which we can construct the intervals. In particular, since both the mean and variance are unknown, we can use a $T$ pivot given by: $$ T = \frac{\bar{X}_{30}-\mu}{s}\cdot\sqrt{30} \sim \mathcal{T}(29)$$
Here $s^2$ is the sample variance, and $\mathcal{T}(d)$ denotes a T-distribution with $d$ degrees of freedom. Since estimating the sample variance consumes one degree of freedom, we have $n-1 = 29$ degrees of freedom for this particular scenario. If we want to construct an upper confidence bound we can use the same technique as we did for the known variance case, by considering the $1-\alpha$ percentile of this distribution given by:
$$\mathbb{P}(T \leq t_{1-\alpha,29}) = 1 - \alpha$$Using the same algebra as before this allows us to compute the upper confidence bound as:
$$ \hat{\mu}_{UCB} = \bar{X}_{30} + t_{1-\alpha,29}\frac{s}{\sqrt{30}} $$And the lower confidence bound will be given as:
$$ \hat{\mu}_{LCB} = \bar{X}_{30} - t_{1-\alpha,29}\frac{s}{\sqrt{30}} $$There are a few key differences to note between these bounds and the bounds computed for the case of known variance. First, the sample standard deviation replaces the true standard deviation, so if our observations have high sample variance our bounds will widen. Furthermore, we need to keep track of the number of degrees of freedom in our estimation. Let us compute the upper and lower confidence bounds for our current data set. First, we need to get the sample mean and variance:
x_bar = activity_data['Active_Time_min'].mean()
s_squared = activity_data['Active_Time_min'].var()
print('The sample average is: ' + str(round(x_bar,2))+ ' and the sample variance is: ' + str(round(s_squared,2)))
Now let us compute the bounds for a confidence level of 95%. This can also be done using a t-table lookup.
from scipy.stats import t
alpha = 0.05
# t_{1-alpha,29}: the (1-alpha) quantile of the T-distribution with 29 degrees of freedom
t_val = t.ppf(1-alpha,29)
mu_ucb = x_bar + t_val*(np.sqrt(s_squared/30.0))
mu_lcb = x_bar - t_val*(np.sqrt(s_squared/30.0))
print('the upper confidence bound on the mean is: '+str(np.around(mu_ucb,2))+' and the lower confidence bound on the mean is: ' + str(np.around(mu_lcb,2)))
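As a sanity check on the price of estimating the variance, we can compare the t critical value against the normal one. With 29 degrees of freedom the T-distribution has slightly heavier tails, so the resulting bounds are a bit wider than they would be if the variance were known.

```python
from scipy.stats import norm, t

print(t.ppf(0.95, 29))  # ~1.699, T critical value with 29 degrees of freedom
print(norm.ppf(0.95))   # ~1.645, normal critical value
```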
How should we interpret these results?
Much like in the known variance case, to construct a $1-\alpha$ confidence interval for the mean we can construct lower and upper confidence bounds at the $1-\frac{\alpha}{2}$ level of confidence. For our current data set this interval is given by $[\hat{\mu}_L,\hat{\mu}_U]$, where:
$$ \hat{\mu}_{U} = \bar{X}_{30} + t_{1-\alpha/2,29}\frac{s}{\sqrt{30}} $$$$ \hat{\mu}_{L} = \bar{X}_{30} - t_{1-\alpha/2,29}\frac{s}{\sqrt{30}} $$Again we must keep track of the number of degrees of freedom in the calculation of our statistics. These values can be calculated using code or a t-table lookup. For our example this interval computation can be done as follows:
t_val = t.ppf(1-alpha/2.0, 29)
mu_u = x_bar + t_val*(np.sqrt(s_squared/30.0))
mu_l = x_bar - t_val*(np.sqrt(s_squared/30.0))
print('the 95% confidence interval for the mean is: ['+str(np.around(mu_l,2))+','+str(np.around(mu_u,2))+']')
Much like in the case of known variance, the confidence interval endpoints are more conservative than the one-sided bounds. How should we interpret the results of our analysis?
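As with the known-variance case, we can check the procedure-level interpretation by simulation. The true mean and standard deviation below are hypothetical placeholders (in the real experiment they are unknown); the point is that the t-based interval, recomputed over many repeated experiments with $n = 30$, covers the truth about 95% of the time.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)

# Hypothetical ground truth, for illustration only
mu, sigma, n, alpha = 45.0, 10.0, 30, 0.05
t_val = t.ppf(1 - alpha / 2, n - 1)
trials = 10_000

# Repeat the procedure: draw a sample, form the t-interval, check coverage
samples = rng.normal(mu, sigma, size=(trials, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)  # sample standard deviation per experiment
half = t_val * s / np.sqrt(n)
covered = (x_bar - half <= mu) & (mu <= x_bar + half)
print(covered.mean())  # close to 0.95
```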