Hypothesis testing is quite common in statistics. Usually a hypothesis comes in the form: “Random Variable (i.e. statistic/measure) A is not different from Random Variable B at the p=.05 level”. We test the hypothesis that A is “not” different from B because, when dealing with random variables, we can never prove that A is the same as B; we can only ask whether the evidence is strong enough to rule that out. As my boss says, “Association does not imply causation.” However, if we can reject the hypothesis that A is not different from B, then we can say that there is a statistically significant difference between A and B, which is what we ultimately want to show.
“Statistically significant” does not mean “a whole lot different from” or “proved different”; it means “different enough that, using the properties of the distribution of A and the distribution of B, I can’t conclude from the given evidence that A and B have the same distribution or are generated by the same function.”
Who knew that significant could have such a nuanced meaning? The phrase “Significant other” really is much more meaningful under this connotation…i.e. my girlfriend/boyfriend is not generated by the same function as me (read “we ain’t brother and sister”)…
Anyhow, usually one will test a hypothesis by first assuming which distribution both A and B come from, working out the necessary math for the distribution of their difference, marking out a test statistic and, badda bing, arriving at a confidence interval. (For a deeper explanation see http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#hypothtest ) The confidence interval is itself a random variable and is usually interpreted as an interval which, over repeated sampling, contains the true measure x% of the time.
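To make the “contains the true measure x% of the time” reading concrete, here is a minimal sketch in R. The true mean of 10, the standard deviation of 3, the sample size of 25 and the seed are all made-up values for illustration; the only thing being demonstrated is that a 95% interval, rebuilt on fresh data many times, covers the true value in roughly 95% of the repetitions.

```r
# Coverage check: how often does a 95% interval contain the true mean?
set.seed(1)                      # arbitrary seed, only for reproducibility
true_mean <- 10                  # hypothetical "true measure"
n_reps    <- 2000                # number of repeated experiments

covered <- replicate(n_reps, {
  x  <- rnorm(25, mean = true_mean, sd = 3)      # one sample of 25 draws
  ci <- t.test(x, conf.level = 0.95)$conf.int    # standard t interval
  ci[1] <= true_mean && true_mean <= ci[2]       # did it cover the truth?
})

coverage <- mean(covered)        # fraction of intervals that covered
print(coverage)                  # should land close to 0.95
```

Any single interval either contains the true value or it doesn’t; the 95% describes the procedure, not one particular interval.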
However, in most real cases the distributions of A and B are such that doing the required math for the distribution of their difference is rather cumbersome. Thus, being mathematicians (in my case an applied mathematician), we might look for an approximation or a shortcut.
For instance, in Bracoo’s post on Athletes and Lawlessness, he states that the proportion of NBA basketball players with a criminal record is 40%, while the proportion of average US citizens with a criminal record is only 21%. It seems that this is a striking difference, but is it? Furthermore, we can look at the lift in NBA criminal records as compared to the average US citizen, which will give us a percentage difference between the two:
percentgain = (NBA-US)/US = (.40-.21)/.21 = 90.5 %
Evaluating the claim that an NBA basketball player is 90.5% more likely to have a criminal record than a US citizen could get quite sticky if we tried to work out the theoretical answer via mathematical statistics.
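For anyone following along in R, the lift arithmetic is a one-liner. The variable names nba_rate and us_rate are mine, not from the original post. Note that the result rounds to 90.5% (truncating instead of rounding gives 90.4%).

```r
# Lift of the NBA rate over the US rate, as a percentage
nba_rate <- 0.40                          # share of NBA players with a record
us_rate  <- 0.21                          # share of US citizens with a record
lift     <- (nba_rate - us_rate) / us_rate
print(round(100 * lift, 1))               # 90.5
```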
Using R we can quickly simulate, or approximate, a confidence interval and a hypothesis test.
First, let’s assume that the population of basketball players is 360 (30 teams with 12 players each) and that we have sampled them all and gotten honest responses. Furthermore, we have sampled a truly random selection of US citizens and gotten 360 true responses. The number of people with a criminal record in each sample then follows a binomial distribution, so each sample percentage is a binomial count divided by N.
NBA ~ iid Binomial(p=.4) with N=360
US ~ iid Binomial(p=.21) with N=360
Since Binomial r.v.’s look a lot like Normal distributions when N is large (a common rule of thumb is greater than ~30, with p not too close to 0 or 1), we can approximate the sample proportions above with a Normal(mean, variance):
Binomial(p) with N ~ Normal(p, p(1-p)/N)
NBA ~ iid Normal(.4, .4(1-.4)/360)
US ~ iid Normal(.21, .21(1-.21)/360)
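If you want to convince yourself that the Normal approximation is reasonable here, a quick sanity check is to simulate binomial proportions directly and compare them with the Normal parameters. This is only a sketch: the 100,000 draws and the seed are arbitrary choices of mine, and I only check the US case (p = .21, N = 360).

```r
# Compare simulated binomial proportions with the Normal approximation
set.seed(42)                              # arbitrary seed for reproducibility
n <- 360                                  # sample size
p <- 0.21                                 # assumed true proportion (US)

props     <- rbinom(100000, size = n, prob = p) / n   # sample proportions
sim_mean  <- mean(props)
sim_sd    <- sd(props)
theory_sd <- sqrt(p * (1 - p) / n)        # sd implied by Normal(p, p(1-p)/N)

# Central 95% range: simulated binomial vs. Normal approximation
sim_q    <- quantile(props, c(0.025, 0.975))
normal_q <- qnorm(c(0.025, 0.975), mean = p, sd = theory_sd)

print(c(sim_mean = sim_mean, sim_sd = sim_sd, theory_sd = theory_sd))
print(rbind(sim_q, normal_q))
```

The simulated mean and standard deviation should sit very close to the theoretical ones, and the two 95% ranges should differ only by about the width of one binomial step (1/360).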
The following R code will quickly produce a .95 confidence interval (significance at the p=.05 level):
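# Simulate 1000 draws of each sample proportion: mean p, sd sqrt(p(1-p)/N)
NBA = rnorm(1000, .4, sqrt(.4*(1-.4)/360))
US = rnorm(1000, .21, sqrt(.21*(1-.21)/360))
# Lift of the NBA rate over the US rate for each simulated pair
lift = (NBA-US)/US
# The middle 95% of the simulated lifts is our confidence interval
confidence.95 = quantile(lift, c(.025, .975))
print(confidence.95)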
If the value we are testing against falls above or below the two numbers given, then we can say that the lift between NBA criminal record rates and US citizen rates is statistically significantly different from that value.
In this case, our confidence interval for our lift metric is roughly 52% to 143% (the exact endpoints shift a little from run to run because the draws are random). Since the value under test (0 = (NBA-US)/US => they are the same) falls outside the interval (52%, 143%), we can say that at the p=.05 level we cannot ascribe the difference between NBA criminal record rates and US general population rates to random sampling variation.
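As a cross-check on the simulation, R’s built-in prop.test can compare the two proportions directly. Two caveats: prop.test looks at the difference NBA - US rather than the lift, and it needs whole-number counts, so I am assuming 144 NBA players (40% of 360) and 76 US citizens (21% of 360 is 75.6, rounded up). Those counts are my illustration, not figures from the original post.

```r
# Two-sample test of equal proportions on hypothetical counts
records <- c(NBA = 144, US = 76)     # assumed counts with a criminal record
sampled <- c(NBA = 360, US = 360)    # sample sizes from the setup above

result <- prop.test(x = records, n = sampled)

print(result$p.value)    # far below .05 if the two rates really differ
print(result$conf.int)   # 95% interval for the difference NBA - US
```

If the interval for the difference excludes 0, that agrees with the simulated lift interval excluding 0: both say the gap is larger than sampling noise alone would explain under these assumptions.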
The advantage of simulating confidence intervals is that, with relatively low error and little time, we can get a good estimate of a distribution that would otherwise be very difficult and time consuming to calculate by hand. If we wanted to publish these results, we would likely need to do the real math. But for a quick result the above method will usually suffice.
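How low is “relatively low error”? One rough way to find out is to rerun the simulation several times and see how much the interval endpoints wobble. The sketch below wraps the simulation in a helper I am calling lift_ci (a hypothetical name), then compares the run-to-run spread for 1,000 draws against 100,000 draws. The seed and the 20 repetitions are arbitrary.

```r
# How much do the simulated interval endpoints vary from run to run?
set.seed(7)                               # arbitrary seed for reproducibility

# Hypothetical helper: one simulated 95% interval for the lift
lift_ci <- function(n_draws, n = 360, p_nba = 0.40, p_us = 0.21) {
  nba <- rnorm(n_draws, p_nba, sqrt(p_nba * (1 - p_nba) / n))
  us  <- rnorm(n_draws, p_us,  sqrt(p_us  * (1 - p_us)  / n))
  quantile((nba - us) / us, c(0.025, 0.975))
}

small <- replicate(20, lift_ci(1000))     # 2 x 20 matrix of endpoints
large <- replicate(20, lift_ci(100000))

# Standard deviation of each endpoint across the 20 repetitions
spread_small <- apply(small, 1, sd)
spread_large <- apply(large, 1, sd)
print(rbind(spread_small, spread_large))
```

With 1,000 draws the endpoints typically move by a few percentage points between runs; with 100,000 draws the wobble shrinks by roughly a factor of ten, at the cost of a slightly longer run.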