Archive for the ‘Statistics’ Category

Adaptive UI

Thursday, October 9th, 2008

Designing a website or a user interface for a program is hard work.  One has to study people and their decision-making processes, which are often hard to predict and difficult to quantify.

For instance, google.com shows a clear understanding that users presented with multiple options get distracted by anything other than the task at hand.  Thus their site has one main focus: a search bar.  They are, after all, a search company.

[Screenshots: google.jpg, yahoo-screenshot-2.jpg]

But what if user interface designers and website designers started to pay attention to the actual math behind user interactions?  What if, like a living organism, the website or user interface adapted to the user?  For instance, if the website knew that people browsing on cloudy days liked a more cheerful design, why wouldn’t it brighten up the background?

Or perhaps, in Yahoo’s case, why not reorder the sidebar links by how often they are clicked?  What about Yahoo’s search bar?  Might that feature become larger and more prominent?

Why aren’t website and program designers doing this already?  They may be, but the sheer volume and dynamism of large user bases make it a very difficult and ever-changing job.

I suggest that website and user interface designers move to a design framework in which the site or program collects user statistics and then reorganizes itself according to some preset rules.  Flex, Flash, PHP, Python and a host of other languages and platforms could all support these features easily.  And if you can’t think of a better optimization algorithm, simply let the user interactions with the site act as a sort of genetic algorithm.
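As a minimal sketch of the "collect statistics, then reorganize by a preset rule" idea, the R snippet below reorders a set of sidebar links by click count. The link names and counts are made up for illustration, and `reorder_links` is a hypothetical helper, not part of any framework mentioned above.

```r
# Hypothetical click counts collected for each sidebar link
clicks = c(Mail = 120, News = 340, Sports = 95, Finance = 210)

# Preset rule (an assumption for this sketch): most-clicked links go first
reorder_links = function(click_counts) {
  names(sort(click_counts, decreasing = TRUE))
}

sidebar = reorder_links(clicks)
print(sidebar)
```

A real site would recompute the ordering periodically from its logs; the rule itself could be swapped for anything else that maps statistics to a layout.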

What are you going to do with that Ph.D. in stats… math.

Monday, April 7th, 2008

Mathematicians and statisticians are often asked, usually in a highly dubious tone, “so…. what are you going to do with that advanced stats/math degree?” Sometimes it’s followed with, “teach?”

So many people must have asked this question that the Department of Statistics at Columbia University put on a whole conference devoted to the subject. The conference included panelists from both academia and industry.

A postdoc from the department commented on the conference.

I thought the following was a good observation:

There is not as much flexibility in industry as in academia (research must be in the company’s interests); however, the compensation is usually much better.

All industry panelists agreed that statisticians must be excited by data.

It’s true, to be a good mathematical/statistical researcher you must be excited by data!

Best Time to Release Your Good Post

Friday, December 28th, 2007

Have you ever wondered what the best time to release your post, article, blog or news release is?

The answer is immediately prior to Monday, Tuesday, or Wednesday mornings.

These stats were gathered from the July-Dec Scroggles! webserver.


R-Code for Simulating a Confidence Interval for the Difference in Two Binomial Random Variables

Wednesday, August 22nd, 2007

Hypothesis testing is quite common in statistics. Usually a hypothesis comes in the form: “Random Variable (i.e. statistic/measure) A is not different from Random Variable B at the p=.05 level”. We test the hypothesis that A is “not” different from B because, when dealing with random variables, it is impossible to prove that A is the same as B; the evidence can only fail to contradict it. As my boss says, “Association does not imply causation.” However, if we can reject the hypothesis that A is not different from B, then we can say that there is a statistically significant difference between A and B, which is what we ultimately want to show.

“Statistically significant” does not mean “a whole lot different from” or “proved different”; it means “different enough that, using the properties of the distribution of A and the distribution of B, I can’t conclude from the given evidence that A and B have the same distribution or are generated by the same function.”

Who knew that “significant” could have such a nuanced meaning? The phrase “significant other” really is much more meaningful under this connotation… i.e. my girlfriend/boyfriend is not generated by the same function as me (read “we ain’t brother and sister”)…

Anyhow, one usually tests a hypothesis by first assuming which distribution A and B each come from, working out the necessary math for the distribution of their difference, marking out a test statistic and, badda bing, arriving at a confidence interval. (For a deeper explanation see http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#hypothtest ) The confidence interval is itself a random variable and is usually interpreted as an interval which contains the true measure x% of the time.

However, in most real cases the distributions of A and B are such that doing the required math for the distribution of their difference is rather cumbersome. Thus, being mathematicians (in my case an applied mathematician), we might look for an approximation or a shortcut.

For instance, in Bracoo’s post on Athletes and Lawlessness, he states that the share of NBA basketball players with a criminal record is 40%, while the share of average US citizens with a criminal record is only 21%. This seems like a striking difference, but is it? Furthermore, we can look at the lift in NBA criminal records as compared to the average US citizen, which gives us the percentage difference between the two:

percentgain = (NBA-US)/US = (.40-.21)/.21 = 90.5 %

Evaluating the claim that there is a 90.5% greater likelihood of an NBA basketball player committing a crime than a US citizen could get quite sticky if we tried to work out the theoretical answer via mathematical statistics.

Using R we can quickly simulate, or approximate, a confidence interval and a hypothesis test.

First, let’s assume that the population of basketball players is 360 (30 teams with 12 players each) and that we have sampled them all and gotten honest responses. Furthermore, assume we have sampled a truly random selection of US citizens and gotten 360 true responses. The number of “yes” responses in each sample follows a binomial distribution, so each sample percentage is a binomial count divided by N.

NBA ~ iid Binomial(p=.4) with N=360

US ~ iid Binomial(p=.21) with N=360

Since Binomial r.v.’s look a lot like Normal distributions when N is high (i.e. greater than ~ 30), we can approximate the above distributions of the sample proportions with a Normal distribution, written here as Normal(mean, variance):

Binomial(p) with N ~ Normal(p, p(1-p)/N)
NBA ~ iid Normal(.4, .4(1-.4)/360)
US ~ iid Normal(.21, .21(1-.21)/360)
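As a quick sanity check on this approximation, the sketch below draws sample proportions from the exact binomial model and compares their spread with the standard deviation sqrt(p(1-p)/N) used in the normal approximation. The seed and the number of draws are arbitrary choices for illustration.

```r
set.seed(1)   # arbitrary seed, only so the run is reproducible
N = 360
p = 0.4

# Sample proportions drawn from the exact binomial model
prop.binom = rbinom(100000, size = N, prob = p) / N

# Standard deviation implied by the normal approximation
sd.normal = sqrt(p * (1 - p) / N)

print(c(simulated = sd(prop.binom), approximate = sd.normal))
```

If the two numbers agree to a few decimal places, the normal stand-in is reasonable for this N and p; for p very close to 0 or 1 the agreement would be worse.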

The following R-code will quickly produce a .95 confidence interval (significance at the p=.05 level)


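# Simulate 1000 draws of each sample proportion (normal approximation)
NBA = rnorm(1000, .4, sqrt(.4*(1-.4)/360))
US = rnorm(1000, .21, sqrt(.21*(1-.21)/360))

# Lift of the NBA rate over the US rate for each simulated pair
lift = (NBA-US)/US

# The middle 95% of the simulated lifts is the confidence interval
confidence.95 = quantile(lift, c(.025, .975))
print(confidence.95)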

If our test statistic (the hypothesized value of the lift) falls outside the two numbers given, then we can say that the lift between NBA criminal record rates and US citizen rates is significantly different from the test statistic.

In one run, our confidence interval for the lift metric was (52%, 143%); the endpoints shift slightly from run to run because the draws are random. Since our test statistic (0 = (NBA-US)/US => they are the same) falls outside the interval (52%, 143%), we can say that at the p=.05 level we cannot ascribe the difference between NBA criminal record rates and US general population rates to random variance in the population.

The advantage of simulating confidence intervals is that, with relatively low error and little time, we can get a good estimate of a distribution that would otherwise be very difficult and time-consuming to calculate by hand. If we wanted to publish these results, we would likely need to do the real math. But for a quick result the above method will usually suffice.
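As a rough cross-check on the simulation, one piece of the "real math" can be sketched with the delta method for a ratio of two independent estimates. This is an assumption about one reasonable analytic shortcut, not the method used in the post; the variable names are invented for the example.

```r
# Inputs taken from the example above
p.nba = 0.40
p.us  = 0.21
N     = 360

# Variances of the two sample proportions
v.nba = p.nba * (1 - p.nba) / N
v.us  = p.us  * (1 - p.us)  / N

# Delta-method standard error of the ratio NBA/US (independent samples);
# the lift is ratio - 1, so it shares the same standard error
se.lift = (p.nba / p.us) * sqrt(v.nba / p.nba^2 + v.us / p.us^2)

# Symmetric 95% interval around the point estimate of the lift
lift.hat = (p.nba - p.us) / p.us
ci.delta = lift.hat + c(-1, 1) * qnorm(0.975) * se.lift
print(round(ci.delta, 3))
```

The interval comes out at roughly 45% to 136%. It is symmetric around the point estimate, whereas the simulated lift is skewed to the right, which is one reason the two intervals need not match exactly.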

R-Code for Generating a Cumulative Distribution Function

Wednesday, August 22nd, 2007

I was looking for a way to create a cumulative distribution function (CDF) in R today and for once, it doesn’t have something I’m looking for! Actually it’s more likely that I just wasn’t looking for the right thing. Anyways, I figured out how to produce a nice little plot of the CDF. I’m sure that you could generate a nicer one with some interpolation and such but in case you need a quick one here goes.

Update: A nice visitor to the blog showed me that indeed I had overlooked a very simple way to do what I will illustrate in less elegant code immediately after this note.

# x is a vector of items that you wish to find the CDF for
plot(ecdf(x))
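A small usage sketch, assuming some made-up data for `x`: `ecdf()` returns a function, so besides plotting it you can evaluate the empirical CDF at any point.

```r
set.seed(42)      # arbitrary seed for reproducibility
x = rnorm(500)    # hypothetical sample data

Fx = ecdf(x)      # Fx(q) is the fraction of observations <= q
print(Fx(0))      # estimated P(X <= 0)

plot(Fx, main = "Empirical CDF of x")
```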

End Update

And now for the less elegant way of doing it.

# x is a vector of items that you wish to find the CDF for
x.hist = hist(x, plot=FALSE, breaks=100)
x.counts = x.hist$counts
x.mids = x.hist$mids
x.cdf = cumsum(x.counts)/sum(x.counts)
# and plot it as a step function
plot(x.mids, x.cdf, type="s", main="title", xlab="value", ylab="cumulative probabilities")

You could define your histogram of x to be more or less detailed (breaks=n) and it will still plot correctly.
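To see how the histogram-based CDF relates to `ecdf()`, the sketch below compares the two on made-up data. With the default `hist()` settings each bin is closed on the right, so the cumulative proportions correspond to the right bin edges (`breaks[-1]`) rather than the midpoints; plotting against the midpoints shifts the curve left by half a bin width, which is negligible when there are many bins.

```r
set.seed(7)        # arbitrary seed for reproducibility
x = rexp(1000)     # hypothetical sample data

x.hist = hist(x, plot = FALSE, breaks = 100)
x.cdf  = cumsum(x.hist$counts) / sum(x.hist$counts)

# Evaluate the empirical CDF at the right edge of every bin
right.edges = x.hist$breaks[-1]
max.gap = max(abs(x.cdf - ecdf(x)(right.edges)))
print(max.gap)     # should be zero or negligibly small
```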
