This is part 3 of a brief introduction to statistics; see here for part 2 and here for part 1.
Have you ever watched the vote count on TV on election day, where they keep showing the status of the count all day long? In Portugal they do a full day of this and, yes, it is very boring... But one thing that amazed me was how they could know who was going to win before all the votes were counted. At some point they say something like: we have counted 75% of all votes, and we can say that president Y is going to win, and the final result will be between A and B! How amazing is that? How can they know how the people whose votes have not been counted yet voted?
Are they working with some magic crystal ball that shows the future? And if they have this magic crystal ball, why don't they just guess the lottery numbers and quit working on these boring election shows? Very intriguing stuff...
Normal distribution
When mathematicians started working out how to do statistics, one thing they noticed is that a lot of data, when displayed in a histogram (where x = the value itself and y = the number of times each value occurs in the sample data), looks like a bell-shaped graph.
This distribution (the way the values are distributed on the graph) is called the normal distribution. Because it occurs so frequently, it is a very useful (I'd say even essential) notion to be aware of.
So if we know that the data we are looking at is normally distributed, we know that it has some well-known properties, and those properties let us extract reliable statistics from it.
Note that the normal distribution is not the only one that exists, but it is by far the most important one: http://en.wikipedia.org/wiki/Normal_distribution.
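To see that bell shape for yourself, here is a minimal sketch in R. The data is simulated with rnorm, so it is normal by construction, and the mean of 170 and standard deviation of 10 are made-up numbers chosen only for illustration.

```r
# Simulated data: 1000 values drawn from a normal distribution.
# The mean (170) and standard deviation (10) are made-up numbers.
set.seed(42)                      # make the random draw reproducible
values <- rnorm(1000, mean = 170, sd = 10)

# Histogram: x = the value itself, y = how many values fall in each bin
hist(values, breaks = 30,
     main = "Simulated normal data", xlab = "Value")
```

Most bars pile up around the middle and thin out evenly on both sides, which is the bell.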
There are also techniques to convert non-normal distributions to normal ones, to some extent.
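One such technique, as a sketch: data with a long tail to the right often looks much closer to normal after taking logarithms. The data below is simulated with rlnorm, so the log transform works perfectly here by construction; on real data it is only something to try and then check.

```r
set.seed(1)
skewed <- rlnorm(500)             # simulated data with a long right tail

par(mfrow = c(1, 2))              # two plots side by side
hist(skewed, breaks = 30, main = "Original (skewed)")
hist(log(skewed), breaks = 30, main = "After log()")

# In skewed data the mean is pulled away from the median;
# after log() the two end up close to each other
c(mean = mean(skewed), median = median(skewed))
c(mean = mean(log(skewed)), median = median(log(skewed)))
```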
So, in the election story, what they do is this: by looking at previous elections, they have seen that the voting follows a normal distribution, so they assume that this year's voting will follow it too. Then, even with only a partial count of the votes, they can predict the outcome within some reliability limits. Clever stuff, no?!
And guess what, voting is not the only thing you can apply this to! You can actually try to apply it to some more interesting things :)
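A rough sketch of that idea in R. Every number here is made up, and the sketch assumes the counted votes behave like a random sample of all votes and uses the normal approximation for a share; real election-night projections are likely more elaborate than this.

```r
# Hypothetical numbers, made up for illustration only
counted <- 750000     # votes counted so far
for_y   <- 397500     # of those, votes for candidate Y

p_hat <- for_y / counted                        # Y's share among the counted votes
se    <- sqrt(p_hat * (1 - p_hat) / counted)    # standard error of that share
z     <- qnorm(0.975)                           # about 1.96, for 95% confidence

# The "between A and B" interval for Y's final share
ci <- c(A = p_hat - z * se, B = p_hat + z * se)
ci
```

With these made-up numbers the whole interval sits above 50%, which is the kind of situation where a winner can be announced before the count is finished.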
So, how do I use it?
Let's say you have collected sample data from an experiment, and you need to figure out from that sample what the results are for the overall population.
1st: Make sure that the data follows the normal distribution.
This applies even to simple things like the Muffins story, where we did a simple average calculation, but because the data didn't follow the normal distribution, the average was not accurate enough!
A commonly used technique is the normal Q-Q plot, which plots the quantiles of your data against the ones expected from a normal distribution.
So let's take "precip", one of the default data sets of R, which holds the "Precipitation [in/yr] for 70 US cities", and plot it.
> qqnorm(precip, ylab = "Precipitation [in/yr] for 70 US cities")
> qqline(precip)
qqnorm generates the plot of the values, and qqline adds a reference line through them (by default it passes through the first and third quartiles).
The better the values (the dots) fit that line, the closer the data is to the normal distribution.
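To get a feeling for what a good and a bad fit look like, here is a small sketch that draws the same plot for two simulated samples: one from rnorm (normal by construction) and one from rexp (clearly not normal). Both samples are made up; only the contrast between the plots matters.

```r
set.seed(7)
normal_data <- rnorm(70)          # simulated, normal by construction
skewed_data <- rexp(70)           # simulated, clearly not normal

par(mfrow = c(1, 2))              # two plots side by side
qqnorm(normal_data, main = "Normal data")
qqline(normal_data)
qqnorm(skewed_data, main = "Skewed data")
qqline(skewed_data)
```

In the first plot the dots hug the line; in the second they bend away from it, especially at the ends.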
2nd: Calculate the confidence intervals.
If the data we have follows the normal distribution, then we know how it is spread around the mean: about 34.1% of the values lie within one standard deviation below the mean, and another 34.1% within one standard deviation above it.
Thus we can calculate an interval and say that the value will be between X and Y with 68% (34.1 + 34.1) confidence or, for example, 95% confidence.
Let's see an example using the Student's t-test method. There are other ways to do it, but a Student's t-test is a good way to go, because it is slightly more tolerant of experimental data, which always contains some measurement error.
Let's use the "precip" data set again, "Precipitation [in/yr] for 70 US cities":
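The 68% and 95% figures can be checked directly in R with pnorm, which gives the share of a normal distribution below a given point (measured here in standard deviations from the mean), and qnorm, which goes the other way.

```r
# Share of a normal distribution within 1 and 2 standard deviations of the mean
within_1sd <- pnorm(1) - pnorm(-1)
within_2sd <- pnorm(2) - pnorm(-2)
round(c(within_1sd, within_2sd), 3)   # 0.683 0.954

# How many standard deviations are needed to cover exactly 95%
z_95 <- qnorm(0.975)
z_95                                  # about 1.96
```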
> t.test(precip)
and among other information we can see:
95 percent confidence interval:
31.61748 38.15395
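This means we can be 95% confident that the average yearly precipitation of the cities this sample represents is between 31.61748 and 38.15395 inches. Note that the interval is about the average, not about every single city.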
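Where do those two numbers come from? As a sketch, the same interval can be rebuilt by hand: it is the sample mean plus or minus a t value times the standard error of the mean. The variable names below are my own; only precip and the functions are R's.

```r
n      <- length(precip)                 # 70 cities
m      <- mean(precip)                   # sample mean
se     <- sd(precip) / sqrt(n)           # standard error of the mean
t_crit <- qt(0.975, df = n - 1)          # t value leaving 2.5% in each tail

manual_ci <- c(m - t_crit * se, m + t_crit * se)
manual_ci                                # the same two numbers t.test(precip) reports
```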
You see, we can guess the future to some extent :)
Back to the muffins story:
Thus, given the data:
{ monday:2, tuesday:3, wednesday:1, thursday:2, friday:68, saturday:1, sunday:0 }
1. Does it follow the normal distribution?
> muffins = c(2,3,1,2,68,1,0)
> qqnorm(muffins, ylab = "Muffins sold per day")
> qqline(muffins)
There is a nice-looking line that includes most of the points, except one! That one is not normal (that is, it does not fit a normal distribution).
Plotting this graph is also very useful to spot those strange values.
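Besides eyeballing the plot, such strange values can also be flagged in code. A sketch using boxplot.stats from base R, which applies the usual boxplot rule (points more than 1.5 times the height of the box away from the box). This rule is a different convention from the Q-Q plot, swapped in here only because it is easy to automate.

```r
muffins <- c(2, 3, 1, 2, 68, 1, 0)

strange <- boxplot.stats(muffins)$out    # values far outside the bulk of the data
strange                                  # 68

kept <- muffins[!muffins %in% strange]   # the data without those values
kept
```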
2. Let's calculate the confidence intervals for that distribution:
2.1 WITH the un-normal point
> t.test(muffins)
95 percent confidence interval:
-12.26252 34.26252
Say what? Negative, even... This does not make sense. It happens because of that un-normal point. We cannot use this calculation on data that is not normally distributed!
Let's try now without that point.
2.2 WITHOUT the un-normal point
In statistics language the un-normal point is called an outlier, and one common trick is simply to eliminate it from the data, because it is too far out to be valid: most likely a measurement error or an unusual event.
So now we get:
> muffins_clean = c(2,3,1,2,1,0)
> t.test(muffins_clean)
95 percent confidence interval:
0.3993426 2.6006574
OK, you see, this makes more sense. It says we can be 95% confident that the average number of muffins sold per day is between 0.3993426 and 2.6006574. Rounding, that means between 0 and 3.
If we want a narrower interval (at the cost of lower confidence), we can do:
> t.test(muffins_clean, conf.level = 0.75)
75 percent confidence interval:
0.9429669 2.0570331
Thus, with 75% confidence, average daily sales will be between 1 and 2 muffins, which makes a lot more sense than the originally estimated 11!
A rule of thumb is: the more accurate we want to be, the more data we should include in the test.
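The trade-off between confidence and the size of the gap can be sketched by running the same test at several confidence levels. The levels below are arbitrary picks; conf.level is the same t.test argument used above.

```r
muffins_clean <- c(2, 3, 1, 2, 1, 0)
conf_levels   <- c(0.50, 0.75, 0.95, 0.99)

# Width of the interval (upper limit minus lower limit) at each confidence level
widths <- sapply(conf_levels, function(cl) {
  diff(t.test(muffins_clean, conf.level = cl)$conf.int)
})
data.frame(conf.level = conf_levels, width = round(widths, 2))
```

The more confidence we ask for, the wider the interval gets.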
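A small sketch of that rule of thumb: for data with the same spread, the 95% interval shrinks as the sample grows. The spread (sigma = 1) and the sample sizes are made-up values, and the formula is the same "t value times standard error" half-width that t.test uses.

```r
sigma <- 1                                  # assumed spread of the data (made up)
n     <- c(5, 10, 50, 100, 1000)            # sample sizes to compare

# Half-width of a 95% t interval: t value times standard error
half_width <- qt(0.975, df = n - 1) * sigma / sqrt(n)
data.frame(n = n, half_width = round(half_width, 3))
```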
This is part 3 of a 3-part brief introduction to statistics. See here for part 2 and here for part 1.