Alexandre's Notebook: September 2008

Statistics Part 3, the Election day

This is part 3 of a brief introduction to statistics; see here for part 2 and here for part 1.

Have you ever watched the vote count on TV on election day, where they keep showing the state of the count all day long? Well, in Portugal they do a full day of this and, yes, it is very boring... But one thing that amazed me was how they could know who was going to win before all the votes were counted. At some point they say: we have counted, for example, 75% of all votes, and we can say that president Y is going to win, and the final result will be between A and B! How amazing is that? How can that be? How can they know how the people whose votes are not counted yet have voted?
Are they working with some magic crystal ball where you can see the future? And if they have this magic crystal ball, why don't they just guess the lottery and quit working for these boring election shows?... Very intriguing stuff...

Normal distribution

When mathematicians started to look at how to do statistics, one thing they noticed is that a lot of data, when displayed in a histogram (where x = the value itself, y = the number of times each value occurs in the sample data), often has the shape of a bell.

This distribution (the way values get distributed on the graph) is called the normal distribution. Because it occurs so frequently, it is a very useful (I'd say even essential) notion to be aware of.

So if we know that the data we are looking at is normally distributed, then we know that it has some well-understood properties, and we can use them to extract reliable statistics from it.
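
To get a feel for that bell shape, here is a minimal sketch that simulates normally distributed data and draws its histogram. The variable name `heights` and the chosen mean and standard deviation are made up for illustration; only `rnorm`, `hist`, `mean` and `sd` are standard R.

```r
set.seed(1)                                   # arbitrary seed, so the run is repeatable
heights <- rnorm(10000, mean = 170, sd = 10)  # hypothetical data: 10000 simulated values
hist(heights, breaks = 50, main = "Simulated normal data")
mean(heights)                                 # should land close to 170
sd(heights)                                   # should land close to 10
```

With this many simulated values the bars should pile up around 170 and thin out symmetrically on both sides.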

Note that the normal distribution is not the only one that exists, but it is by far the most important one: http://en.wikipedia.org/wiki/Normal_distribution.
There are also techniques to transform some non-normal data so that it gets closer to normal, to some extent.

So, in the vote-counting story, what they do is roughly this: statisticians know that an estimate computed from a large number of votes varies around the true value in a way that is approximately normal, so they can assume that this year's count will behave the same way. And then, even with only a partial count of the votes, they can predict what is going to happen within some reliability limits. Clever stuff, no?!
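
The idea can be sketched with a small simulation. Everything here is an assumption made for illustration: the electorate size, the "true" share for candidate Y, and above all the idea that the first 75% of votes counted behave like a random sample (real counts arrive region by region, so broadcasters need more careful models). The interval uses the usual normal approximation for a proportion.

```r
set.seed(2008)                               # arbitrary seed, only for repeatability
n_votes    <- 100000                         # hypothetical number of voters
true_share <- 0.53                           # hypothetical share voting for Y
votes <- rbinom(n_votes, size = 1, prob = true_share)  # 1 = a vote for Y

counted <- votes[seq_len(round(0.75 * n_votes))]  # the 75% counted so far
p_hat   <- mean(counted)                          # share for Y among counted votes
se      <- sqrt(p_hat * (1 - p_hat) / length(counted))
ci      <- p_hat + c(-1, 1) * qnorm(0.975) * se   # approximate 95% interval

ci                # the "between A and B" announced on TV
mean(votes)       # share for Y once every vote is counted
```

If the whole interval sits above 0.5, the show can call the winner before the count is finished.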

And guess what, voting is not the only thing you can apply this to! You can actually try to apply this to some more interesting things :)

So, how do I use it?

Let's say you have collected sample data from an experiment, and you need to figure out from that sample what the results are for the overall population.

1st: Make sure that the data follows a normal distribution.

This applies even to simple things like the Muffins story, where we did a simple average calculation, but because the data didn't follow a normal distribution the average was not accurate enough!

A commonly used technique is the normal QQ plot, which plots the quantiles of your data against the quantiles expected from a normal distribution.

So let's take "precip", which is one of R's built-in data sets and holds the "Precipitation [in/yr] for 70 US cities", and plot it.
> qqnorm(precip, ylab = "Precipitation [in/yr] for 70 US cities")
> qqline(precip)

qqnorm generates the plot of the values, and qqline adds a reference line that passes through the first and third quartiles of the data.

The better the values (the dots) fit that line, the closer the data is to a normal distribution.
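
Besides judging the plot by eye, the fit can be summarised with one number. This is only a rough sketch, not a formal normality test: it asks `qqnorm` for the plotted coordinates (its `plot.it = FALSE` argument suppresses the drawing) and measures how close the dots are to a straight line. The names `qq` and `r` are just illustrative.

```r
qq <- qqnorm(precip, plot.it = FALSE)  # returns the coordinates instead of drawing
r  <- cor(qq$x, qq$y)                  # x = theoretical quantiles, y = the data
r                                      # close to 1 means the dots hug a straight line
```

The closer `r` is to 1, the more the dots line up; a clearly lower value is a hint to look at the plot again.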

2nd: Calculate the confidence intervals.

If the data we have follows the normal distribution, then we know what its shape looks like: about 34.1% of the values fall within one standard deviation below the mean, and another 34.1% within one standard deviation above it.

Thus we can calculate an interval and say that the value will be between X and Y with 68% (34.1 + 34.1) confidence or, for example, 95% confidence.

Let's see an example using Student's t-test. There are other ways to do it, but the t-test is a good way to go, because it is slightly more tolerant of small experimental samples, where the spread itself has to be estimated from imperfect measurements.

Let's use the "precip" data set again, "Precipitation [in/yr] for 70 US cities"
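
Those percentages can be checked directly with R's normal distribution functions. A minimal sketch; the variable names are only illustrative.

```r
within_1sd <- pnorm(1) - pnorm(-1)  # share of values within 1 standard deviation of the mean
within_2sd <- pnorm(2) - pnorm(-2)  # share of values within 2 standard deviations
z_95       <- qnorm(0.975)          # standard deviations needed to cover exactly 95%
round(c(within_1sd, within_2sd, z_95), 3)
```

This gives roughly 68.3%, 95.4% and 1.96, which is where the familiar "mean plus or minus about 2 standard deviations covers 95%" comes from.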
> t.test(precip)

and among other information we can see:

95 percent confidence interval:
31.61748 38.15395

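This means we can be 95% confident that the average precipitation of the wider population of US cities these 70 were sampled from lies between 31.61748 and 38.15395 in/yr. Note that this is an interval for the average, not for the precipitation of any single city.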
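
To see where those two numbers come from, the interval can be rebuilt by hand from the mean, the standard error and a quantile of the t distribution. This is a sketch of the standard one-sample formula; the variable names are only illustrative.

```r
n      <- length(precip)
m      <- mean(precip)
se     <- sd(precip) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)       # t quantile for a two-sided 95% interval
manual_ci <- m + c(-1, 1) * t_crit * se
manual_ci                             # should match t.test(precip)$conf.int
```

Because the standard error divides by the square root of the sample size, the interval describes how well we know the average, and it narrows as more cities are added.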

You see, we can guess the future to some extent :)

Back to the muffins story:

Thus, given the data:

{ monday:2, tuesday:3, wednesday:1, thursday:2, friday:68, saturday:1, sunday:0 }

1. Does it follow a normal distribution?

> muffins = c(2,3,1,2,68,1,0)
> qqnorm(muffins, ylab = "Muffins sold per day")
> qqline(muffins)

There is a nice-looking line through most of the points, except one! That one is not normal (that is, it does not fit a normal distribution).
Plotting this graph is also very useful for spotting such strange values.

2. Let's calculate the confidence intervals for that distribution:

2.1 WITH the un-normal point
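
Strange values can also be flagged numerically. The sketch below uses a different technique from the QQ plot: the box plot rule, which marks anything lying more than 1.5 box lengths beyond the box. `boxplot.stats` is standard R; the name `flagged` is just illustrative.

```r
muffins <- c(2, 3, 1, 2, 68, 1, 0)
flagged <- boxplot.stats(muffins)$out  # values beyond the whiskers of a box plot
flagged                                # only Friday's 68 is flagged
```

A flagged value is not automatically wrong; it is a prompt to go and find out what happened that day.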

> t.test(muffins)
95 percent confidence interval:
-12.26252 34.26252

Say what? Negative, even... This does not make sense: we cannot sell a negative number of muffins. This happens because of that un-normal point, which inflates the spread of the data. We cannot rely on this calculation for data that is not normally distributed!

Let's try now without that point.

2.2 WITHOUT the un-normal point.

In statistics language the un-normal point is called an outlier, and one common trick is simply to remove it from the data when it is too far out to be representative: it is most likely a measurement error or an unusual event.

So now we get:

> muffins_clean = c(2,3,1,2,1,0)
> t.test(muffins_clean)
95 percent confidence interval:
0.3993426 2.6006574

OK, you see, this makes more sense. It says we can be 95% confident that the average number of muffins sold per day is between 0.3993426 and 2.6006574. Rounding outward, that means between 0 and 3.

If we want a narrower interval (at the price of lower confidence), we can do:
> t.test(muffins_clean, conf.level = 0.75)
75 percent confidence interval:
0.9429669 2.0570331

Thus, with 75% confidence, the average daily sales are between about 1 and 2 muffins. Which makes a lot more sense than the originally estimated 11!!

A rule of thumb is: the more precise we want the estimate to be, the more data we should include in the test.
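
That rule of thumb can be seen in the formula for the interval itself. In this sketch the standard deviation `s` and the sample sizes are made-up numbers; the only thing that changes between the three cases is how much data we have.

```r
s <- 1                                    # assumed sample standard deviation (made up)
n <- c(6, 60, 600)                        # hypothetical sample sizes
half_width <- qt(0.975, df = n - 1) * s / sqrt(n)  # half-width of a 95% interval
round(half_width, 3)                      # shrinks as n grows
```

Since the width shrinks with the square root of the sample size, you need about four times the data to halve the interval.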


Statistics Part 2, the basic measures

Basic Measures

This is part 2 of a 3-part brief introduction to statistics. Here for part 1 and here for part 3.

Wikipedia definition for statistics: Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.

And it's the absolute basis for data mining! So it is a required subject for aspiring data miners.

A common situation is that you have a bunch of numbers that you want to summarize: to understand which numbers appear most frequently and what the central value (average) is (descriptive statistics), but maybe also to use them to predict what will happen in the future (inferential statistics)...

These numbers are called sample data from an experiment. They could be any number of things:

the number of muffins sold in a bakery,
the number of times people click the "buy now" button in your online bakery,
the number of purple cars on Sundays,
the number of drunk people at 1 am versus at 5 am,
how many hours your cat sleeps per day,
how many shots on target a basketball player achieves.

Some of these, if we create a statistic from them, can prove to be useful. For example, suppose you can only buy eggs at the market on the weekend, and you are wondering how many eggs you need for baking muffins for one full week. If you order too many eggs they will go bad and you will have to throw them away; if you order too few, you won't have enough for the whole week.
So, if you create a statistic of how many muffins you sold last week, you will see how many eggs you need to buy.

Using R software.
We will go through some methods used for analyzing numbers (making statistics) using the R software, which was built exactly for this kind of thing!

Basic Statistic concepts
There are a few basic statistical concepts to be aware of.

Measures of central tendency
The simplest way to reduce a set of data and still retain part of the information is to summarize the set with a single value.

mean
answers: what is the average of all the numbers?
> mean(c(10,11,12,9,13,11,11))
11

median
answers: (if we put all the numbers in order) what is the value in the middle?
> median(c(10,11,12,9,13,11,11))
11

mode
answers: what is the value that appears most frequently?
make a histogram, for example:
> hist(c(10,11,12,9,13,11,11))
look at the graph and you can see that the tallest bar is the one containing 11, which means that 11 shows up most frequently.
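
The mode can also be read off without a graph by counting the values. Note that R's own `mode()` function does something unrelated (it reports how a value is stored), so this sketch uses `table()` instead; the names `counts` and `mode_value` are just illustrative.

```r
x <- c(10, 11, 12, 9, 13, 11, 11)
counts <- table(x)                       # how many times each value occurs
mode_value <- as.numeric(names(counts)[counts == max(counts)])
mode_value                               # 11, which occurs three times
```

If several values tie for the highest count, this returns all of them.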

Measures of spread
While the measures of location attempt to identify the most representative value in a set of data, a data description is not complete until the spread (variability) is also known. In fact, the basic numerical description of a data set requires measures of both centre and spread.

Range
answers: how broad is the interval between the min and max values in the data?
> range(c(10,11,12,9,13,11,11))
9 13

box plots
answers: what does the spread of the data look like, visually?
> boxplot(c(10,11,12,9,13,11,11))

variance
answers: how far, on average, are the values from the mean? (measured as squared distances)
> var(c(10,11,12,9,13,11,11))
1.66667

standard deviation
answers: how far, on average, are the values from the mean? It is the square root of the variance, so it is expressed in the same units as the data and is easier to interpret.
> sd(c(10,11,12,9,13,11,11))
1.29
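
To see what var and sd actually compute, both can be rebuilt by hand. This sketch uses the sample formulas (dividing by n - 1), which is what R's `var` and `sd` do; the variable names are only illustrative.

```r
x <- c(10, 11, 12, 9, 13, 11, 11)
deviations <- x - mean(x)                          # distance of each value from the mean
manual_var <- sum(deviations^2) / (length(x) - 1)  # average squared distance
manual_sd  <- sqrt(manual_var)                     # back to the units of the data
c(manual_var, var(x))                              # the two should agree
c(manual_sd, sd(x))                                # the two should agree
```

Here the squared deviations add up to 10, and 10 / 6 gives the 1.66667 shown above.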

Measures of association
measure how strongly 2 sets of data are associated.

Covariance
answers: how much do two variables change together?
> cov(1:10, 5:14)
9.1667

Correlation, between -1 and 1
answers: how much do two variables change together? Like the covariance, but rescaled to lie between -1 and 1, so it is more easily interpreted.
> cor(1:10, 5:14)
1
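
The link between the two measures is that correlation is the covariance divided by both standard deviations. A minimal sketch (variable names are illustrative), including a case where one variable falls as the other rises:

```r
x <- 1:10
y <- 5:14
manual_cor <- cov(x, y) / (sd(x) * sd(y))  # covariance rescaled by both spreads
manual_cor                                 # 1: y rises in lockstep with x
cor(x, rev(y))                             # -1: y falls exactly as x rises
```

Values near 0 would mean there is no straight-line relationship between the two variables.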

Correlation values are quickly grasped by looking at the following graph:


Once upon a time...

Another audio video experiment, now with voice.

Statistics Part 1, the Muffins Story

What are statistics good for?

Imagine you own a bakery and you have to figure out how many chocolate muffins you should cook per day.

So you go and talk to the chef:

You: How many muffins did we sell last week?
Chef: 77
You: Ok.
You(think): so if we sold 77 muffins in a week, I just need to calculate the average for each day and I will know how many muffins we have to cook! So 77 in 7 days (77/7=11) is 11 per day!
You: Please cook 11 per day.
Chef: So many? Are you sure?
You(think): Well, why not? I calculated the average per day, everybody knows that the average is the correct way to go, what could be wrong??
You: Yes, I'm sure! Cook 11!
... FAIL...

What could be wrong?

Example 1

If the amount of muffins sold per day was like this:

{ monday:10, tuesday:11, wednesday:12, thursday:9, friday:13, saturday:11, sunday:11 }

Then if you do an average:

(10+11+12+9+13+11+11)/7=11

You get 11 per day. And by looking at the daily results it is reasonable to expect that you will sell around 11 per day in future. All cool.

Example 2

But what really happened last week was that on Friday there was a school visit to the bakery, so the real numbers of muffins sold per day, over the whole week, were:

{ monday:2, tuesday:3, wednesday:1, thursday:2, friday:68, saturday:1, sunday:0 }

Thus, doing the average:

(2+ 3+ 1+ 2+ 68+ 1+ 0)/7=11

The average is 11 per day, so should the chef cook 11 per day? No! You can clearly see that baking 11 will be too much, even though that's what the average says!! So sometimes a simple average does not tell the whole story, and that's what statistics are for: to avoid these mistakes and to be able to calculate reliable predictions.

Hope that gives some motivation for why learning statistics is useful.
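
The difference between the two weeks shows up as soon as the average is put next to the median (the middle value once the days are sorted, covered in part 2). A minimal sketch in R; `week1` and `week2` are just illustrative names for the two examples.

```r
week1 <- c(10, 11, 12, 9, 13, 11, 11)          # Example 1
week2 <- c(2, 3, 1, 2, 68, 1, 0)               # Example 2, with the school visit
c(mean = mean(week1), median = median(week1))  # 11 and 11: the average is typical
c(mean = mean(week2), median = median(week2))  # 11 and 2: the average is misleading
```

When the mean and the median disagree this much, one or two unusual days are probably dragging the average away from a typical day.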

This is part 1 of 3, part 2 is here and part 3 here.