Alexandre's Notebook: Visualizing Sampling Size Accuracy with Pretzels

A common problem in data mining is that the data you need to work with is too big (too many records) and painfully slow to process, especially if you want to run complex mining algorithms on it.

Sampling lets you randomly select a subset of examples from the total data and then work with only that smaller subset. It works because, quite often, the subset is statistically equivalent to the total data we are looking at.

The tricky part is choosing how big the sample should be: the subset has to be big enough to represent the total data accurately, and at the same time we want it as small as possible so that it is easy to work with.

For example, let's look at the mean weight of 50 bags of pretzels.

I'm using the Clojure language with the Incanter library.

We see that the real mean (of the total data) is 451.2, so our sample mean should come as close to that value as possible.

For this experiment, I'll use Incanter's sample function, which returns a sample of the given size from the given collection.

So, for a sample size of 5, the whole procedure is: pick 5 pretzel bag weights at random, calculate their mean, and subtract the real mean from it. The resulting value is the error (how far we are from the real mean).
If we did that only once, we might happen to pick 5 bag weights whose mean matches the real mean exactly, and be fooled by a one-time lucky strike. To avoid that, let's repeat the procedure 10,000 times.
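
The procedure above can be sketched in plain Clojure roughly as follows. This is only a sketch under assumptions: `weights` is hypothetical stand-in data (not the real bag weights), `draw-sample`, `mean` and `mean-error` are illustrative helpers written here rather than Incanter's functions, the sample is drawn without replacement, and the error is taken as an absolute difference so that positive and negative misses don't cancel out when averaged.

```clojure
;; Seeded generator so the sketch gives the same numbers on every run.
(def ^java.util.Random rng (java.util.Random. 42))

;; Hypothetical stand-in data: 50 made-up bag weights between 440 and 464.
(def weights
  (vec (repeatedly 50 #(+ 440 (.nextInt rng 25)))))

(defn mean
  "Arithmetic mean of a collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn draw-sample
  "Picks n items from coll at random, without replacement."
  [coll n]
  (let [items (java.util.ArrayList. ^java.util.Collection coll)]
    (java.util.Collections/shuffle items rng)
    (vec (take n items))))

(defn mean-error
  "Average absolute distance between the sample mean and the real mean,
  over the given number of trials."
  [population n trials]
  (let [real-mean (mean population)]
    (mean (repeatedly trials
                      #(Math/abs (double (- (mean (draw-sample population n))
                                            real-mean)))))))

(println "mean error for samples of 5:" (mean-error weights 5 10000))
```

Averaging over many trials is what removes the "lucky strike" effect: a single sample can land on the real mean by chance, but the average miss over 10,000 samples cannot.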

Let's also use increasingly bigger sample sizes.

As expected, the bigger the sample size, the smaller the error and the closer the sample mean gets to the real mean.

So, how small can the subset really be? Well, it all depends on how accurate you need to be, and thus how big an error you can tolerate.
Also note that the error does not decrease linearly: going from 5 to 10 samples decreased the error by 1, and going from 10 to 40 samples also decreased it by only 1.
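
One way to see why the decrease is not linear: for a sample of n items drawn without replacement from a population of N, the standard error of the sample mean is the population standard deviation times sqrt((N - n) / (n (N - 1))). The sketch below just tabulates that factor for N = 50. The function name `relative-error` is made up for illustration, and it assumes sampling without replacement, which may not match how the sample in the experiment was drawn.

```clojure
(defn relative-error
  "Standard error of the sample mean, as a fraction of the population
  standard deviation, for n items drawn without replacement from N."
  [N n]
  (Math/sqrt (/ (- N n) (* (double n) (- N 1)))))

;; Print the factor for a few sample sizes out of 50 bags.
(doseq [n [5 10 20 40]]
  (println n (format "%.3f" (relative-error 50 n))))
```

Because the factor shrinks roughly with the square root of n, each extra bag buys less accuracy than the one before it.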

So, the best way to get an idea of how the error varies with the sample size is to visualize it.

Note that the curve starts to level off at the end, meaning that adding more samples when the sample is already big has less impact than adding them when the sample is small. And of course, with different data the error curve will show a different pattern, so visualizing it is a great way to figure out the sample size you need.
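
A rough way to eyeball such a curve without a charting library is to print one bar per sample size. This is a plain-text sketch, not the Incanter chart: `weights` is hypothetical stand-in data, `mean`, `mean-error` and `bar` are illustrative helpers, the error is an absolute difference, and 1,000 trials per size are used just to keep it quick.

```clojure
;; Hypothetical stand-in data, not the real bag weights.
(def weights (vec (repeatedly 50 #(+ 440 (rand-int 25)))))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn mean-error
  "Average absolute error of the sample mean for samples of size n."
  [population n trials]
  (let [real-mean (mean population)]
    (mean (repeatedly trials
                      #(Math/abs (double (- (mean (take n (shuffle population)))
                                            real-mean)))))))

(defn bar
  "One '#' per 0.1 units of error."
  [error]
  (apply str (repeat (Math/round (* 10.0 (double error))) "#")))

;; One line per sample size: size, mean error, and a bar of matching length.
(doseq [n (range 5 50 5)]
  (let [e (mean-error weights n 1000)]
    (println (format "%2d %5.2f %s" n e (bar e)))))
```

The bars get shorter quickly at first and then more slowly, which is the levelling off described above.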

Kudos to Incanter, I'm having a great experience using it.

3 comments:

Anonymous said...: How do u compare R to Incanter? Are u ready to leave R for it?
Nice to hear a fellow countryman with such knowledge of the things that interest me too!
Greetings from Porto!

Filipe; December 26, 2009 at 2:14 PM
Alexandre Martins said...: Hi, from my still small experience with them, I'd say that R has been around longer and has a lot of contributors, so it is more mature, with a bigger library of functionality and more documentation and learning sources available.

Incanter is newer, but being on the JVM, it can leverage Java libraries (which is a big advantage). And if you develop an app on the JVM, then Incanter merges into your application seamlessly.

Incanter is built with Clojure, which is a wonderful programming language, and if you develop in Clojure then it is a dream match to be able to use the same language for statistical analysis that you use to develop your app.

I do have a bias for Incanter, and I suspect it will grow in popularity in the near future. I prefer to use Incanter especially because I like Clojure development, but it can't do all the things that R can.

Cheers, and Happy Hacking :) Alex.; December 28, 2009 at 11:31 AM
Anonymous said...: Clojure seems pretty cool! I use R but I find it slow and the language has some limitations. I was searching for ways to speed up R and found Clojure/Incanter! Then I investigated Clojure, and it has amazing features!!! I was also looking for a general purpose language since I was dissatisfied with C and Java... Ruby was going to be my choice but then I found Clojure...:)
And I can use weka libraries directly on my Clojure apps!!!
It's great!:)
Happy HackingDays and 2010!

Filipe; December 31, 2009 at 1:00 PM