Sampling lets us select a random subset of examples from the total data and then work with only that smaller subset. It works because, quite often, the subset is statistically equivalent to the total data we are looking at.
The tricky part is choosing how big the sample should be: it must be big enough to represent the total data accurately, yet as small as possible so it stays easy to work with.
I'm using the Clojure language, with the Incanter library:
(use '(incanter core stats charts))

;; weights (in grams) of 50 randomly sampled bags of pretzels
(def weights [464 447 446 454 450 457 450 442
              433 452 449 454 450 438 448 449
              457 451 456 452 450 463 464 453
              452 447 439 449 468 443 433 460
              452 447 447 446 450 459 466 433
              445 453 454 446 464 450 456 456
              447 469])

;; calculate the real mean
(mean weights) ;=> 451.2
We see that the real mean (of the total data) is 451.2, so our sample mean should come as close to that value as possible.
For this experiment I'll use Incanter's sample function, which returns a sample of the given size from the given collection.
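As a rough sketch of what that does, here is a plain-Clojure equivalent of drawing a random subset without replacement (the `sample-n` name and the `shuffle`-based approach are my own illustration, not Incanter's implementation):

```clojure
;; Sketch: draw n random elements (without replacement) from a
;; collection by shuffling it and taking the first n.
;; `sample-n` is a hypothetical helper, not part of Incanter.
(defn sample-n [coll n]
  (vec (take n (shuffle coll))))

;; pick 3 weights at random from a few example bags
(sample-n [464 447 446 454 450 457] 3)
```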
So, for a sample size of 5, the whole procedure is: pick 5 pretzel-bag weights at random, calculate their mean, and subtract it from the real mean; the resulting value is the error (how far we are from the real mean).
Doing that only once, we might happen to pick 5 bag weights whose mean matches the real mean exactly, and thus be fooled by a one-time lucky strike. To avoid that, let's repeat the procedure 10,000 times and average the errors.
And let's also use increasingly bigger sample sizes:
;; 5 samples => 2.9595399991095066
(mean (for [_ (range 10000)] (abs (- (mean weights) (mean (sample weights :size 5))))))

;; 10 samples => 2.0840500000983475
(mean (for [_ (range 10000)] (abs (- (mean weights) (mean (sample weights :size 10))))))

;; 40 samples => 1.0537100001504645
(mean (for [_ (range 10000)] (abs (- (mean weights) (mean (sample weights :size 40))))))
As expected, the bigger the sample size, the smaller the error and the closer we get to the real mean.
So, how small can the subset really be? Well, it all depends on how accurate you need to be, and thus on how big an error you can tolerate.
Also note that the error does not decrease linearly: going from 5 to 10 samples decreased the error by about 1, but going from 10 to 40 samples also decreased it by only about 1.
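That pattern is no accident: the standard error of the mean falls as sd/√n, so quadrupling the sample size (10 → 40) roughly halves the error, while merely doubling it (5 → 10) shrinks it by only √2 ≈ 1.4. Here is a quick sketch computing the predicted standard error for each sample size (plain Clojure, no Incanter needed; `mean*` and `sd*` are my own helper names):

```clojure
;; Predicted standard error of the mean, sd/sqrt(n), for the pretzel
;; weights at the three sample sizes used above. The measured errors
;; (2.96, 2.08, 1.05) should shrink in roughly the same proportions.
(def weights [464 447 446 454 450 457 450 442 433 452 449 454 450 438
              448 449 457 451 456 452 450 463 464 453 452 447 439 449
              468 443 433 460 452 447 447 446 450 459 466 433 445 453
              454 446 464 450 456 456 447 469])

(defn mean* [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sd* [xs]
  ;; sample standard deviation (divide by n - 1)
  (let [m (mean* xs)]
    (Math/sqrt (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
                  (dec (count xs))))))

(for [n [5 10 40]]
  (/ (sd* weights) (Math/sqrt n)))
```

The predicted values won't equal the measured errors exactly (the average absolute error differs from the standard error by a constant factor), but the ratios between sample sizes should match.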
So the best way to get an idea of how the error varies with the sample size is to visualize it:
(defn sample-mean-error
  [tdata tsize]
  (mean (for [_ (range 10000)]
          (abs (- (mean tdata)
                  (mean (sample tdata :size tsize)))))))

(defn progressive-sample-mean-error
  [tdata]
  (for [tx (range 2 (count tdata))]
    (sample-mean-error tdata tx)))

(def serror (progressive-sample-mean-error weights))

(view (xy-plot (range 2 (count weights)) serror
               :title "Sample mean error"
               :y-label "error"
               :x-label "number of samples"))

Note that the curve starts to level off at the end, meaning that adding more samples when the sample size is already big has less impact than adding them when the sample size is small. And of course, with different data the error curve will show a different pattern, so visualizing it is a great way to figure out the sample size you need.
Kudos to Incanter; I'm having a great experience using it.
3 comments:
How do you compare R to Incanter? Are you ready to leave R for it?
Nice to hear a fellow countryman with such knowledge of the things that interest me too!
Hugs from Porto!
Filipe
Hi, from my still-small experience with them, I'd say that R has been around longer, with a lot of contributors, so it is more mature, with a bigger library of functionality and more documentation and learning resources available.
Incanter is newer, but being on the JVM it can leverage Java libraries (which is a big advantage). And if you develop an app on the JVM, then Incanter merges into your application seamlessly.
Incanter is built with Clojure, which is a wonderful programming language, and if you develop in Clojure it is a dream match to be able to use the same language for statistical analysis that you use to build your app.
I do have a bias for Incanter, and I suspect it will grow in popularity in the near future. I prefer to use Incanter especially because I like Clojure development, but it can't do all the things that R can.
Hugs, and Happy Hacking :) Alex.
Clojure seems pretty cool! I use R but I find it slow and the language has some limitations. I was searching for ways to speed up R and found Clojure/Incanter! Then I investigated Clojure, and it has amazing features!!! I was also looking for a general purpose language since I was dissatisfied with C and Java... Ruby was going to be my choice but then I found Clojure...:)
And I can use Weka libraries directly in my Clojure apps!!!
It's great!:)
Happy Hacking Days and 2010!
Filipe