A common problem in data mining is that the data you need to work with is too big and painfully slow to process, especially if you want to run complex mining algorithms on it.
Sampling lets you randomly select a subset of examples from the total data and then work with only that smaller subset. It works because, quite often, the subset is statistically representative of the total data we are looking at.
The tricky part is choosing how big the sample should be: it has to be big enough to represent the total data accurately, and at the same time as small as possible, so that it is easy to work with.
For example, let's look at the mean weight of 50 bags of pretzels.
I'm using the Clojure language with the Incanter library.
We see that the real mean (of the total data) is 451.2, so our sample mean should come as close to that value as possible.
For this experiment, I'll use Incanter's sample function, which returns a sample of the given size from the given collection.
So, for a sample size of 5, the whole procedure is: pick 5 pretzel bag weights at random, calculate their mean, and subtract the real mean from it. The resulting value is the error (how far we are from the real mean).
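As a rough sketch, that single-sample procedure could look like the following in plain Clojure. The `weights` vector holds made-up placeholder values (not the 50 pretzel measurements), and the `mean`, `sample`, and `sample-error` helpers are hypothetical stand-ins written for this illustration rather than Incanter's own functions. The sketch also assumes that we draw with replacement and that the error is taken as an absolute value.

```clojure
;; Hypothetical stand-in data: invented weights, NOT the real pretzel data.
(def weights [464 450 450 456 452 433 446 446 450 447
              442 438 452 447 460 450 453 456 446 433])

(defn mean
  "Arithmetic mean of a collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sample
  "Draws n values from xs at random, with replacement (an assumption)."
  [xs n]
  (let [v (vec xs)]
    (repeatedly n #(rand-nth v))))

(defn sample-error
  "Absolute difference between the mean of one random sample of size n
  and the mean of all of xs."
  [xs n]
  (Math/abs (double (- (mean (sample xs n)) (mean xs)))))

;; One run with a sample size of 5; the result changes on every call.
(println "error for one sample of 5:" (sample-error weights 5))
```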
Doing that only once, we might happen to pick 5 bag weights whose mean matches the real mean exactly, and be fooled by a one-time lucky strike. To avoid that, let's repeat the procedure 10,000 times.
And let's also use increasingly bigger sample sizes:
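A minimal sketch of the repeated experiment, again in plain Clojure and under the same assumptions: the weights are invented placeholders, sampling is done with replacement, and the errors from the 10,000 runs are averaged as absolute values. The helper names `sample-error` and `average-error` are hypothetical, and the sample sizes 5, 10, and 40 are simply the ones discussed in this post.

```clojure
;; Hypothetical stand-in data: invented weights, NOT the real pretzel data.
(def weights [464 450 450 456 452 433 446 446 450 447
              442 438 452 447 460 450 453 456 446 433])

(defn mean
  "Arithmetic mean of a collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sample-error
  "Absolute error of the mean of one random sample of size n,
  drawn with replacement from xs."
  [xs n]
  (let [v      (vec xs)
        sample (repeatedly n #(rand-nth v))]
    (Math/abs (double (- (mean sample) (mean v))))))

(defn average-error
  "Average of sample-error over `trials` independent samples of size n."
  [xs n trials]
  (mean (repeatedly trials #(sample-error xs n))))

;; Repeat the procedure 10,000 times for each sample size.
(doseq [n [5 10 40]]
  (println "sample size" n "-> average error" (average-error weights n 10000)))
```

The exact numbers vary from run to run and depend on the data, so only the trend across sample sizes is meaningful here.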
As expected, the bigger the sample size, the smaller the error and the closer we get to the real mean.
So, how small can the subset really be? Well, it all depends on how accurate you need to be, and thus how big an error you can tolerate.
Also note that the error does not decrease linearly: increasing the sample size from 5 to 10 decreased the error by 1, and increasing it from 10 to 40 also decreased the error by 1.
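One way to make sense of this: for independent random draws, the standard error of a sample mean is the standard deviation divided by the square root of the sample size, so the sample has to grow fourfold to halve the error. The sketch below only illustrates that rule of thumb. The `standard-error` helper is hypothetical, the standard deviation of 10.0 is an arbitrary made-up value, and the quantity measured in the experiment above need not match this formula exactly.

```clojure
(defn standard-error
  "Theoretical standard error of the mean for a sample of size n,
  given a population standard deviation sigma."
  [sigma n]
  (/ sigma (Math/sqrt (double n))))

;; sigma = 10.0 is an arbitrary value chosen only for illustration.
(doseq [n [5 10 40 160]]
  (println "sample size" n "-> standard error" (standard-error 10.0 n)))
```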
So the best way to get an idea of how the error varies with the sample size is to visualize it:
[Figure: error as a function of sample size]
Note that the curve starts to level off at the end, meaning that adding more samples when the sample is already big has less impact than adding them when the sample is small. And of course, with different data the error curve will show a different pattern, so visualizing it is a great way to figure out the sample size you need.
Kudos to Incanter; I'm having a great experience using it.