Alexandre's Notebook: July 2009

book: Super Crunchers

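Nice book, giving plenty of reasons why data crunching is so important.
I also enjoyed the story about the 2 SD (standard deviation) rule, which puts a range around an estimate: about 95% of normally distributed values fall within two standard deviations of the mean.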

Visualizing Sampling Size Accuracy with Pretzels

A common problem in data mining is that the data you need to work with is too big and painfully slow to process, especially if you want to run complex mining algorithms on it.

Sampling lets you randomly select a subset of examples from the total data and then work with only that smaller subset. It works because, quite often, the subset is statistically representative of the total data we are looking at.

The tricky part is choosing how big this sample should be: the subset has to be big enough to represent the total data accurately, and at the same time we want it as small as possible, so that it is easy to work with.

For example, let's look at the mean weight of 50 bags of pretzels.

I'm using Clojure with the Incanter library.

The real mean (of the total data) is 451.2, so our sample mean should come as close to that value as possible.

For this experiment, I'll use Incanter's sample function, which returns a sample of the given size from the given collection.

So, for a sample size of 5, the whole procedure is: pick 5 pretzel bag weights at random, calculate their mean, and subtract the real mean from it. The resulting value is the error (how far we are from the real mean).
Doing that only once, we might happen to pick 5 bag weights that match the real mean exactly and be fooled by a one-time lucky strike. To avoid that, let's repeat the procedure 10,000 times.
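The procedure above can be sketched in plain Clojure. This is a minimal sketch, not the Incanter code behind the numbers in this post: the weights are made-up stand-ins for the real pretzel data, `take` over `shuffle` stands in for Incanter's sample (it draws without replacement, which may differ from Incanter's default), and `mean`, `sample-error` and `mean-error` are hypothetical helper names.

```clojure
;; Hypothetical data: 50 made-up bag weights, NOT the real pretzel data.
(def weights (vec (take 50 (cycle [448 452 455 449 451 447 453 450 454 446]))))

(defn mean
  "Arithmetic mean of a collection of numbers, as a double."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(def real-mean (mean weights))

(defn sample-error
  "Draws n weights at random (without replacement) and returns how far
   their mean is from the real mean."
  [n]
  (Math/abs (- (mean (take n (shuffle weights))) real-mean)))

(defn mean-error
  "Repeats the sampling `trials` times and averages the errors, so a
   single lucky (or unlucky) draw cannot fool us."
  [n trials]
  (mean (repeatedly trials #(sample-error n))))

(println "real mean:" real-mean)
(println "average error for sample size 5:" (mean-error 5 10000))
```

Averaging the absolute error keeps positive and negative misses from cancelling each other out over the 10,000 runs.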

And let's also use increasingly bigger sample sizes.

As expected, the bigger the sample size, the smaller the error and the closer we get to the real mean.

So, how small can the subset really be? Well, it all depends on how accurate you need to be, and thus how big an error you can tolerate.
Also note that the error does not decrease linearly: increasing the sample size from 5 to 10 decreased the error by 1, and increasing it from 10 to 40 also decreased the error by only 1.
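A back-of-the-envelope way to see why the gains flatten out, assuming the samples are drawn without replacement from N bags whose weights have standard deviation σ (computed over all N bags): the standard error of the mean of a sample of size n is

```latex
\mathrm{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}} \, \sqrt{\frac{N - n}{N - 1}}
```

The first factor shrinks only with the square root of n, so halving the error takes roughly four times as many samples. The second factor pulls the error down to zero as n approaches N (here N = 50). This is a textbook approximation, not something measured from the pretzel data.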

So the best way to get an idea of how the error varies with the sample size is to visualize it.

Note that the error curve starts to level off at the end, meaning that adding more samples when the sample is already big has less impact than adding them when the sample is small. And of course, with different data the error curve will show a different pattern, so visualizing it is a great way to figure out the sample size you need.
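Without a charting library at hand, even a crude text plot shows the shape of the curve. A minimal sketch in plain Clojure, using made-up weights rather than the real pretzel data; `mean-error` and `bar` are hypothetical helpers, and the scale of one `#` per 0.05 of error is an arbitrary choice that happens to suit this made-up data.

```clojure
;; Hypothetical data: 50 made-up bag weights, NOT the real pretzel data.
(def weights (vec (take 50 (cycle [448 452 455 449 451 447 453 450 454 446]))))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn mean-error
  "Average absolute distance between the mean of a random sample of
   size n (drawn without replacement) and the mean of all the weights."
  [n trials]
  (let [real-mean (mean weights)]
    (mean (repeatedly trials
                      #(Math/abs (- (mean (take n (shuffle weights)))
                                    real-mean))))))

(defn bar
  "Renders an error as a row of # characters, one per 0.05 of error."
  [err]
  (apply str (repeat (Math/round (/ err 0.05)) "#")))

;; One row per sample size, from 5 up to all 50 bags.
(doseq [n (range 5 51 5)]
  (println (format "%2d %s" n (bar (mean-error n 2000)))))
```

The last row is empty because a sample of all 50 bags has exactly the real mean.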

Kudos to Incanter, I'm having a great experience using it.

Quoting Björk on a Friday morning

From my Twitter:

Good Morning World, from Sunny Tallinn. Am enjoying my latte and feeling well! but i know is not for long...

As Björk would say: It's all so peaceful and quiet until you open the eMAIL, bam boom, you blow a fuse and the sky up above is caving in.

re-post: Java, please stop ruining my fun.

Spot-on description of Java hiccups that slow down Clojure development.

To minimize my own pain here, I've put together a little template project that I start from when writing Clojure code. It looks like this:

clj-template/
    build.properties
    build.xml
    clj
    lib/
        clojure-contrib.jar
        clojure.jar
        jline-0.9.94.jar
    README
    src/
        myapp/
            helper.clj
        myapp.clj
    test/
        myapp/
            my_test.clj
        myapp_test.clj

I edit build.properties to specify the main class of the app (myapp) and the class of the tests (myapp-test).
lib/ contains all the jars the app needs: just drop a new one into lib/ and it is automatically included when running or compiling.
./clj gives you a REPL, or runs a script if called with a .clj file as argument, like: ./clj src/myapp.clj. It automatically includes all the jars in lib/, of course.
build.xml is used by Ant (Apache's build tool for Java), where I have tasks to compile, create a stand-alone jar, run tests, and extract documentation.
And then there's a bunch of small details to be aware of: the name of the class has to match the file name, the main class needs (:gen-class) to be able to create a stand-alone jar, the app's main entry point is always defined as "-main", etc.
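To make those last details concrete, here is a minimal sketch of what the main file of such a project could look like. The namespace name myapp matches src/myapp.clj from the layout above; the `greeting` helper and the message it builds are made up for illustration and are not part of the template.

```clojure
;; src/myapp.clj: the namespace name has to match the file name.
(ns myapp
  ;; When the namespace is compiled ahead of time, :gen-class makes the
  ;; compiler emit a Java class named myapp that a stand-alone jar can
  ;; use as its main class.
  (:gen-class))

(defn greeting
  "Hypothetical helper: builds the message printed by -main."
  [args]
  (str "Hello from myapp, args: " (vec args)))

(defn -main
  "Entry point: with :gen-class, -main backs the class's static main method."
  [& args]
  (println (greeting args)))
```

Keeping -main thin and putting the real work in ordinary functions like `greeting` also makes the code easy to call from the REPL and from tests.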

I see all these hiccups as a small cost of being on the Java platform. When you look at the benefits of being on the Java platform, they far outweigh these costs. But yes, they are annoying at times, so if they can be fixed, please do. After all, programming should be about solving algorithmic puzzles, not dependency puzzles.

If programmers got paid to remove code from software instead of writing new code, software would be a whole lot better.

- Nicholas Negroponte

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

- Antoine de Saint-Exupéry

The letter I have written today is longer than usual because I lacked the time to make it shorter.

- Blaise Pascal
