Alexandre's Notebook: April 2009

re-post: Web Analytics fun

Nice one: Mathematica with the Google Analytics API

re-post: European Lisp Symposium 2009

programme details

João Pavão Martins and Ernesto Morgado will be presenting :) I had them as teachers in college, together with António Leitão; they're a very bright bunch. High-quality teaching comes out of that Artificial Intelligence group.

Clojure and Selenium

Motivation

I needed a kind of crawler to go through a list of pages, invoke some JavaScript on each one, and collect the output.

Curl or a regular HTTP lib doesn't do the trick because I need to run JavaScript on each requested page. For that I can use Selenium. Selenium is a great framework for web testing that drives a real browser directly, so it can run JavaScript.

Selenium can be scripted from Java which matches very well with my wish to learn Clojure :)

Solution

It's not really a crawler, in the sense that it does not go around automatically following all the links it finds. Instead, it gets the list of links to check from the site's sitemap.xml.

As some sitemap.xml files are huge, I also added a little pick-a-sample function that randomly selects only a subset of the URLs in the sitemap.
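The sampling idea can be sketched in a few lines. This is a rough Ruby illustration, not the Clojure code from coverager.clj: the names sitemap_urls and pick_a_sample, the example.com URLs, and the regex-based extraction of the loc entries are all assumptions made for the example.

```ruby
# Hypothetical sketch, not the original Clojure code.

# Pull every <loc>...</loc> entry out of a sitemap.xml string.
# A regex keeps the sketch dependency-free; a real XML parser is more robust.
def sitemap_urls(sitemap_xml)
  sitemap_xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten
end

# Randomly pick at most `size` urls, without repeats.
def pick_a_sample(urls, size)
  urls.sample(size)
end

sitemap = <<~XML
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://example.com/</loc></url>
    <url><loc>http://example.com/about</loc></url>
    <url><loc>http://example.com/contact</loc></url>
  </urlset>
XML

# Prints two of the three urls, chosen at random.
puts pick_a_sample(sitemap_urls(sitemap), 2)
```

Asking for a sample larger than the list simply returns every URL once, in random order.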

Code

I'm in the process of learning Clojure, so probably a lot of things could be improved.

For Selenium, we first need to start the server, then the client, and then use the client to browse the pages. It is not very elegant to have a “start server” and “start client” call at the top of the script and a “stop client” and “stop server” call at the end, so I've wrapped it all in a macro (one of the major strengths of Lisp-like languages).
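For readers who don't know Lisp macros, the same start/stop wrapping can be approximated in Ruby with a method that takes a block. This is only an analogy, not the macro from coverager.clj: FakeServer, FakeBrowser, visit and with_browser are hypothetical names, and the fake classes just record calls instead of talking to Selenium.

```ruby
# Hypothetical sketch: FakeServer and FakeBrowser only record calls in a log;
# they stand in for the real Selenium server and client.
class FakeServer
  def initialize(log)
    @log = log
  end

  def start
    @log << "start server"
  end

  def stop
    @log << "stop server"
  end
end

class FakeBrowser
  def initialize(log)
    @log = log
  end

  def start
    @log << "start client"
  end

  def stop
    @log << "stop client"
  end

  def visit(url)
    @log << "open #{url}"
  end
end

# Start server and client, hand the browser to the block, then stop
# whatever was started in reverse order, even if the block raises.
def with_browser(log)
  server = FakeServer.new(log)
  browser = FakeBrowser.new(log)
  server.start
  begin
    browser.start
    begin
      yield browser
    ensure
      browser.stop
    end
  ensure
    server.stop
  end
end

log = []
with_browser(log) { |browser| browser.visit("http://example.com/") }
puts log
```

The calling code only contains the browsing part; the setup and teardown live in one place, which is the point of the wrapper.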

The whole thing goes like this:

process-sitemap receives a sitemap, transforms it into a map (with xml-to-zip), collects the URLs in it, then picks a sample from them (with pick-a-sample) and calls check-pages with that sample.

check-pages gets a list of URLs. It starts by using the macro, obtains a-browser from it, then iterates over the list of URLs, calling check-a-page on each one (a-url). Note that at this point the standard output is redirected to a file, so I can log the results from check-a-page.

check-a-page gets a-browser and a-url, so you can guess what it will do :) It opens that URL in the browser, calls the JavaScript, and prints the return value of the JS call to standard output.

Hope Google does not mind me using their site as an example. But do not run this on Google's site, it's just an example; use it on your own site!

For this to run you will need a bunch of jar libs in your classpath. This is what my lib folder looks like:

lib/
    clojure-contrib.jar
    clojure.jar
    jline-0.9.94.jar
    selenium-java-client-driver.jar
    selenium-server.jar

I called this app “coverager”

Main Code: coverager.clj

And of course tests: coverager_test.clj

Take Aways

Clojure is great! It's my opinion that in the Lisp family of languages the code is more elegant and visually cleaner than in the C family.

I don’t care much for working directly with the Java language, but working on the JVM with other languages like JRuby, Clojure, and harnessing all the vast amount of Java libs and infrastructure out there is a major major advantage.

I suspect I will be spending more time with Clojure in the future :)

Photo: Lucky Dice

book: Collective Intelligence in Action

From the publisher's site:

Collective Intelligence in Action is a hands-on guidebook for implementing collective-intelligence concepts using Java. It is the first Java-based book to emphasize the underlying algorithms and technical implementation of vital data gathering and mining techniques like analyzing trends, discovering relationships, and making predictions. This book is for Java developers implementing collective intelligence in real, high-use applications.

Overall I liked it and learned loads. It does have more Java code in it than I'd wanted, but for those who intend to use the Java language the code in the book is handy. Well, it's a Java book after all, right? There are also some very useful class diagrams that will help you use the described code from JRuby, Clojure, Scala...

It goes through the whole process: explaining the data structures needed to represent data in your app, how to collect data from the web (blogs, etc.), theory & implementation of mining algorithms, leveraging open source tools like Weka and Lucene, up to how Amazon, Google News and Netflix do their data mining (to some extent).

I also liked the summaries of the different data mining (machine learning) techniques, very compact and nicely done.

Clojure Reference

Great reference (and quite complete, as far as I can tell) to the Clojure language, by Mark Volkmann.

And the Clojure page from the same author here.

re-post: Make your own Fonts

from: asaxeo.blogspot.com

re-post: Almost Viral

Article about online marketing strategies: Paid & Viral.

A couple of points:

Almost viral can be safer than full-on viral (especially if your product is new).
The real cost per acquisition should also account for the viral coefficient, and the growth of acquisitions is a geometric series.

Great use of math to build a formula that models the (real) cost per acquisition.
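To make the geometric-series point concrete: if every acquired user brings in k new users on average (the viral coefficient), one paid user eventually yields 1 + k + k^2 + ... = 1 / (1 - k) users, as long as k < 1. The sketch below works through that arithmetic in Ruby. It is not necessarily the exact formula from the article, and the function names are made up for the example.

```ruby
# Hypothetical helper names; the formula is the plain geometric series.

# Total users eventually produced by one paid user when each user
# brings in `k` new users on average (requires 0 <= k < 1).
def users_per_paid_user(k)
  raise ArgumentError, "k must be in [0, 1)" unless k >= 0 && k < 1
  1.0 / (1.0 - k)
end

# Paid cost spread over the paid user plus everyone they bring in.
def real_cost_per_acquisition(paid_cost, k)
  paid_cost / users_per_paid_user(k)
end

puts real_cost_per_acquisition(10.0, 0.5)
```

With a paid cost of 10 and k = 0.5, each paid user ends up being worth two users, so the real cost per acquisition works out to 5.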

re-post: How to Become As Rich As Bill Gates

fun read here

re-post: Shit is not easy. Stuff takes time.

Nice (but slightly harsh) "rant" from Steve Yegge about Complexity: Have you ever legalized marijuana?

book: The 4-Hour Workweek

A favorite of mine.

Some discussed topics and quotes:

80/20 Pareto's rule.
Parkinson's Law.
The need for an "Information Diet" and the excess of information.
Spending a lot of time on something unimportant does not make it important.
Lack of time is actually lack of priorities.
Automations to free you from work.
Virtual Assistants to free you from work.
People don’t want to be millionaires. They want to experience what they believe only millions can buy.
Less is not laziness. Focus on being productive instead of busy.
Slow down and remember this: Most things make no difference.

Thanks José Santos, for the suggestion.

A Google SMS

My geeky mind thinks it's kinda cool to get an SMS from Google.

(I blanked out the content)

Starting up Clojure simple tips

Here are a couple of things that I use when learning Clojure. I'm on a Mac and using Textmate.

Screencasts & Tutorial

Here is a series of 10 screencasts that I found quite useful: Intro to Clojure (YouTube playlist)

And there are also Rich Hickey's screencasts at http://clojure.blip.tv/, but they are longer, so start with the ones above.

For reading material, take a look at Mark Volkmann's very complete Clojure notes here, and his Clojure references here.

Folder Structure for code


bin/
    repl.sh
    run.sh
lib/
    clojure-contrib.jar
    clojure.jar
    jline-0.9.94.jar
    *.jar
README
src/
    *.clj
test/
    *.clj

Assuming this structure, it is possible to do some automation, see next.

Textmate Builds run.sh and repl.sh

To address the issue of including the jars you want to use in the classpath, here's a little Ruby script that I trigger from Textmate. It looks into the lib/ dir and creates the repl.sh and run.sh scripts:


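#!/usr/bin/env ruby -wKU

# Directory of the file currently open in Textmate (expected to be src/ or test/).
project_dir = File.dirname(ENV['TM_FILEPATH'])

# Every jar in lib/ goes on the classpath.
libs_path = Dir[File.join(project_dir, "..", "lib", "*.jar")]
classpath = ".:#{libs_path.join(':')}"

# run.sh runs the current file as a script.
File.open(File.join(project_dir, "..", "bin", "run.sh"), "w") do |file|
  file.puts "java -cp #{classpath} clojure.lang.Script #{ENV['TM_FILEPATH']}"
end

# repl.sh starts a REPL (through jline) on the current file.
File.open(File.join(project_dir, "..", "bin", "repl.sh"), "w") do |file|
  file.puts "java -cp #{classpath} jline.ConsoleRunner clojure.lang.Repl #{ENV['TM_FILEPATH']}"
end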

So if you want to use some new jars, just place them inside lib/ and trigger this Ruby code, which will regenerate the run.sh and repl.sh scripts.

Existing code

Another thing that I keep close by is existing Clojure code examples. I also use Textmate so that they are just a keyboard shortcut away (these are 3 different commands):

 git fetch
mate '/somewhere-in-your-disk/programming-clojure/'

svn up
mate '/somewhere-in-your-disk/clojure-contrib/'

svn up
mate '/somewhere-in-your-disk/svn_clojure/trunk'

You can get this code onto your machine from:

 Clojure core
http://code.google.com/p/clojure/
svn checkout http://clojure.googlecode.com/svn/trunk/ clojure-read-only

Clojure contrib
http://code.google.com/p/clojure-contrib/
svn checkout http://clojure-contrib.googlecode.com/svn/trunk/ clojure-contrib-read-only

Programming Clojure book:
http://github.com/stuarthalloway/programming-clojure/tree/master

API

Another handy keyboard shortcut opens the Clojure API web page, also directly from Textmate. Set up a new command with the code:

open http://clojure.org/api

Google it

A lot of people are starting to explore Clojure, so google it and you'll find a lot of posts and tips on Clojure already out there.

HTML to PDF & Printing

Making HTML into PDF properly is normally mission impossible: tables and images get cut in the middle, titles can appear alone at the end of a page with the text starting on the next page, headers and footers are normally non-existent or have some strange stuff from the browser, like the file path to the HTML file… margins normally make no sense and are either too big or too small, etc etc…

At the same time CSS is great for formatting document layouts in a clean, reusable way… It almost seems ideal, but then when you get it onto PDF (or print) it gets all funky…

MultiMarkdown to PDF & Printing

I'm especially interested in this because I am a fan of the so-called lightweight markup languages like Markdown (especially the MultiMarkdown extension), which let me avoid having to go into Microsoft Word.

MultiMarkdown can easily be output as HTML and use CSS for formatting, but then when trying to get it onto PDF (or printing), trouble starts…

You can either print it (or create a PDF) directly from the browser, and although that keeps the CSS formatting, it still has all the troubles described at the beginning of this post… Alternatively, you can transform the MultiMarkdown to LaTeX and then to PDF. The output is nice but the CSS formatting does not work there, and although LaTeX supports formatting, it's complicated, so in practice I always ended up with the same-looking PDF, which ends up being rather boring…

Recently I found a 3rd option.

Prince XML

In a comment on a previous blog post about lightweight markup languages, someone suggested an alternative tool: Prince XML

It's a command line application that takes an HTML file as input (plus a bunch of other options) and outputs a PDF file. Go check some samples here and example web applications using it here.

I found it so much better than the alternatives I've tried that I'm posting it in case someone else is struggling with the same issues.

You can even find a Google talk about it, with Håkon Wium Lie (who originally proposed CSS and is Opera's CTO) and Michael Day (system architect for Prince).

For Textmate users

I've created a simple command so I get my MultiMarkdown directly into PDF using 1 command (it's the original MultiMarkdown command that creates HTML, with a couple more lines added that produce the PDF using Prince). Here's the code:

# Process the MultiMarkdown document into HTML and then PDF.

NAME="${TM_FILEPATH:-untitled}"
BASENAME="${NAME%.*}"

cd "${TM_MULTIMARKDOWN_PATH:-$HOME/Library/Application Support/MultiMarkdown}"
cd bin

# to HTML
./multimarkdown2XHTML.pl > "$BASENAME.html"

# to PDF
prince "$BASENAME.html" -o "$BASENAME.pdf"

# open PDF
open "$BASENAME.pdf"

This assumes you have Prince XML installed on your system and can call it from the command line.