re-post: European Lisp Symposium 2009
João Pavão Martins and Ernesto Morgado will be presenting :) I had them both as teachers in college, together with António Leitão; they’re a very bright bunch. High quality teaching is coming out of that Artificial Intelligence group.
Clojure and Selenium
Motivation
I needed a kind of crawler to go through a list of pages, invoke some JavaScript on each one and collect the output.
Curl or a regular HTTP lib doesn’t do the trick, because I need to run JavaScript on each requested page. For that I can use Selenium. Selenium is a great framework for web testing that drives a real browser directly, and thus we can run JavaScript.
Selenium can be scripted from Java, which matches very well with my wish to learn Clojure :)
Solution
It’s not really a crawler, in the sense that it does not go around automatically following all the links it finds; it actually gets the list of links to check from the site’s sitemap.xml.
As some sitemap.xml files are huge, I also added a little pick-a-sample function that randomly selects only a subset of the sitemap’s urls.
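The sampling idea fits in a few lines. This is a minimal sketch in Ruby (the language of the helper script further down), not the original Clojure; `pick_a_sample`, its `size` parameter and the example.com urls are hypothetical stand-ins.

```ruby
# Hypothetical stand-in for pick-a-sample: returns up to `size`
# distinct urls chosen at random from the sitemap's list.
def pick_a_sample(urls, size)
  urls.sample(size)
end

# Made-up sitemap content, just to have something to sample from.
sitemap_urls = (1..1000).map { |i| "http://example.com/page-#{i}" }
sample = pick_a_sample(sitemap_urls, 10)
puts sample.length
```

If the sitemap has fewer urls than `size`, `Array#sample` simply returns all of them in random order.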
Code
I’m in the process of learning Clojure, so probably a lot of things could be improved.
For Selenium, we first need to start the server, then the client, and then use the client to browse the pages. It is not very elegant to have “start server” and “start client” calls at the top of the script and “stop client” and “stop server” calls at the end, so I’ve wrapped them in a macro (one of the major strengths of Lisp-like languages).
The whole thing goes like this:
process-sitemap receives a sitemap, transforms it into a map (with xml-to-zip), collects the url links in it, then picks a sample from them (with pick-a-sample) and calls check-pages with them.
check-pages gets a list of urls. It starts by using the macro, obtains a-browser from it, then iterates over the list of urls, calling check-a-page on each url (a-url). Note that at this point the standard output is redirected to a file, so I can log the results from check-a-page.
check-a-page gets a-browser and a-url, so you can guess what it will do :) It opens that url in the browser, calls the JavaScript, and prints the return value of the JS call to standard output.
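To make the control flow concrete, here is a rough Ruby analogue of the wrapper plus check-pages and check-a-page. It is only a sketch: Ruby has no macros, so a method taking a block plays that role, and `FakeServer`, `FakeBrowser`, `with_browser`, `eval_js` and `someCheck()` are all hypothetical names, not the Selenium API or the original Clojure code.

```ruby
require 'stringio'

# Hypothetical stand-ins for the Selenium server and client; they only
# record what happens so the control flow can be followed.
class FakeServer
  def initialize(log); @log = log; end
  def start; @log << "server started"; end
  def stop;  @log << "server stopped"; end
end

class FakeBrowser
  def initialize(log); @log = log; end
  def start; @log << "client started"; end
  def stop;  @log << "client stopped"; end
  def open(url); @log << "opened #{url}"; end
  # Pretends to evaluate some JavaScript on the current page.
  def eval_js(script); "result of #{script}"; end
end

# Plays the role of the macro: start server and client, hand the
# browser to the block, and always stop both afterwards.
def with_browser(log)
  server = FakeServer.new(log)
  browser = FakeBrowser.new(log)
  server.start
  browser.start
  yield browser
ensure
  browser.stop if browser
  server.stop if server
end

# Opens one url, runs the JavaScript and writes the result to `out`.
def check_a_page(browser, url, out)
  browser.open(url)
  out.puts "#{url}: #{browser.eval_js('someCheck()')}"
end

# Visits every url with a single browser session.
def check_pages(urls, log, out)
  with_browser(log) do |browser|
    urls.each { |url| check_a_page(browser, url, out) }
  end
end

log = []
out = StringIO.new # stands in for the log file
check_pages(["http://example.com/a", "http://example.com/b"], log, out)
puts log
puts out.string
```

The `ensure` clause is what buys the tidiness: the stop calls run even if a page check raises, and the calling code never mentions start or stop at all.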
I hope Google does not mind me using their site as an example. But do not run this against Google’s site; it’s just an example. Use it on your own site!
For this to run you will need a bunch of jar libs in your classpath. This is what my lib folder looks like:
lib/
  clojure-contrib.jar
  clojure.jar
  jline-0.9.94.jar
  selenium-java-client-driver.jar
  selenium-server.jar
I called this app “coverager”
Main Code: coverager.clj
And of course tests: coverager_test.clj
Take Aways
Clojure is great! In my opinion, code in the Lisp family of languages is more elegant and visually cleaner than code in the C family.
I don’t care much for working directly with the Java language, but working on the JVM with other languages like JRuby, Clojure,
I suspect I will be spending more time with Clojure in the future :)
book: Collective Intelligence in Action
It goes through the whole process: explaining the data structures needed to represent data in your app, how to collect data from the web (blogs, etc.), the theory and implementation of mining algorithms, leveraging open source tools like Weka and Lucene, up to how Amazon, Google News and Netflix do their data mining (to some extent).
Clojure Reference
And the Clojure page from the same author is here.
re-post: Almost Viral
- Almost viral can be safer than full-on viral (especially if your product is new).
- The real cost per acquisition should also account for the viral coefficient. And the growth of acquisition is a geometric series.
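One common way to put numbers on that last point, sketched in Ruby. The model is an assumption of mine, not something spelled out in the re-posted article: every user brings in `k` further users on average, with `k` below 1 (the "almost viral" case), so one paid acquisition eventually yields 1 + k + k² + ... = 1 / (1 - k) users. The function names are hypothetical.

```ruby
# Total users that eventually result from one paid acquisition when each
# user brings in `k` further users on average (assumes 0 <= k < 1).
def users_per_paid_acquisition(k)
  raise ArgumentError, "k must be in [0, 1)" unless k >= 0 && k < 1
  1.0 / (1.0 - k)
end

# Real cost per user once the viral follow-on users are counted.
def effective_cost(paid_cost, k)
  paid_cost / users_per_paid_acquisition(k)
end

puts users_per_paid_acquisition(0.5) # prints 2.0
puts effective_cost(10.0, 0.5)       # prints 5.0
```

With a viral coefficient of 0.5, every paid user ends up being two users, so a nominal cost of 10 per acquisition is really 5 per user.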
book: The 4-Hour Workweek
A favorite of mine.
Some discussed topics and quotes:
- 80/20 Pareto's rule.
- Parkinson's Law.
- The need for an "Information Diet" and the excess of information.
- Spending a lot of time on something unimportant does not make it important.
- Lack of time is actually lack of priorities.
- Automations to free you from work.
- Virtual Assistants to free you from work.
- People don’t want to be millionaires. They want to experience what they believe only millions can buy.
- Less is not laziness. Focus on being productive instead of busy.
- Slow down and remember this: Most things make no difference.
Starting up Clojure simple tips
Here are a couple of things that I use while learning Clojure. I’m on a Mac and using Textmate.
Screencasts & Tutorial
Here is a series of 10 screencasts that I found to be quite useful: Intro to Clojure (youtube playlist)
There are also Rich Hickey’s screencasts at http://clojure.blip.tv/, but they are longer, so start with the ones above.
For reading material, take a look at Mark Volkmann’s very complete Clojure notes here, and his Clojure references here.
Folder Structure for code
bin/
  repl.sh
  run.sh
lib/
  clojure-contrib.jar
  clojure.jar
  jline-0.9.94.jar
  *.jar
README
src/
  *.clj
test/
  *.clj
Assuming this structure, it is possible to do some automation; see the next section.
Textmate Builds run.sh and repl.sh
To address the issue of getting the jars you want to use onto the classpath, here’s a little Ruby script that I trigger from Textmate. It looks into the lib/ dir and creates the repl.sh and run.sh scripts:
#!/usr/bin/env ruby -wKU
# Directory of the file currently open in Textmate (src/ or test/).
project_dir = File.dirname(ENV['TM_FILEPATH'])
# Every jar in lib/ goes on the classpath.
libs_path = Dir[project_dir + "/../lib/*.jar"]
classpath = ".:#{libs_path.join(':')}"
# run.sh runs the current file as a script.
File.open(project_dir + "/../bin/run.sh", "w+") do |fil|
  fil.puts "java -cp #{classpath} clojure.lang.Script #{ENV['TM_FILEPATH']}"
end
# repl.sh starts a REPL through jline.
File.open(project_dir + "/../bin/repl.sh", "w+") do |fil|
  fil.puts "java -cp #{classpath} jline.ConsoleRunner clojure.lang.Repl #{ENV['TM_FILEPATH']}"
end
So if you want to use some new jars, just place them inside lib/ and trigger this Ruby code, which will re-generate the run.sh and repl.sh scripts.
Existing code
Another thing that I keep close by is existing Clojure code examples. I also use Textmate for this, so that they are just a keyboard shortcut away (these are 3 different commands):
git fetch
mate '/somewhere-in-your-disk/programming-clojure/'
svn up
mate '/somewhere-in-your-disk/clojure-contrib/'
svn up
mate '/somewhere-in-your-disk/svn_clojure/trunk'
You can get this code onto your machine from:
Clojure core
http://code.google.com/p/clojure/
svn checkout http://clojure.googlecode.com/svn/trunk/ clojure-read-only
Clojure contrib
http://code.google.com/p/clojure-contrib/
svn checkout http://clojure-contrib.googlecode.com/svn/trunk/ clojure-contrib-read-only
Programming Clojure book:
http://github.com/stuarthalloway/programming-clojure/tree/master
API
Another handy keyboard shortcut opens the Clojure API web page, also directly from Textmate. Set up a new command with the code:
open http://clojure.org/api
Google it
A lot of people are starting to explore Clojure, so google it and you’ll find a lot of posts and tips on Clojure already out there.
HTML to PDF & Printing
At the same time, CSS is great for formatting document layouts in a clean, reusable way… It almost seems ideal, but then when you get it onto PDF (or print) it gets all funky…
MultiMarkdown to PDF & Printing
I’m especially interested in this because I am a fan of the so-called lightweight markup languages like Markdown (especially the MultiMarkdown extension), which let me avoid having to go into Microsoft Word.
MultiMarkdown can easily be output as HTML and use CSS for formatting, but then when trying to get it onto PDF (or printing), trouble starts…
You can either print it (or create a PDF) directly from the browser, which does keep the CSS formatting but still has all the troubles described at the beginning of this post… Alternatively, you can transform the MultiMarkdown to LaTeX and then to PDF. The output is nice, but the CSS formatting does not work there, and although LaTeX supports formatting, it’s complicated, so in practice I always ended up with the same-looking PDF, which gets rather boring…
Recently I found a third option.
Prince XML
In a comment on a previous blog post about lightweight markup languages, someone suggested an alternative tool: Prince XML
It’s a command line application that takes an HTML file as input (plus a bunch of other options) and outputs a PDF file. Go check some samples here and example web applications using it here.
I found it so much better than the alternatives I’ve tried that I’m posting it in case someone else is wrestling with the same issues.
You can even find a Google talk about it, with Håkon Wium Lie (who originally proposed CSS and is Opera’s CTO) and Michael Day (system architect for Prince).
For Textmate users
I’ve created a simple command, so I can get my MultiMarkdown directly onto PDF in one step. It’s the original MultiMarkdown command that creates HTML, with a couple more lines added that produce the PDF using Prince. Here’s the code:
# Process the MultiMarkdown document into HTML and then PDF.
NAME="${TM_FILEPATH:-untitled}"
BASENAME="${NAME%.*}"
cd "${TM_MULTIMARKDOWN_PATH:-$HOME/Library/Application Support/MultiMarkdown}"
cd bin
# to HTML
./multimarkdown2XHTML.pl > "$BASENAME.html"
# to PDF
prince "$BASENAME.html" -o "$BASENAME.pdf"
# open PDF
open "$BASENAME.pdf"
This assumes you have Prince XML installed on your system, and that you can call it from the command line.
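Outside Textmate, the same HTML to PDF step could be scripted along these lines. This is a small Ruby sketch, with `prince_command` and the /tmp/notes.html path being hypothetical; it only builds the `prince input.html -o output.pdf` call used in the command above and does not run it.

```ruby
# Builds the argument list for converting an HTML file to PDF with
# Prince, mirroring the `prince input.html -o output.pdf` call above.
def prince_command(html_path)
  pdf_path = html_path.sub(/\.html?\z/, "") + ".pdf"
  ["prince", html_path, "-o", pdf_path]
end

cmd = prince_command("/tmp/notes.html")
puts cmd.join(" ")
# To actually run it (requires Prince on the PATH): system(*cmd)
```

Passing the arguments as a list to `system` avoids shell quoting problems with file names that contain spaces.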