Clojure and Selenium

Motivation

I needed a kind of crawler to go around a list of pages, invoke some javascript and collect that output.

Curl or a regular http lib’s don’t do the trick because i need to run javascript on each requested page. For that i can use Selenium, Selenium is a great framework to perform web testing, that uses directly a browser and thus we can run Javascript.

Selenium can be scripted from Java which matches very well with my wish to learn Clojure :)

Solution

Its not really a crawler in the sense that it does not go around automatically following all the links it finds, it actually gets the list of links to check from the site sitemap.xml.

As some sitemaps.xml are huge, i added also a little pick-a-sample function that randomly selects only a subset from all the sitemap.

Code

Im on the process of learning Clojure, so probably a lot of things could be improved.

For Selenium, we need first to start the server, then the client, and then use the client to browse the pages. As is not very elegant to have a “start server” and “start client” on the top of the script and a “stop client” and “stop server” call at the end of the script, so i’ve wrapped it around a macro (one of the major strengths of Lisp like languages).

The whole thing goes like this:

process-sitemap receives a sitemap, transforms it into a map(with xml-to-zip), collects the url links in it, then picks a sample from them(with pick-a-sample) and calls check-pages with them.

check-pages gets a list of urls. It starts by using the macro, obtains a-browser from it, then iterates over the list of urls, calling check-a-page on each url(a-url). Note that at this point the standard output is redirected to a file, so i can log the results from check-a-page.

check-a-page gets a-browser and a-url, so you can guess what it will do :)It opens that url in the browser, calls the javascript, and prints to standard output the return of the js call.

Hope google does not mind to use their site as an example. But do not run this on Google site, its just an example, use it on your own site!

For this to run you will need to have in your classpath a bunch of jar libs, this is how my lib folder looks like:

         lib/
clojure-contrib.jar
clojure.jar
jline-0.9.94.jar
selenium-java-client-driver.jar
selenium-server.jar


I called this app “coverager”

Main Code: coverager.clj



And of course tests: coverager_test.clj



Take Aways

Clojure is great! Its my opinion that on the Lisp family of languages the code is more elegant and visually cleaner than the C family.

I don’t care much for working directly with the Java language, but working on the JVM with other languages like JRuby, Clojure, and harnessing all the vast amount of Java libs and infrastructure out there is a major major advantage.

I suspect i will be spending more time with Clojure in future :)

1 comment:

Shawn said...

Great job, and thanks for taking the time to write this.