The Meta-Algorithm

Little background intro

Bayesian filters, decision trees, neural networks, genetic algorithms and support vector machines are algorithms that all work the same way: they read data (which comes in 2 parts, input and result) and figure out a formula, f(x). Given the input, this formula can calculate the result. See figure 1.

Figure 1

These algorithms originate from different fields, such as artificial intelligence, math, databases and general computer science, and some are even influenced by biology (genetic algorithms) or by how the brain works (neural networks). New ideas and progress show up from time to time, which means there is still room to grow and explore.

Example Applications:

The filters on your email. Suppose that one day you get an email in your inbox offering to sell you some "magic pills" (input part), and you mark it as spam (result part). The filter learns from this (builds the formula), so the next time an email in the same style arrives in your inbox, the filter is able to identify it as spam.

Another common use is fraud detection: these algorithms look at the details (input part) of past frauds (result part), learn from them (build the formula), and can then help predict fraud.

A final example application is helping to avoid churn (customers leaving your service). It works by looking at the characteristics (input part) of the customers that already churned (result part) and figuring out a formula that identifies users with a higher probability of churning. You can then run this formula over your current customer base to spot the ones that might churn soon, and act to try to prevent it.
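To make the "read data, build a formula" idea concrete, here is a minimal sketch of the email example. The `learn` method, the word-counting rule and the sample emails are all assumptions made up for illustration; real spam filters are far more careful than this.

```ruby
# Learns a "formula" from examples: each example is [input_text, result_label].
# Returns a lambda f(x) that predicts the label for a new input.
def learn(examples)
  # Count how often each word appears under each label.
  counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
  examples.each do |text, label|
    text.downcase.scan(/[a-z]+/).each { |word| counts[label][word] += 1 }
  end

  # The formula: pick the label whose known words overlap most with the input.
  lambda do |text|
    words = text.downcase.scan(/[a-z]+/)
    counts.keys.max_by { |label| words.sum { |word| counts[label][word] } }
  end
end

examples = [
  ["buy magic pills now", :spam],
  ["cheap magic pills offer", :spam],
  ["meeting notes for monday", :ham],
  ["lunch on monday?", :ham]
]

f = learn(examples)
puts f.call("magic pills for you")     # prints "spam"
puts f.call("notes from the meeting")  # prints "ham"
```

Note that `learn` is the algorithm and the lambda it returns is the formula: the data goes in once, and the formula can then be applied to any new input.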


The meta-algorithm

Now notice that all these algorithms output a formula, f(x), which is essentially also an algorithm: both are a series of ordered instructions to perform a task. So, extending this idea, why can't we have a meta-algorithm that outputs the algorithm?

Note that the job of a current algorithm is to figure out the formula that, given the input, outputs the result. Using the same reasoning, the job of the meta-algorithm is to figure out an algorithm that, given the input and result, outputs the formula f(x). See figure 2.
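One very modest way to read this idea is as a procedure that chooses between existing algorithms. The sketch below is only that: `meta_algorithm`, the two candidate learners and the toy data are assumptions for illustration, and picking from a fixed list is much weaker than truly inventing a new algorithm.

```ruby
# Two candidate algorithms. Each takes [input, result] pairs and returns a formula.
learn_mean = lambda do |data|
  mean = data.sum { |_, y| y } / data.size.to_f
  ->(_x) { mean }                      # formula that ignores the input
end

learn_line = lambda do |data|
  n  = data.size.to_f
  mx = data.sum { |x, _| x } / n
  my = data.sum { |_, y| y } / n
  slope = data.sum { |x, y| (x - mx) * (y - my) } / data.sum { |x, _| (x - mx)**2 }
  ->(x) { my + slope * (x - mx) }      # least-squares straight line
end

# The "meta-algorithm": tries each algorithm on training data and returns the
# [name, algorithm] pair whose formula has the smallest squared error on
# data it has not seen.
def meta_algorithm(algorithms, train, holdout)
  algorithms.min_by do |_name, learn|
    f = learn.call(train)
    holdout.sum { |x, y| (f.call(x) - y)**2 }
  end
end

data = (1..10).map { |x| [x, 2 * x + 1] }
train, holdout = data.partition { |x, _| x.odd? }

algorithms = { "mean" => learn_mean, "line" => learn_line }
name, learn = meta_algorithm(algorithms, train, holdout)
puts name                        # prints "line"
puts learn.call(data).call(20)   # prints 41.0
```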

The Story of Stuff



"Externalized costs of production: you don't pay the real value of the thing..."

"Loads of waste..."

Estonian Wildlife Live Stream

See here and look for links that say striim.

I know of these 2:
  • Wild Pig: mms://tv.eenet.ee/siga
  • Eagles: mms://tv.eenet.ee/kotkas
Maybe cameras won't stay for long... but see these pictures my Dad made just yesterday:

AST, Javascript and the logging framework

I was recently looking into Ruby's abstract syntax tree (AST) capabilities, after seeing this. The AST is a tree representation of the structure of source code, normally created by the compiler/interpreter as an intermediate step in executing your code. Having access to it allows you to do some heavy transformations that the language normally does not allow. So much power comes at a price: the language's safety sandboxes go away, so handle with care.

While looking at this I found that Javascript, the super widespread browser language, also has a couple of capabilities in this field, directly in the core language.

Specifically:
- toString() - returns a string with the source of a function (except for the ones implemented natively in the browser)
- eval() - executes the code passed as a string.

Try in the Firebug console:


var one = function()
{
/* some logic */
return 1;
}

/* the function code */
console.log(one.toString());
// > "function () { return 1; }" (newer engines print the source exactly as written, comment included)

/* get the code of method and update it, adding a log message to top */
eval("var one = "+ one.toString().replace("{", "{ console.log(\"method one called\");"));

one();
// > method one called



See that the method "one" got redefined by the eval, and we added a "console.log()" for debugging purposes. This can be useful for adding debugging code without having to edit the original code.


Extending this idea, imagine a logging framework where, instead of your code explicitly calling logging methods and ending up sprinkled with alien logging code as is usual, the framework knows how to attach itself to your code and does the logging by itself.

And this would be useful for all languages, not only for Javascript.
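As a taste of what that could look like outside Javascript, here is a minimal Ruby sketch. It does not rewrite source text the way the eval trick does; it wraps methods with `Module#prepend` instead. The `AutoLog` module, its `attach` method and the `Calculator` class are hypothetical names made up for this example.

```ruby
# Wraps the named instance methods of a class with logging, without editing
# the class itself. AutoLog and its arguments are hypothetical.
module AutoLog
  def self.attach(klass, *method_names, out: $stdout)
    wrapper = Module.new do
      method_names.each do |name|
        define_method(name) do |*args, &block|
          out.puts "method #{name} called with #{args.inspect}"
          super(*args, &block)   # call the original method
        end
      end
    end
    klass.prepend(wrapper)
  end
end

class Calculator
  def one
    1
  end

  def add(a, b)
    a + b
  end
end

AutoLog.attach(Calculator, :one, :add)

calc = Calculator.new
calc.one        # logs: method one called with []
calc.add(2, 3)  # logs: method add called with [2, 3]
```

The original class stays free of logging calls; the logging lives in a separate module that can be attached (or left out) from the outside.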

Sending emails with Ruby using Gmail

I was looking for this some time ago, so I'm posting it here in case someone else finds it useful too.
I normally use it as an alert sender for automation scripts.


Works naturally with regular Ruby (1.8), and apparently also with JRuby.

Depends on the openssl gem. See the end of the code for the command-line commands to install it.

Code:

JRuby, Rails and Clojure

Found this funny comment at the end of a tutorial for JRuby + Rails + Clojure:

"If you’ve made it this far, congratulations! You’ve now combined an hugely practical webframework(RAILS) with a incredibly powerful language(LISP) running in the same enterprise-proof super-fast VM (JVM). The world lies at your feet."

The fun world of the Abstract Syntax Tree

Another nice post from igvita.com, this time on playing with the ruby AST tools.

Ever wanted to build a Ruby to Lolcode translator? (example)

re-Painting Mona Lisa

genetic algorithms re-painting pictures

The goal is to get an image represented as a collection of overlapping polygons of various colors and transparencies.
We start from random 50 polygons that are invisible. In each optimization step we randomly modify one parameter (like color components or polygon vertices) and check whether such new variant looks more like the original image. If it is, we keep it, and continue to mutate this one instead.

Fitness is a sum of pixel-by-pixel differences from the original image. Lower number is better.
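The mutate, compare, keep-if-better loop described above can be sketched on something much simpler than polygons. In this sketch a short list of numbers stands in for the image; the target values, the mutation size and the number of steps are arbitrary assumptions.

```ruby
# Hill climbing: mutate one parameter at a time and keep the change only if
# the fitness (sum of differences from the target) gets lower.
rng     = Random.new(42)                  # fixed seed so runs are repeatable
target  = [200, 30, 120, 75, 255, 0]      # stand-in for the original image's pixels
current = Array.new(target.size, 0)       # start from a blank candidate

fitness = ->(candidate) { candidate.zip(target).sum { |a, b| (a - b).abs } }

best = fitness.call(current)
5000.times do
  variant = current.dup
  index = rng.rand(variant.size)
  variant[index] = (variant[index] + rng.rand(-20..20)).clamp(0, 255)

  score = fitness.call(variant)
  if score < best                         # lower is better
    current = variant
    best = score
  end
end

puts "final fitness: #{best}"
```

A variant that does not improve the fitness is simply thrown away, which is why the fitness can only go down over time.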

Every check has a cost

Talks about the price of overdoing procedures and processes in big companies.

Paul Graham: The Other Half of "Artists Ship"

Gotta Keep It Simple!

Textmate

10 Productivity Tips

little photo organizer

Place this script in a folder full of pictures, run it, and it will organize the pictures into folders by the day they were taken.

Run it as: ruby photo_organizer.rb

Code:

# photo_organizer.rb

require 'fileutils'

# Folder name for a file, based on its last-modification date (dd-mm-yyyy).
# For untouched photos this is normally the day the picture was taken.
def file_short_date(file)
  File.mtime(file).strftime("%d-%m-%Y")
end

# True for regular files with a .jpg or .png extension (case-insensitive).
def image?(file)
  File.file?(file) && %w[.JPG .PNG].include?(File.extname(file).upcase)
end

print "Moving pics "
Dir.foreach(".") do |file|
  next unless image?(file)

  dir = file_short_date(file)
  FileUtils.mkdir_p(dir)                      # create the day folder if needed
  FileUtils.mv(file, File.join(dir, file))    # move the picture into it
  print "."
end
puts

Doing Javascript

with Ruby...

me likes the idea... With a higher-level UI framework on top, this could be a new solution for platform-independent Ruby UIs.

book: Freakonomics


A book I just happened to run into last weekend at the bookstore, and what a great book it is!

It's all about answering questions like:

Which is more dangerous, a gun or a swimming pool? What do schoolteachers and sumo wrestlers have in common? Why do drug dealers still live with their moms? How much do parents really matter? How did the legalization of abortion affect the rate of violent crime?

The results of his findings are quite interesting, and more about society than the regular boring economic stuff like stocks, inflation, etc., which makes it quite entertaining.

And it does it the right way. For example, at one point he says that schoolteachers cheat on national exams, but instead of just stating it, he actually proves it in a very data-mining style: he analyzes the exam results and looks for patterns in the data that are clearly not normal or expected.

worth checking out.

official site: http://www.freakonomicsbook.com/

Google videos: Be Your Own Therapist



couple quotes from Ven. Robina, that i liked:

You think happiness is what you get, when you get what you want, but actually unhappiness happens because of the neurosis of the wanting, thus happiness is when you change your mind on the neurosis of the wanting. Is that simple!


You're the boss. You are your own boss!! You are able to make up your own mind.


I hate begging, can't stand the idea of standing there asking for 2 dollars... I believe instead in entrepreneurship, where i would offer a coffee and nice piece of pie for 7 dollars, thus giving you something good back, and still be able to get my 2 dollars...

Statistics Part 3, the Election day

This is part 3 of a brief introduction to statistics; see here for part 2 and here for part 1.



Have you ever watched the counting of the votes on TV on election day, where they keep showing the status of the vote count all day long? Well, in Portugal they do a full day of this, and yes, it is very boring... But one of the things that amazed me was how they could know who was going to win before all the votes were counted... At some point they would say: we have counted, for example, 75% of all votes, and we can say that president Y is going to win, and the final results will be between A and B! How amazing is that? How can that be? How can they know what the people whose votes are not yet counted have voted for?
Are they working with some magic crystal ball that shows the future? And if they have this magic crystal ball, why don't they just guess the lottery and quit working for these boring voting shows?... Very intriguing stuff...


Normal distribution

When mathematicians started to look at how to do statistics, one thing they noticed is that most data, when displayed in a histogram (where x = the value itself, y = the number of times each value occurs in the sample data), often looks like a bell-shaped graph:



This distribution (the way values get distributed on the graph) is called the normal distribution. And because it happens so frequently, it is a very useful (I'd even say essential) notion to be aware of.

So if we know that the data we are looking at has a normal distribution, then we know that it possesses some known properties, and those properties let us extract reliable statistics from it.

Note that the normal distribution is not the only one that exists, but it is by far the most important one: http://en.wikipedia.org/wiki/Normal_distribution.
There are also techniques to convert non-normal distributions into normal ones, to some extent.

So, in the election vote count story, what they do is this: by looking at previous elections, they have seen that the voting follows a normal distribution, so they assume that the voting this year will also follow the normal distribution. Then, even with only a partial count of the votes, they can predict what is going to happen within some reliability limits. Clever stuff, no?!

And guess what, voting is not the only thing you can apply this to! You can actually try to apply this for some more interesting things :)


So, how do I use it ?

Let's say you have collected sample data from an experiment, and you need to figure out from that sample what the results are for the overall population.

1st: Make sure that the data follows a normal distribution.

This applies even to simple things like the Muffins story, where we did a simple average calculation, but because the data didn't follow a normal distribution, the average was not accurate enough!

A commonly used technique is the normal QQ plot, which plots your data against the expected normal distribution.

So let's take "precip", a default data set in R that holds the "Precipitation [in/yr] for 70 US cities", and plot it.
> qqnorm(precip, ylab = "Precipitation [in/yr] for 70 US cities")
> qqline(precip)

qqnorm generates the plot of the values, and qqline draws a reference line through the first and third quartiles of the values.

The better the values (the dots) fit that line, the closer the data is to the normal distribution.

2nd: Calculate the confidence intervals.

If the data we have follows the normal distribution, then we know it looks like this:




Thus we can calculate an interval that says the value will be between X and Y with 68% (34.1 + 34.1) confidence or, for example, 95% confidence.

Let's see an example using Student's t-test. There are other ways to do it, but Student's t-test is a good way to go because it is slightly more tolerant of experimental data, which always contains some measurement errors.

Let's use the "precip" data set again, "Precipitation [in/yr] for 70 US cities"
> t.test(precip)

and among other information we can see:

95 percent confidence interval:
31.61748 38.15395

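This means we can be 95% confident that the true average precipitation (of which these 70 US cities are a sample) lies between 31.61748 and 38.15395 in/yr. Note that the interval is about the average, not about each individual city.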

You see, we can guess the future to some extent :)


Back to muffins story:

Thus, given the data:

{ monday:2, tuesday:3, wednesday:1, thursday:2, friday:68, saturday:1, sunday:0 }

1. Does it follow a normal distribution?

> muffins = c(2,3,1,2,68,1,0)
> qqnorm(muffins, ylab = "Muffins sold per day")
> qqline(muffins)




There's a nice-looking line including most of the points except one! That one is not normal (that is: not normally distributed).
Plotting this graph is also very useful to spot those strange values.

2. Let's calculate the confidence intervals for that distribution:

2.1 WITH the un-normal point

> t.test(muffins)
95 percent confidence interval:
-12.26252 34.26252

Say what? Negative even... This does not make sense... This happens because of that un-normal point. We cannot use this calculation on data that is not normally distributed!

Let's try now without that point.

2.2 WITHOUT the un-normal point.

In statistics language the un-normal point would be called an outlier, and one common trick is to just eliminate it from the data: because it is too far out to be valid, it is most likely a measurement error or an un-normal event.

So now we get:

> muffins_clean = c(2,3,1,2,1,0)
> t.test(muffins_clean)
95 percent confidence interval:
0.3993426 2.6006574

Ok, you see, this makes more sense. It says we can be 95% confident that the average number of muffins sold per day is between 0.3993426 and 2.6006574. Rounding outwards, that means between 0 and 3.
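For the curious, the interval that t.test reports can be reproduced by hand as mean ± t × (standard deviation / √n). The Ruby sketch below does that for the cleaned muffin data; the critical value of Student's t for 5 degrees of freedom is hard-coded from a t table (it is what R gives for qt(0.975, df = 5)), and the method name is just an illustrative choice.

```ruby
# Reproduces the 95% confidence interval that t.test reports for muffins_clean.
# t_critical is the 97.5% quantile of Student's t with 5 degrees of freedom,
# hard-coded from a t table rather than computed.
t_critical = 2.570582

def confidence_interval(values, t_critical)
  n    = values.size
  mean = values.sum / n.to_f
  variance = values.sum { |v| (v - mean)**2 } / (n - 1)   # sample variance
  margin = t_critical * Math.sqrt(variance / n)            # t * standard error
  [mean - margin, mean + margin]
end

low, high = confidence_interval([2, 3, 1, 2, 1, 0], t_critical)
puts format("%.4f %.4f", low, high)   # prints "0.3993 2.6007"
```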

If we want a narrower interval (at the price of a lower confidence level), we can do:
> t.test(muffins_clean, conf.level = 0.75)
75 percent confidence interval:
0.9429669 2.0570331

Thus, with 75% confidence, the average daily sales are between about 1 and 2 muffins. Which makes a lot more sense than the originally estimated 11!!


A rule of thumb is: the more accurate we want to be the more data we should include in the test.


This is part 3 of a 3 parts brief introduction to statistics. Here for part 2 and here for part 1

Statistics Part 2, the basic measures

Basic Measures

This is the 2nd of a 3 parts brief introduction to statistics. Here for part 1 and here for part 3.


Wikipedia definition for statistics: Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.

And it's the absolute basis for data mining! So it's a required subject for aspiring data miners.

A common situation is that you have a bunch of numbers that you want to summarize: understand which numbers appear most frequently (descriptive statistics), what the central value (average) is, and maybe also use them to predict what will happen in the future (inferential statistics)...

These numbers are called sample data from an experiment. And it could be any number of things:
  • the amount of muffins sold in a bakery,
  • the number of times people click on "buy now" button from your online bakery,
  • the number of purple cars on Sundays,
  • the number of drunk people at 1 am versus at 5 am,
  • how many hours your cat sleeps per day,
  • how many shots on target does a basketball player achieve.

Some of these, if we create a statistic from them, can prove to be useful. For example, suppose you can only buy eggs at the market on the weekend, and you are wondering how many eggs are needed for cooking muffins for 1 full week. If you order too many eggs, they will go bad and you will have to throw them away; if you order too few, you won't have enough for the whole week.
So, if you create a statistic of how many muffins you sold last week, you will see how many eggs you need to buy.

Using R software.
We will go through some methods used for analyzing numbers (making statistics) using R software, which was built exactly for this kind of thing!


Basic Statistic concepts
There are a few basic statistical concepts to be aware of.

Measures of central tendency
The best way to reduce a set of data and still retain part of the information, is to summarize the set with a single value.

mean
answers: what is the average of all the numbers?
> mean(c(10,11,12,9,13,11,11))
11

median
answers: (if we put all the numbers in order) what is the value in the middle?
> median(c(10,11,12,9,13,11,11))
11

mode
answers: what is the value that appears most frequently?
make a histogram, for example:
> hist(c(10,11,12,9,13,11,11))
look at the graph and you can see that the tallest bar is the one containing 11, meaning 11 shows up most frequently.
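If you don't have R at hand, the three measures are easy to compute by hand. A minimal Ruby sketch on the same numbers (note that `tally` needs Ruby 2.7 or newer, and the variable names are just illustrative):

```ruby
data = [10, 11, 12, 9, 13, 11, 11]

# mean: sum of the values divided by how many there are
mean = data.sum / data.size.to_f

# median: the middle value once the data is sorted
# (or the average of the two middle values when the count is even)
sorted = data.sort
mid    = sorted.size / 2
median = sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0

# mode: the value with the highest count
mode = data.tally.max_by { |_value, count| count }.first

puts mean    # prints 11.0
puts median  # prints 11
puts mode    # prints 11
```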



Measures of spread
While the measures of central tendency attempt to identify the most representative value in a set of data, a description of the data is not complete until its spread (variability) is also known. In fact, the basic numerical description of a data set requires measures of both centre and spread.

Range
answers: how broad is the interval between the min and max values in the data?
> range(c(10,11,12,9,13,11,11))
9 13

box plots
answers: what does the spread of the data look like? (a visual representation)
> boxplot(c(10,11,12,9,13,11,11))



variance
answers: how far from the average are the values, on average (measured in squared units)?
> var(c(10,11,12,9,13,11,11))
1.66667

standard deviation
answers: how far from the average are the values? It is the square root of the variance, so it is in the same units as the data and more easily interpreted.
> sd(c(10,11,12,9,13,11,11))
1.29
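The same spread measures, computed by hand in a short Ruby sketch. It uses the sample variance (dividing by n - 1), which is what R's var and sd do; the variable names are illustrative.

```ruby
data = [10, 11, 12, 9, 13, 11, 11]

range = data.minmax                                        # smallest and largest value
mean  = data.sum / data.size.to_f
variance = data.sum { |v| (v - mean)**2 } / (data.size - 1)  # sample variance
sd = Math.sqrt(variance)                                   # standard deviation

puts range.inspect        # prints [9, 13]
puts variance.round(5)    # prints 1.66667
puts sd.round(2)          # prints 1.29
```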

Measures of association
measures how strongly 2 sets of data are associated.

Covariance,
answers: how much two variables change together.
> cov(1:10, 5:14)
9.1667

Correlation, between -1 and 1
answers: how strongly two variables change together, like the covariance, but scaled to be more easily interpreted.
> cor(1:10, 5:14)
1
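Both measures can also be written out by hand, which shows that the correlation is just the covariance divided by the two standard deviations. A minimal Ruby sketch with illustrative method names, using the same n - 1 convention as R:

```ruby
def mean(values)
  values.sum / values.size.to_f
end

# Sample covariance: average product of the deviations from each mean.
def covariance(xs, ys)
  mx = mean(xs)
  my = mean(ys)
  xs.zip(ys).sum { |x, y| (x - mx) * (y - my) } / (xs.size - 1)
end

# Correlation: covariance scaled by both standard deviations.
def correlation(xs, ys)
  covariance(xs, ys) / Math.sqrt(covariance(xs, xs) * covariance(ys, ys))
end

xs = (1..10).to_a
ys = (5..14).to_a
puts covariance(xs, ys).round(4)            # prints 9.1667
puts correlation(xs, ys).round(4)           # prints 1.0
puts correlation(xs, ys.reverse).round(4)   # prints -1.0
```

The last line shows the negative end of the scale: when one variable goes up exactly as the other goes down, the correlation is -1.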

Correlation values are quickly grasped by looking at following graph:


This is the 2nd of a 3 parts brief introduction to statistics. Here for part 1 and here for part 3.

Once upon a time...

Another audio video experiment, now with voice.

Statistics Part 1, the Muffins Story

What are statistics good for?




Imagine you own a bakery and you have to figure out how many chocolate muffins you should cook per day.

So you go and talk to the chef:

You: How many muffins did we sell last week?
Chef: 77
You: Ok.
You(think): so if we sold 77 muffins in a week, I just need to calculate the average for each day
and I will know how many muffins we have to cook! So 77 in 7 days (77/7=11) is 11 per day!
You: please cook 11 per day.
Chef: So many? Are you sure?
You(think): Well, why not? I calculated the average per day, everybody knows that the average is the
correct way to go, what could be wrong??
You: Yes I'm sure! Cook 11!
... FAIL...


What could be wrong?

Example 1

If the amount of muffins sold per day was like this:

{ monday:10, tuesday:11, wednesday:12, thursday:9, friday:13, saturday:11, sunday:11 }

Then if you do an average:

(10+11+12+9+13+11+11)/7=11

You get 11 per day. And by looking at the daily results it is reasonable to expect that you will sell around 11 per day in future. All cool.

Example 2

But what really happened last week was that on Friday there was a school visit to the bakery, so the real amounts of muffins sold per day over the whole week were:

{ monday:2, tuesday:3, wednesday:1, thursday:2, friday:68, saturday:1, sunday:0 }

Thus, doing the average:

(2+ 3+ 1+ 2+ 68+ 1+ 0)/7=11

The average is 11 per day, so should the chef cook 11 per day? No! You can clearly see that baking 11 will be too much, even though that's what the average says!! So sometimes a simple average does not tell the whole story, and that's what statistics are for: to avoid these mistakes and to be able to calculate reliable predictions.
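A quick way to see the difference between the two weeks is to put another summary next to the average. The Ruby sketch below uses the median (the middle value of the sorted data) and the maximum, which are my choice of extra measures for illustration; the method names are also just illustrative.

```ruby
week1 = { monday: 10, tuesday: 11, wednesday: 12, thursday: 9, friday: 13, saturday: 11, sunday: 11 }
week2 = { monday: 2, tuesday: 3, wednesday: 1, thursday: 2, friday: 68, saturday: 1, sunday: 0 }

def mean(values)
  values.sum / values.size.to_f
end

# Middle value of the sorted data; assumes an odd number of values (7 days here).
def median(values)
  values.sort[values.size / 2]
end

[week1, week2].each do |week|
  sales = week.values
  puts "mean: #{mean(sales)}, median: #{median(sales)}, max: #{sales.max}"
end
# prints:
# mean: 11.0, median: 11, max: 13
# mean: 11.0, median: 2, max: 68
```

Same average, very different weeks: in the second one a single day is responsible for almost all the sales.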

Hope that gives some motivation for why learning statistics is useful.

This is part 1 of 3, part 2 is here and part 3 here.

Web Analytics Talk

A good talk about web analytics, from Google Tech Talks.



Liked that Matt started with: "There is no accuracy with Web Analytics Tools!". Web analytics tools depend on too many variables that can easily fail for them to be taken as accurate. Attention: this does not mean they are useless, far from it; it just means that we always have to look at trends and not at the absolute values!


Another highlight is the "Analytics According to Captain Kirk", where he does some analysis to find patterns in the deaths in the Star Trek TV show, and comes up with some interesting revelations:
  • Crew members with a red shirt have higher probability to die.
  • Most of deaths occur when beaming down with Kirk
  • When Kirk meets an alien babe the death rate drops significantly.
And while this is just a funny example, I think it captures well the essence of what analytics/data mining is all about!!

Art in Bronze

"a.sáxeo sculptures are a tribute to the human intelligence that in the third millenium B.C. combined bee wax, mud, copper, tin and fire to express emotions in a magical art form."



a.sáxeo's site

Ebay Stocks Price Alert

Tired of checking stock price value every day? If you are waiting for a stock price to rise to a certain value to sell it, or waiting for the price to drop to a certain value to buy it, here you go, ruby scripting to the rescue.

This script will send you an email when the stock price moves outside certain thresholds.

Set your own thresholds by changing the "low" & "high" values.
Set your preferred stock symbol by changing "EBAY" to "GOOG", for example. It is easily extendable to support several stock symbols; I can post a second script if there's enough demand.

To set it up to run every day, use Windows Task Scheduler, or cron on Linux and Mac machines.

It does require the Ruby mailfactory gem (use "gem install mailfactory" to get it on your machine), and you also need to add your email server configuration.


require 'net/http'
require 'net/smtp'
require 'mailfactory'

# Fetches the current price of a stock symbol by scraping the quote page.
def get_stock_quote(symbol)
  host = "finance.google.com"
  link = "/finance?q=" + symbol.upcase

  # Create a new HTTP connection
  http_con = Net::HTTP.new(host, 80)

  # Perform a GET request
  resp = http_con.get(link)

  # The price is the text of the first element marked class="pr"
  value = resp.body.scan(/class="pr"[^>]*([^<]*)/).flatten.first.to_s.gsub(">", "").to_f

  puts " current #{symbol} stock price is #{value} (from finance.google.com)"
  value
end

# Sends the alert email through your own SMTP server.
def send_stock_alert(subject = "Ebay Stock Alert", message = "Value Alert")
  mail = MailFactory.new
  mail.to = 'myself@gmail.com'
  mail.from = "ebaystock@alerts.com"
  mail.subject = subject
  mail.html = message

  Net::SMTP.start('yourmailserver') do |smtp|
    smtp.send_message(mail.to_s, mail.from, mail.to)
  end
rescue StandardError => e
  puts "Error sending mail"
  raise e
end

low = 23.0
high = 27.0
name = "EBAY"
price = get_stock_quote(name)

# Alert only when the price is outside the [low, high] band
send_stock_alert("#{name} Stock Alert: #{price}") unless price.between?(low, high)

Turn Your Point-And-Shoot into a Super Camera

For those who have compact Canon cameras, there's a quite awesome thing here.

Check also this link for more detailed usage.

Ebay Find

A small Ruby script that searches eBay. It is capable of:
  • introducing some misspellings in order to find those misspelled items that might be a great bargain (because nobody else is able to find them)
  • searching more than one eBay store.
It prints the search results to the console. Of course, because it is in Ruby, it is easy to modify and play around with :)

Pulled together quick and dirty, but posting here in case someone else finds it useful.

Use like this:
for multiple keyword search:
ruby ebayfind.rb "Ibanez guitar"

for single keyword search:
ruby ebayfind.rb Ibanez

Code:


require 'net/http'
require 'rexml/document'
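require 'active_support'
# Hash.from_xml is defined in ActiveSupport's hash conversions
require 'active_support/core_ext/hash/conversions'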


class Misspell

 def self.a(a_text="word", options=["skip_letter", "double_letters",
   "reverse_letters","skip_spaces","missed_key","inserted_key" ])

   # Sets up the params
   params = Hash.new
   params["user_input"] = a_text
   options.each do |opt|
     params[opt] = opt
   end

   # Executes the call
   res = Net::HTTP.post_form(URI.parse('http://tools.seobook.com/spelling/keywords-typos.cgi'), params)

   # Cleans and formats the results (join instead of to_s, so it also works on Ruby 1.9+)
   res.body.gsub("\n", ',').scan(/<textarea rows[^>]*(.*)<\/textarea>/).flatten.join.gsub(">", "").split(',')
 end
end



if ARGV.length == 0
 puts "#{$0}: You must enter at least one argument."
 exit
end

search_str = [ARGV[0]]

puts "Finding: #{search_str}"
output = ""

#bay_stores = {"1"=>"US", "3"=>"GB", "77"=>"DE", "71"=>"FR", "186"=>"ES", "146"=>"NL"}
bay_stores = {"3"=>"GB", "77"=>"DE", "71"=>"FR", "186"=>"ES", "146"=>"NL"}

puts "Creating the misspells..."
search_str << Misspell.a(search_str, ["skip_letter", "double_letters", "reverse_letters","skip_spaces","missed_key","inserted_key" ])
search_str.flatten!
puts "Search item with added misspells: #{search_str.join(",")}"

# Iterate through each eBay store
bay_stores.each_key do |siteid|

 # Iterate through each search string
 search_str.each do |query_string|

   # Put together an eBay parameter string
   ebay_params = {'callname'                     =>'FindItemsAdvanced',
     'appid'                       =>'TODO:_YOUR_OWN_EBAY_API_ID',
     'version'                     =>'553',
     'responseencoding'            =>'XML',
     'siteid'                      =>siteid,
     'MessageID                   '=>'',
     'BidCountMax                 '=>'',
     'BidCountMin                 '=>'',
     'CategoryHistogramMaxChildren'=>'',
     'CategoryHistogramMaxParents '=>'',
     'CategoryID                  '=>'',
     'CharityID                   '=>'',
     'Condition                   '=>'',
     'Currency                    '=>'',
     'DescriptionSearch           '=>'',
     'EndTimeFrom                 '=>'',
     'EndTimeTo                   '=>'',
     'ExcludeFlag                 '=>'',
     'FeedbackScoreMax            '=>'',
     'FeedbackScoreMin            '=>'',
     'GroupMaxEntries             '=>'',
     'GroupsMax                   '=>'',
     'IncludeSelector             '=>'',
     'ItemsAvailableTo            '=>'',
     'ItemsLocatedIn              '=>'',
     'ItemSort                    '=>'',
     'ItemType                    '=>'AllItemTypes',
     'MaxDistance                 '=>'',
     'MaxEntries                  '=>'',
     'ModTimeFrom                 '=>'',
     'PageNumber                  '=>'',
     'PaymentMethod               '=>'PayPal',
     'PostalCode                  '=>'',
     'PreferredLocation           '=>'',
     'PriceMax                    '=>'',
     'PriceMin                    '=>'',
     'ProductID                   '=>'',
     'Quantity                    '=>'',
     'QuantityOperator            '=>'',
     'QueryKeywords               '=> (query_string.gsub(' ', '%20')),
     'SearchFlag                  '=>'',
     'SellerBusinessType          '=>'',
     'SellerID                    '=>'',
     'SellerIDExclude             '=>'',
     'SortOrder                   '=>'',
     'StoreName                   '=>'',
     'StoreSearch                 '=>''
   }.map { |key,value| "#{key.strip}=#{value}" unless value.empty? }.join("&").squeeze('&')

   # Ask eBay what it knows about our query_string
   ebay_response = Net::HTTP.get_response('open.api.ebay.com', '/shopping?' << ebay_params)

   xml = REXML::Document.new(ebay_response.body)

   # Get basic information
   response3 =  Hash.from_xml(xml.to_s)

   xml.root.elements.each("/FindItemsAdvancedResponse/SearchResult/ItemArray/Item") do |element|
     item =  Hash.from_xml(element.to_s)
     puts ""
     puts ">> Searching for: #{query_string}, in #{bay_stores[siteid]}, got #{response3["FindItemsAdvancedResponse"]["TotalItems"]} results"
     puts item['Item']['Title']
     puts item['Item']['EndTime']
     puts item['Item']['ConvertedCurrentPrice']
     puts item['Item']['GalleryURL']
     puts item['Item']['ListingType']
     puts item['Item']['Condition']
     puts item['Item']['ViewItemURLForNaturalSearch']
   end
 end
end
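
The script above gets its misspellings from a remote service. If that service is not available, something similar can be generated locally. The sketch below covers three of the option names used above (skip_letter, double_letters, reverse_letters); the `misspellings` method and its exact typo rules are my own assumptions, not how the remote service works.

```ruby
# Local stand-ins for three of the typo options used above. The method name
# and the exact typo rules are assumptions, not the remote service's behaviour.
def misspellings(word)
  chars = word.chars
  typos = []

  chars.each_index do |i|
    # skip_letter: drop one character
    typos << (chars[0...i] + chars[(i + 1)..-1]).join
    # double_letters: repeat one character
    typos << (chars[0..i] + chars[i..-1]).join
    # reverse_letters: swap two neighbouring characters
    next if i == chars.size - 1
    swapped = chars.dup
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    typos << swapped.join
  end

  # Remove duplicates and anything identical to the correct spelling
  typos.uniq - [word]
end

puts misspellings("ibanez").first(5).inspect
```

The resulting list could be appended to search_str in place of the result of Misspell.a.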