In H P Lovecraft’s The Case of Charles Dexter Ward the villainous Curwen, having taken possession of the body of Charles Dexter Ward, uses a combination of chemistry and black magic to bring back from the dead the wisest people who have ever lived. He then tortures them for their secrets. Resurrection of the dead would be a bit of an over claim for machine learning but we can get some of the way: we can bring them back for book recommendations!
It’s a grandiose claim and you can judge for yourself how successful it is. Really I just wanted an opportunity to experiment with Apache Mahout and to brush up on my Java. I thought I’d share the results.
Apache Mahout is a scalable machine learning library written in Java (see previous post). The recommender part of the library takes lists of users and items (for example it could be all users on a website, the books they purchased and perhaps even their ratings of those books). Then for any given user, say Smith, it works out the users that are most similar to that user and compiles a list of recommendations based on their purchases (ignoring of course anything Smith already has). It can get more complicated but that’s the gist of it.
So all we need to supply the algorithm with is a list of users and items.
Casting about for a test data set I thought it might be interesting to take the influencers data set that I extracted from Wikipedia for a previous post. This data set is a list of who influenced who across all the famous people covered on Wikipedia. For example the list of influences for T. S. Elliot is: Dante Alighieri, Matthew Arnold, William Shakespeare, Jules Laforgue, John Donne, T. E. Hulme, Ezra Pound, Charles Maurras and Charles Baudelaire.
Why not use x is influencing y as a proxy for y has read x? Probably true in a lot of cases. We can imagine that the famous dead have been shopping on Amazon for their favourite authors and we now have the database. Even better we can tap into their likes and dislikes to give us some personalised recommendations.
It works (sort of) and was very interesting to put together. Here’s a demo:
Say you liked English Romantic poets. Then you might submit the names:
Click on the link to see what the web app returns. You’ll find that you need to hit refresh on your browser whenever you submit something. Not sure why at the moment.
Or you might try some painters you like:
Or even just …
Or you could upset things a bit by adding in someone incongruous!
Have a go. All you need to do is add some names separated by commas after the names= part in the http request. Note although my code does some basic things to facilitate matching, the names will have to roughly match what is in wikipedia so if you’re getting an error just look the person up there.
What are we getting back?
The first list shows, in Mahout terminology, your user neighborhood, that is individuals in wikipedia who are judged to have similar tastes to you. I have limited the output to 20 though in fact the recommender is set to use 80. I will explain this in more detail in later posts. They are not in rank order (I haven’t figured out how to do this yet.)
The second list is the recommendations derived from the preferences of your ‘neighbours’. Again I’ll explain this in more detail later (or you could look at Mahout in Action where it is explained nicely)
The code is available on github
Sometimes the app goes dead for some reason and I have to restart it. If that happens just drop me a line and I’ll restart it.
I plan to do several more posts on how this was done. At the moment I’m thinking
- A walk though of the code
- An explanation of how the recommender was tuned
- How it was deployed as a web app on Open Shift
But if anyone out there is interested in other things or has questions just let me know.