I was one of the 170 or so people [220 actual count] at the Data Science London Hackathon over the weekend. As always this was well run by Carlos and his team who kept us fed, watered and connected to the Internet.
One of the three challenges involved a dataset containing pairs of Twitter users, A and B, where one of the pair had been ranked (by a person) as more influential than the other (the data was provided by PeerIndex, an event sponsor). The dataset contained 22 attributes, 11 for each user of the pair, plus a 0/1 value indicating which user was the more influential; there was a training dataset of 5.5K pairs to learn against and a test dataset to make predictions against. The data was not messy or even sparse; how hard could it be?
Talks had been organized for the morning and afternoon. While Microsoft (one of the event sponsors) told us about Azure and F#, I sat at the back trying out various machine learning packages. Yes, the technical evangelists told us, Linux as well as Windows instances were available in Azure; support was available for the usual big data languages (e.g., Python and R; the Microsoft people seemed to be much more familiar with Python) plus dot net (this was the first time I had heard dot net proposed as a big data solution for the cloud).
Some members of Team Outliers from previous hackathons (Jonny, Bob and me) formed a team, and after the talks had finished the Microsoft people and their partners were at our table (probably because our age distribution was similar to theirs, i.e., at the opposite end of the range from most teams; some of the Microsoft people got very involved in trying to produce a solution to the visualization challenge).
Integrating F# with big data seems to involve providing an interface to R packages (this is done by interfacing to the packages installed on a local R installation) and getting the IDE to know about the names of the columns contained in data that has been read. Since I think the world needs new general purpose programming languages as much as it needs a hole in the head, I won’t say any more.
When in challenge-solving mode I was using cross-validation to check the models being built, and scoring around 0.76 (AUC, the metric used by the organizers). Looking at the leaderboard later in the afternoon showed several teams scoring over 0.85, a big difference; what technique were they using to get such a big improvement?
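For those who want something concrete, below is a minimal sketch of the kind of cross-validated AUC check I mean; the model (a logistic regression via glm) and the 0/1 outcome column name (choice) are illustrative assumptions, not our actual hackathon code.

    # Sketch of k-fold cross-validated AUC in base R.
    # Assumes a data frame 'train' with a 0/1 outcome column 'choice'.
    auc <- function(labels, scores) {
      # Mann-Whitney formulation: the probability that a randomly chosen
      # positive is scored above a randomly chosen negative (ties get 0.5).
      r <- rank(scores)
      n_pos <- sum(labels == 1)
      n_neg <- sum(labels == 0)
      (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    }

    cv_auc <- function(train, k = 10) {
      folds <- sample(rep(1:k, length.out = nrow(train)))
      sapply(1:k, function(i) {
        fit <- glm(choice ~ ., data = train[folds != i, ], family = binomial)
        p <- predict(fit, newdata = train[folds == i, ], type = "response")
        auc(train$choice[folds == i], p)
      })
    }

    # mean(cv_auc(train))  # our models were hovering around 0.76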
A note: even when trained on data that uses 0/1 outcome values, machine learners don’t produce models that return exactly zero or one; many return values in the range 0..1 (some use other ranges), and the usual technique is to treat all values greater than 0.5 as a 1 (or TRUE or ‘yes’, etc.) and all other values as a 0 (or FALSE or ‘no’, etc.). This (x > 0.5) test had to be done to cross-validate models using the training data, and I was using the same technique for the test data. With an hour to go in the 24-hour hackathon we found out (the change from ‘I’ to ‘we’ is to spread the mistake around) that the required test data output was a probability in the range 0..1, not just a 0/1 value; the example answer had this behavior and this requirement was explained in the bottom right of the submission page! How many times have I told others to carefully read the problem requirements? Thankfully everybody was tired, and Jonny and Bob did not have the energy to throw me out of the window for leading them so badly astray.
Having AUC as the metric should have raised a red flag: it does not make much sense for a 0/1 answer, but it does make sense for PeerIndex, who will want to trade off recall against precision. Also, it’s a good idea to set one’s ego aside when asking the question: are lots of people doing something clever, or are you doing something stupid?
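To make the cost of the mistake concrete, here is a toy illustration (made-up numbers, reusing the auc helper from the sketch above): thresholding at 0.5 before submitting throws away exactly the ranking information that AUC rewards.

    labels <- c(1, 1, 1, 0, 0, 0)
    probs  <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.1)  # raw model scores
    hard   <- as.numeric(probs > 0.5)          # thresholded 0/1 submission

    auc(labels, probs)  # 0.889 -- near-misses still earn partial credit
    auc(labels, hard)   # 0.667 -- ties at 0 and 1 flatten the ranking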
While we are on the subject of doing the wrong thing, one of the top three teams gave an excellent example of why sales/marketing don’t like technical people talking to clients. Having just won a prize donated by Microsoft for an app using Azure, the team proceeded to give a demo and explain how they had done everything using Google services, making it appear within a browser frame as if it were hosted on Azure. A couple of us sitting at the back were debating whether Microsoft would jump in and disqualify them.
What did I learn this weekend that I did not already know? There are some R machine learning packages on CRAN that don’t include a predict function (there should be a research-only subsection of CRAN for packages like this), and some ranking algorithms need more than 6 GB of memory to process 5.5K pairs.
There seemed to be a lot more people using Python, compared to R. Perhaps having the sample solution in Python pushed the fence-sitters that way. There also seemed to be more women present, but that may have been because there were more people at this event than previous ones and I am responding to absolute numbers rather than percentages.
As part of Big Data Week, we are happy to announce that Hilary Mason will be giving a talk at our @ds_ldn group meeting on April 23rd, 2012. Whether you are new to data science or an experienced data scientist, this is an opportunity for you to see a great speaker and one of the most respected data scientists in the startup community.
Hilary Mason is the Chief Scientist at bit.ly, where she finds sense in vast data sets. Her work involves both pure research and development of product-focused features.
She is a former computer science professor with a background in machine learning and data mining, has published numerous academic papers, and regularly releases code on her personal site, www.hilarymason.com.
She’s also a co-founder of HackNY (hackny.org), a non-profit organization that connects talented student hackers from around the world with startups in NYC. She has discovered two new species, loves to bake cookies, and asks way too many questions.
Watch Hilary talking at Strata 2011: “What Data Tells Us”
We’re starting to see a deluge of companies whose businesses are all about making data analysis/science/insight “easy for the non-expert”. We’ve been here before, quite a few times sadly. When I started writing software 12 years ago, there was great excitement in the air: finally we could use tools to design software, then press a button that would create our whole beautiful design in code! Then we could just hire some barely-sentient code monkeys to fill in the ‘easy bits’ like method definitions and those pesky database access routines.
It was a disaster. The fundamental problem was that by the time you’d crafted your beloved design and polished it to a high shine, the world had moved on. What may have worked on day 1 of the project was now hopelessly inadequate. We should always remember the maxim “no plan survives contact with the enemy”, the enemy here being the shifting reality of what your software needs to deliver.
Another major problem with this approach was the proliferation of so-called Software Architects, beings of such insight and experience that they didn’t even need to code anymore! Since they didn’t code, they couldn’t experience the grinding pain of trying to jam their grandiose designs into a reality-shaped hole.
Fast-forward to today – data is big, Data Science is even bigger (as a buzzword anyway), and we’re all short of the right people. The answer, however, is not to make tools that hide the complex, ever-shifting reality of the analytical process. It’s to make people better at doing this stuff. And there’ll be no magic off-the-shelf solution that can achieve this, any more than giving a terrible golfer great clubs will make them win The Masters.
Early 2012 has seen the birth of two very interesting data-related meetup series in London. The first one (chronologically) is the London Machine Learning Meetup, which brings together academic faculty, graduate students and machine learning practitioners from industry. RangeSpan hosted the first meetup, with some 30 participants. The community has since grown to about 150 people, and around 80 turned up at the second event, organised by PeerIndex. Each week we plan to host one talk on academic research and another on an interesting machine learning project from industry. Last week we contributed the industry talk, about machine learning at PeerIndex; see the slides below:
The other new meetup, Data Science London, kicked off with an inaugural meetup themed “So What Is Data Science?”. The community has very quickly grown to make this one of the three largest data science meetups on Earth (according to the organisers). At the first meetup we had six speakers, among them people from DataSift, Forward and PeerIndex (see our short slides below). Building on this momentum, in April Londoners are going to organise the London Data Science Week, complete with a data hackathon.
The growing number of communities, activities and meetups in London reminded us what a wonderful place the city is for machine learning and data science. Not only does London have a host of startups and data-driven companies that apply data science in various domains (for example Last.fm, Mendeley, RangeSpan, Forward and of course PeerIndex), it is also home to a world-class academic machine learning community. UCL has an impressive list of machine learning and computational statistics celebrities, including John Shawe-Taylor, Yee Whye Teh, David Barber and Mark Girolami to name only a few, who teach amazing courses. Then there’s Imperial College, with a strong tradition in computer science. Royal Holloway’s Computer Learning Research Centre boasts names like Vapnik and Chervonenkis, the fathers of learning theory. And if that were not enough, Cambridge and Oxford are only an hour’s train ride away. In addition to its illustrious faculty (Zoubin Ghahramani, Carl Rasmussen, David MacKay), Cambridge is also home to Microsoft Research’s machine learning and computer vision research labs, where, among other things, infer.NET and the Xbox Kinect are being developed. London, and the UK in general, is on its way to emerging as a leading centre for machine learning, and we are very lucky to have our headquarters here.