What was your background prior to entering this competition?
I got my PhD in Computer Science from Southeast University in China 10 years ago, and then moved to Singapore to work as a postdoctoral research fellow under the supervision of Prof Wee Sun Lee. It was very kind of him to send me to the first-ever Machine Learning Summer School in 2002; that’s when I discovered the fascinating world of machine learning. Since then my research has been centred around using statistical machine learning techniques to improve information retrieval and organisation. I am currently a Senior Lecturer in Computer Science at Birkbeck, University of London. I have also joined the Royal Statistical Society and learned about loads of interesting work done by statisticians.
What made you decide to enter?
Why not? It only costs one day. I had played in a couple of Kaggle competitions, but had never participated in a hackathon before. Concentrating my mind on a problem continuously for 24 hours (well, except for a few quick naps) certainly sounded like an interesting, sporty challenge. The timing couldn’t have been better: the College’s summer vacation had just started, and the whole country was in a sporting mood (the tears for Andy Murray, the cheers for Bradley Wiggins, and of course the anticipation of the London 2012 Olympics).
What preprocessing and machine learning methods did you use?
I manually cleaned and encoded the data files using just a few Unix tools (cat, cut, split, grep, sort, wc, etc.) and a text editor (search, replace, etc.). I chose Random Forest as the machine learning method for this problem, inspired by its great success in a number of data mining competitions. Once again, Random Forest proved to be amazingly powerful for complex classification/regression problems with many different types of features.
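For readers who want to see the general shape of this step, here is a minimal sketch of encoding categorical columns as numbers and fitting a Random Forest with scikit-learn. It is not the code used in the competition: the data is synthetic, the column names (`age`, `region`, `gender`, `rating`) are hypothetical, and the encoding is done with scikit-learn's `OrdinalEncoder` rather than with Unix tools and a text editor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.RandomState(0)
n = 200

# Hypothetical mixed-type columns (synthetic, NOT the hackathon data).
age = rng.randint(16, 70, size=n)
region = rng.choice(["north", "south", "midlands"], size=n)
gender = rng.choice(["female", "male"], size=n)

# Synthetic target so that the example has something to learn.
rating = 50 + 0.3 * age + 5 * (region == "south") + rng.normal(0, 2, size=n)

# Encode the string-valued columns as numeric codes.
encoder = OrdinalEncoder()
categorical = encoder.fit_transform(np.column_stack([region, gender]))

# Combine numeric and encoded categorical features into one matrix.
X = np.column_stack([age, categorical])

# Fit on the first 150 rows and evaluate on the remaining 50.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:150], rating[:150])
predictions = model.predict(X[150:])

rmse = float(np.sqrt(np.mean((predictions - rating[150:]) ** 2)))
print("held-out RMSE:", round(rmse, 3))
```

Tree-based models do not require feature scaling, and simple integer codes for categories are often good enough for them, which is one reason lightweight preprocessing can work with this method.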
What was your most important insight into the data?
As I added more and more features to the Random Forest model, the predictive performance kept improving, though only a little each time. The final model includes all the features available in the given dataset. The contribution of each individual feature might be small, but altogether they could make a noticeable difference. Maybe as Tesco says, “every little helps”!
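The effect of adding features one at a time can be checked with cross-validation. The sketch below is an illustration on synthetic data, not the competition experiment: it assumes a target to which every feature contributes a small amount, and reports the mean cross-validated R² of a Random Forest as each extra column is included.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples, n_features = 300, 6

# Synthetic data: each feature makes a small contribution to the target.
X = rng.normal(size=(n_samples, n_features))
y = X.sum(axis=1) + rng.normal(0, 0.5, size=n_samples)

scores = []
for k in range(1, n_features + 1):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    # Mean 5-fold cross-validated R^2 using only the first k features.
    score = cross_val_score(model, X[:, :k], y, cv=5).mean()
    scores.append(score)
    print(f"first {k} feature(s): mean R^2 = {score:.3f}")
```

On real data the gains per feature are usually less regular than in this toy setup, so it is worth comparing scores across several folds before concluding that a feature helps.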
Were you surprised by any of your insights?
I was a little bit surprised, as I had thought that at least some demographic features would be irrelevant.
Which tools and programming language did you use?
I only used Python for coding in this competition. The implementation of Random Forest was from the Python machine learning library scikit-learn.
What have you taken away from the Music Data Science Hackathon?
It turned out to be an even more rewarding experience than I had expected. Since I attended the event at the London venue, I was not only able to receive a local prize, but also to meet other local data scientists and establish industrial collaborations for the cloud computing course that I’ll teach. There was a burst of new LinkedIn connections after the hackathon. It’s thrilling to see such a vibrant data science community in London, which includes software engineers, business analysts, market researchers, entrepreneurs, students and professors.
What did you think of the 24 hour hackathon format?
The 24 hour hackathon format opens up opportunities to those who are more time-constrained, such as busy academics, and therefore helps to widen participation in the sport of data science. It is indeed meaningful to see how much knowledge can be dug out from a new dataset within a day, but it is not necessary to stop there. How about starting a data mining competition with a hackathon, and then continuing to run it over a few months? I hope to see more and more data mining competitions adopt a hackathon as their first phase.
Here is my approach
And here is my code on GitHub