What was your background prior to entering this competition?
I’m a Research Fellow in glaciology at the University of Michigan. Prior to this I’ve been involved in a number of Kaggle competitions, including mapping dark matter, automated essay scoring, and predicting shopper behaviour.
What made you decide to enter?
As I said, I’m a regular Kaggle competitor, and I enjoyed the opportunity to do something quick and fun, rather than getting bogged down in a multi-month project.
What preprocessing and machine learning methods did you use?
I did some very minimal preprocessing to get the data into a usable form, then threw a wide variety of machine learning algorithms at it. I had the most success with random forests.
What was your most important insight into the data?
It’s really important to look at individual user biases. A 50 from one person could be the equivalent of a 70 from someone else. It’s also really effective to divide things up by artist. You can get away with very simple models if you have a separate one for each artist.
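To make the two ideas above concrete, here is a minimal sketch in Python. It is not the code from the competition entry (that was written in R); it is only one simple way to combine a per-user offset with a separate, very simple per-artist model. The toy ratings and the names `ratings`, `fit_biases`, and `predict` are all hypothetical, and the per-artist "model" is just a mean of bias-corrected ratings, standing in for whatever simple model one might fit per artist.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (user, artist, rating on a 0-100 scale).
# u1 rates everything 20 points lower than u2; u3 rates 20 points higher.
ratings = [
    ("u1", "a1", 50), ("u1", "a2", 30),
    ("u2", "a1", 70), ("u2", "a2", 50),
    ("u3", "a1", 90), ("u3", "a2", 70),
]

def fit_biases(rows):
    """Estimate a per-user offset, then a per-artist mean of corrected ratings."""
    global_mean = mean(r for _, _, r in rows)

    # User bias: how far a user's average rating sits from the global average.
    by_user = defaultdict(list)
    for user, _, r in rows:
        by_user[user].append(r)
    user_bias = {u: mean(v) - global_mean for u, v in by_user.items()}

    # One tiny "model" per artist: the mean rating after removing user bias.
    by_artist = defaultdict(list)
    for user, artist, r in rows:
        by_artist[artist].append(r - user_bias[user])
    artist_mean = {a: mean(v) for a, v in by_artist.items()}

    return global_mean, user_bias, artist_mean

def predict(model, user, artist):
    """Predict a rating; unseen users get no offset, unseen artists the global mean."""
    global_mean, user_bias, artist_mean = model
    return artist_mean.get(artist, global_mean) + user_bias.get(user, 0.0)

model = fit_biases(ratings)
print(predict(model, "u1", "a1"))  # a low-rating user's score for artist a1
print(predict(model, "u3", "a1"))  # a high-rating user's score for the same artist
```

In this toy data, once the user offsets are removed, all three users agree about each artist, which is the sense in which a 50 from one person can be equivalent to a 70 from someone else.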
Were you surprised by any of your insights?
I was really surprised how little demographic factors mattered. I assumed that age and gender would be really good predictors of musical taste, but they turned out to be not so great.
Which tools and programming language did you use?
I wrote all my code in R, except for some C extensions.
What have you taken away from the Music Data Science Hackathon?
I really need to read up on modern collaborative filtering techniques. I managed to do okay with just generic machine learning algorithms, and some basic stuff I picked up from the Netflix Prize papers, but there’s a lot more stuff out there nowadays which would have made things easier.
What did you think of the 24 hour hackathon format?
I really enjoyed it. Looking forward to the next one!
Here is my approach
And here is my code on GitHub