Wang Qing 2nd Prize Winner – EMI Music Data Science Hackathon

What was your background prior to entering this competition?

I’ve just graduated from college with a bachelor’s degree in statistics. I was also a 3rd prize winner in a previous competition, “bioresponse”, held on Kaggle. I’m interested in many fields of machine learning and artificial intelligence.

What made you decide to enter?

I thought this would be a quick game and wouldn’t take too much time. Also, the data is relatively small compared to KDD Cup, so I could play with it a little more and try more models. And yes, I enjoy games.

What preprocessing and machine learning methods did you use?

When I got the data files, the first thing I did was look through them and try to figure out what they were. Then I tried to put them into a form that machines could understand: for example, replacing ‘Fair’, ‘Good’, ‘Excellent’ with 1, 2, 3, or making dummy variables representing different artists, and so on. After preprocessing, the machine-ready data files are fed to the models. This time I used libFM and gbm.
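To make the encoding step concrete, here is a minimal sketch in Python. It is not the code used in the competition: the column names (`artist`, `rating_word`), the sample rows, and the `encode` helper are all made up for illustration. It only shows the two ideas mentioned above, mapping ordered words to integers and expanding a categorical column into dummy variables.

```python
# Hypothetical survey rows; column names and values are illustrative only,
# not taken from the actual competition data.
rows = [
    {"artist": "A", "rating_word": "Good"},
    {"artist": "B", "rating_word": "Excellent"},
    {"artist": "A", "rating_word": "Fair"},
]

# Ordered categories become integers that preserve their order.
ORDINAL = {"Fair": 1, "Good": 2, "Excellent": 3}


def encode(rows):
    """Return numeric feature dicts: ordinal rating plus one dummy per artist."""
    # Collect the distinct artists so every row gets the same dummy columns.
    artists = sorted({r["artist"] for r in rows})
    encoded = []
    for r in rows:
        features = {"rating_word": ORDINAL[r["rating_word"]]}
        for a in artists:
            # 1 if this row belongs to artist `a`, otherwise 0.
            features[f"artist_{a}"] = 1 if r["artist"] == a else 0
        encoded.append(features)
    return encoded


encoded = encode(rows)
for row in encoded:
    print(row)
```

Each output row contains only numbers, so it could be written out in whatever input format the chosen model expects.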

What was your most important insight into the data?

From the forum discussion after the competition, I found that many people achieved very good scores without using any ordinary CF (collaborative filtering) models. No matrix factorization, nearest-neighbor or factor model methods, just plain random forests or other general-purpose models. The difference between formal CF models and the others is not as big as I thought it would be. I’m a little surprised by this.

Were you surprised by any of your insights?

Yes. I still don’t really understand the phenomenon I just mentioned. Maybe it’s because enough data is provided that we don’t need to find those latent factors. The values of the factors may already be provided in the tags of the music. That may be an explanation.

Which tools and programming language did you use?

I use R and Python. R is used mainly for computing and statistics, while Python is mainly for quick programming and scripting.

What have you taken away from the Music Data Science Hackathon?

It’s a very good experience to participate in a competition like this. You can learn many things that you would never have learnt from textbooks or papers. You also get a real feeling for how data fit the model, and at the same time it’s a good chance to benchmark your skills.

What did you think of the 24 hour hackathon format?

Personally I think it’s more exciting and thrilling than a 2 month ‘marathon’ :) People may get tired during a long competition, although a short one is not an easy job either, since you have to dash from the start line to the finish line like in a 100m race, with all your focus, and try not to make mistakes if you are eager to win. However, it’s also more enjoyable if you have an athletic spirit, aiming to be Faster, Higher and Stronger.

Here is my approach

And here is my code on GitHub

Data Science London © 2017