What was your background prior to entering this competition?
We are a data mining team from Shanda Innovations, a technology incubator of Shanda Corporation in China, a leading global interactive entertainment media group. We all graduated from top-tier universities in China with majors in Computer Science, and then started our careers at Chinese IT companies. Right before the EMI Competition, we were awarded second place in ACM KDD-Cup 2012.
What made you decide to enter this competition?
We wanted to put ourselves on the international stage, to compete, and to share our knowledge of data mining with peers from all over the world. We believed participating in this competition would be a precious opportunity to “meet” the talents in this field. In addition, the competition was very exciting and challenging. Given its tight time constraint, we viewed it as a chance for overnight team building.
What preprocessing and machine learning methods did you use?
We applied some preprocessing to Words.txt. We mapped the words that users chose to describe artists to keyword IDs and used these IDs as features in the logistic regression model, which greatly improved performance.
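As a rough illustration of this kind of mapping (a minimal sketch, not the team's actual code), the snippet below assigns each distinct descriptive word an integer keyword ID and turns a user's chosen words into a binary indicator vector that a logistic regression model could consume. The layout of Words.txt is not described here, so the sample rows, the lowercase normalization, and all function names are assumptions made for illustration.

```python
def build_keyword_vocab(word_lists):
    """Assign each distinct word an integer ID in first-seen order."""
    vocab = {}
    for words in word_lists:
        for word in words:
            key = word.strip().lower()  # assumed normalization step
            if key and key not in vocab:
                vocab[key] = len(vocab)
    return vocab


def to_keyword_ids(words, vocab):
    """Map words to sorted, unique keyword IDs, skipping unknown words."""
    ids = set()
    for word in words:
        key = word.strip().lower()
        if key in vocab:
            ids.add(vocab[key])
    return sorted(ids)


def to_indicator_vector(ids, vocab_size):
    """Turn a list of keyword IDs into a 0/1 feature vector."""
    vec = [0] * vocab_size
    for i in ids:
        vec[i] = 1
    return vec


# Hypothetical sample rows: words chosen by three users for some artist.
rows = [["Edgy", "Cool"], ["cool", "Boring "], ["Upbeat"]]
vocab = build_keyword_vocab(rows)
features = [to_indicator_vector(to_keyword_ids(r, vocab), len(vocab)) for r in rows]
```

Each row of `features` can then be appended to the other inputs of a logistic regression model as extra columns.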
The main machine learning methods were SVD++ and Logistic Regression.
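For readers unfamiliar with SVD++, its prediction rule combines a global mean, user and item biases, and a dot product between the item factors and a user vector that is enriched with implicit feedback: r̂ = μ + b_u + b_i + q_i · (p_u + |N(u)|^(-1/2) Σ y_j). The sketch below implements only this standard prediction formula in plain Python. It is not the team's SVDFeature configuration, and the parameter names and sample values are hypothetical.

```python
import math


def svdpp_predict(mu, b_u, b_i, p_u, q_i, implicit_factors):
    """Standard SVD++ prediction for one (user, item) pair.

    mu               -- global mean score
    b_u, b_i         -- user and item bias terms
    p_u, q_i         -- explicit user and item latent factors (same length)
    implicit_factors -- one factor vector y_j per item in the user's
                        implicit feedback set N(u)
    """
    user_vec = list(p_u)
    n = len(implicit_factors)
    if n > 0:
        norm = 1.0 / math.sqrt(n)  # |N(u)|^(-1/2) normalization
        for y_j in implicit_factors:
            for f in range(len(user_vec)):
                user_vec[f] += norm * y_j[f]
    interaction = sum(q * u for q, u in zip(q_i, user_vec))
    return mu + b_u + b_i + interaction


# Hypothetical parameter values, chosen only to exercise the formula.
prediction = svdpp_predict(
    mu=50.0, b_u=2.0, b_i=-1.0,
    p_u=[1.0, 0.0], q_i=[2.0, 1.0],
    implicit_factors=[[1.0, 1.0]] * 4,
)
```

In practice the factors and biases are learned by minimizing regularized squared error on the training scores; only the scoring step is shown here.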
What was your most important insight into the data?
For most users the training data was very sparse, so we needed to integrate more features from other sources. For instance, the user profiles, the words users chose, and the survey results would be very valuable.
It is easy to get trapped in a fixed mindset and ignore some potentially important indicators. Keeping an open mind is easier said than done.
Were you surprised by any of your insights?
We were very surprised to find that the variation in track scores given by different people was much larger than we expected. For instance, User ID 41072 gave track 156 a score of 100, whereas User ID 41286 gave the same track merely 4! It was very interesting to find that people differ so much in music preference, and we believe that is why so many different types of music exist. With further in-depth data analysis we may discover more about people's music interests.
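One simple way to look for this kind of disagreement is to compute the gap between the highest and lowest score each track received. The sketch below does this for the two ratings quoted above; the tuple layout `(user_id, track_id, score)` and the function name are assumptions for illustration, not the format of the competition files.

```python
from collections import defaultdict


def score_range_by_track(ratings):
    """Return {track_id: max score - min score} over all ratings of a track."""
    scores = defaultdict(list)
    for _user_id, track_id, score in ratings:
        scores[track_id].append(score)
    return {track: max(vals) - min(vals) for track, vals in scores.items()}


# The two ratings mentioned in the interview, as (user_id, track_id, score).
ratings = [(41072, 156, 100), (41286, 156, 4)]
spread = score_range_by_track(ratings)
```

On the full data, sorting tracks by this gap (or by the standard deviation of their scores) would highlight where listeners disagree most.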
Which tools and programming language did you use?
We used C++ and Python. We would like to express our gratitude to the APEX team, the authors of SVDFeature, the open source toolkit used in our solution.
What have you taken away from the Music Data Science Hackathon?
We had a lot of fun and our team became more united. More importantly, we came to think about a broader range of data mining applications. It is interesting to see our technology being used in other industries, such as the music and entertainment industry this time. This was an eye-opening experience that brought a lot of spark to our routine work.
What did you think of the 24 hour hackathon format?
It was a very exciting 24-hour roller coaster experience. We would like to thank the organizers for holding this interesting and meaningful competition. Looking forward to the next round!
Here is our approach
And here is our code on GitHub