Building a community of Data Scientists.
Passionate about Data Science.
Posted: August 18th, 2012 | Author: | Filed under: Data Science, Data Scientist, Data Visualization | Tags: data visualization, music data, music data science | No Comments »

What was your background prior to entering this competition?
I am the CEO of Musicmetric, a music analytics company I founded along with two others in 2008. We supply data and insights to the music industry via dashboards and APIs. We track consumer activity on social networks, peer-to-peer networks, blogs and news sources and combine this with sales, radio airplay and streaming data. My academic background is computational physics.
What made you decide to enter the Music Data Science Hackathon?
I was originally going to enter the recommendation system part of the competition, but I realised I don’t know enough about that to have much chance against the experts that entered! The data visualisation part gave a nice opportunity to explore the consumer data in a different way and to make it more visually accessible, so I went for that.
What approach did you follow to generate the visualization?
The dataset was an anonymised set of questionnaire responses about different artists, such as “do you like this artist” or “use a word to describe the artist”.
The first thing was sorting out the data. The artists were anonymised, so an artist-centric visualisation (plotting artist IDs with no context) wouldn’t have been very interesting. So I wrote some scripts to transform the data to be word and demographic centric (‘words’ are the words used by respondents to describe the artists). Once that pre-processing was complete I ran some analysis:
1. Working out the words most used by females compared to males and vice versa: This was a pretty simple script to compare the number of times each word was used by each gender, and rank based on the difference. The results were quite interesting – males tend to use negative words like “boring” and “unoriginal” much more and females use descriptive words like “beautiful” and “distinctive”.
2. Comparing artist rating to the sentiment of the words used to describe it: The respondents (users) to the questionnaire were asked to give the artist a rating between 0 and 100 on how much they liked them, along with a list of words to describe them. I used the Musicmetric Sentiment Analysis API to analyse the sentiment of the words used and plotted these against the user rating of the artist in question. The results are quite a linear fit, which shows that people use negative words to describe artists they don’t like. Pretty obvious – but it helps validate the usefulness of the dataset.
3. Looking at average artist rating based on the user’s job category: I wanted to see if job type influenced music taste at all, by plotting the average artist rating based on the job category of the person rating the artist. The results show a distribution, with retired people liking the set of artists the most. Since the data was a subset of the full dataset, I’d need to test this on more data to fully validate the significance. But it helped fill some space on the infographic…
4. Clustering the words based on artists linked to them: Because I had transformed the data to be word centric, I created a vector for each word that used the artists as dimensions, with the number of times that word was linked to the artist as the value for each dimension. I then used K-means clustering to visualise whether there were similarities between the words based on their overlap of use to describe artists. It turns out there are similarities, with words such as Beautiful, Sensitive, Passionate and Inspiring forming a cluster and Superficial, Unattractive, Cheap, Unoriginal forming another. I think the hierarchical dendrogram below the cluster plot shows the similarity between the words much better (the closer two words are to each other, the more similar they are) so I put that in too.
5. Finally, because everyone likes a map, I plotted the percentage of respondents who own all the music on EMI based on UK region.
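Steps 1 and 4 above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the response tuples are made-up toy data, not the EMI dataset, the function names are hypothetical, and cosine similarity is used here as a simple stand-in for the K-means and dendrogram analysis described in the interview.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy responses: (gender, anonymised artist ID, words chosen).
responses = [
    ("F", "a1", ["beautiful", "distinctive"]),
    ("F", "a2", ["beautiful", "passionate"]),
    ("M", "a3", ["boring", "unoriginal"]),
    ("M", "a1", ["beautiful"]),
    ("M", "a3", ["boring", "cheap"]),
    ("F", "a3", ["unoriginal"]),
]

def gender_word_gap(rows):
    """Rank words by (female count - male count), most female-skewed first."""
    counts = {"F": Counter(), "M": Counter()}
    for gender, _artist, words in rows:
        counts[gender].update(words)
    vocab = set(counts["F"]) | set(counts["M"])
    gaps = {w: counts["F"][w] - counts["M"][w] for w in vocab}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(gaps.items(), key=lambda kv: (-kv[1], kv[0]))

def word_vectors(rows):
    """Map each word to {artist ID: times the word was linked to that artist}."""
    vectors = defaultdict(Counter)
    for _gender, artist, words in rows:
        for w in words:
            vectors[w][artist] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse artist-count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

ranking = gender_word_gap(responses)
vectors = word_vectors(responses)
```

Words that are used for the same artists end up with similar vectors, which is the property a clustering algorithm would then exploit.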
What was your most important insight into the data?
The fact that a lot of the results made sense by inspection helps validate the usefulness of the answers supplied by the users. If they are consistently using negative sentiment words to describe artists they don’t like, then it shows they are not giving fake answers. Likewise, the fact that similar words do actually cluster based on the artists they were used to describe shows that the data is useful for creating a recommendation system.
The fact that females and males use different words to describe artists could also be taken into account when designing a recommendation system.
Were you surprised by any of your insights?
A lot of the insights are fairly obvious looking at them, but I was surprised that this happened. A lot of datasets don’t create such nice results!
Which data visualization tools and programming language did you use?
Everything was done in Python, but I used Orange Visualisation to explore some of the clustering parameters more efficiently.
What have you taken away from the Music Data Science Hackathon?
From the results of the rec-sys part of the competition, it seemed that simple is better. Segmenting by loads of different parameters, or trying to craft a perfect rule-based model, didn’t work as well as throwing the data into an ensemble machine-learning model and letting it run (the method that the winner used). That being said, I don’t know a huge amount about recommendation systems so with more time and research that probably isn’t the case, plus those models may not be scalable. But 24 hours is quite a short time to build such complexity from scratch…
What did you think of the 24 hour hackathon format?
It was good, and the venue was great too. I’m looking forward to the next one.
Here is my Data Visualization:
Posted: August 4th, 2012 | Author: | Filed under: Data Science, Data Scientist | 2 Comments »

What was your background prior to entering this competition?
We are a team focused on data mining from Shanda Innovations, a tech incubator of Shanda Corporation of China, a leading global interactive entertainment media group. We all graduated from top tier universities in China, majored in Computer Science, and then started our careers in Chinese IT companies. Right before the EMI Competition, we were awarded second place in ACM KDD-Cup 2012.
What made you decide to enter this competition?
We would love to put ourselves on the international stage, to compete, and to share our knowledge of data mining with peers from all over the world. We believed participating in this competition would be a precious opportunity to “meet” all the talents in this field. In addition, the competition was very exciting and challenging. Given its tight time constraint, we viewed it as a chance for overnight team building.
What preprocessing and machine learning methods did you use?
We applied some preprocessing to Words.txt. We mapped the words that users chose to describe artists to keyword IDs and used these IDs in the logistic regression model, which greatly improved the performance.
The main machine learning methods were SVD++ and Logistic Regression.
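The word-to-ID mapping mentioned above might look roughly like the sketch below. The team's actual preprocessing of Words.txt is not described here, so the normalisation (trimming and lower-casing), the function names and the toy word lists are all assumptions made for illustration.

```python
def build_keyword_ids(word_lists):
    """Assign an integer ID to each distinct word, in first-seen order."""
    ids = {}
    for words in word_lists:
        for w in words:
            key = w.strip().lower()  # assumed normalisation
            if key and key not in ids:
                ids[key] = len(ids)
    return ids

def encode(words, ids):
    """Return the sorted keyword IDs for one response, skipping unknown words."""
    keys = (w.strip().lower() for w in words)
    return sorted({ids[k] for k in keys if k in ids})

def to_indicator(id_list, size):
    """Expand a list of keyword IDs into a 0/1 feature row of length `size`."""
    row = [0] * size
    for i in id_list:
        row[i] = 1
    return row

# Hypothetical word lists; the third one contains a word unseen when building IDs.
rows = [["Edgy", "Cool"], ["cool ", "Boring"], ["Unknown"]]
ids = build_keyword_ids(rows[:2])
features = [to_indicator(encode(r, ids), len(ids)) for r in rows]
```

Each response becomes a row of binary indicators, which is the kind of input a logistic regression model can consume directly.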
What was your most important insight into the data?
For most users the training data was very sparse. Therefore, we should integrate more features from other aspects. For instance, the user profiles, words they chose, and survey results would be very valuable.
Sometimes it is easy to get trapped in a fixed mindset and ignore potentially important indicators. Keeping an open mind is easier said than done.
Were you surprised by any of your insights?
We were very surprised to find that the variation in the track scores given by different people was a lot larger than we expected. For instance, User ID 41072 gave track 156 a score of 100 whereas User ID 41286 gave the same track merely 4! It was very interesting to find that people were so different in music preference, and we believe that is why so many different types of music exist. With further in-depth data analysis we may discover more about people’s music interests.
Which tools and programming language did you use?
The programming languages included C++ and Python. We would like to express our gratitude to the APEX team, the authors of an open source toolkit called SVDFeature that we used in our solution.
What have you taken away from the Music Data Science Hackathon?
We had a lot of fun and our team became more united. More importantly, we came to think about a broader area of data mining applications. It is interesting to see our technology being used in other industries, for example the music and entertainment industry this time. This was an eye-opening experience that brought a lot of sparkle to our routine work.
What did you think of the 24 hour hackathon format?
It was a very exciting 24-hour roller coaster experience. We would like to thank the organizers for holding this interesting and meaningful competition. Looking forward to the next round!
Here is our approach
And here is our code in github
https://github.com/fancyspeed/codes_of_innovations_for_emi/tree/master/code_of_innovations
Posted: August 4th, 2012 | Author: | Filed under: Data Science, Data Scientist | 1 Comment »

What was your background prior to entering this competition?
I got my PhD in Computer Science from the Southeast University in China 10 years ago, and then moved to Singapore to work as a postdoc research fellow under the supervision of Prof Wee Sun Lee. It was very kind of him to send me to the first-ever Machine Learning Summer School in 2002 – that’s when I discovered the fascinating world of machine learning. Since then my research has been centred around using statistical machine learning techniques to improve information retrieval and organisation. I am currently a Senior Lecturer in Computer Science at Birkbeck, University of London. I have also joined the Royal Statistical Society and learned loads of interesting stuff done by statisticians.
What made you decide to enter?
Why not? It only costs one day. I have played in a couple of Kaggle competitions, but had never participated in a hackathon before. Concentrating my mind on a problem continuously for 24 hours (well, except for a few quick naps) certainly sounded like an interesting, sporty challenge. The timing couldn’t have been better: the College’s summer vacation had just started; the whole country was in the atmosphere of sports (the tears for Andy Murray, the cheers for Bradley Wiggins, and of course the anticipation of the London 2012 Olympics).
What preprocessing and machine learning methods did you use?
I manually cleaned and encoded the data files just using some Unix tools (cat, cut, split, grep, sort, wc, etc.) and a text editor (search, replace, etc.). I chose the machine learning method Random Forest to attack this problem, inspired by its great success in a number of data mining competitions. Once again, Random Forest proved to be amazingly powerful for complex classification/regression problems with many different types of features.
What was your most important insight into the data?
When I added more and more features into the Random Forest model, the predictive performance kept improving, though each time only a little bit. The final model includes all features available in the given dataset. The contribution of each individual feature might be small, but altogether they could make a noticeable difference. Maybe as Tesco says, “every little helps”!
Were you surprised by any of your insights?
I was a little bit surprised, as I thought that at least some demographic features would be irrelevant.
Which tools and programming language did you use?
I only used Python for coding in this competition. The implementation of Random Forest was from the Python machine learning library scikit-learn.
What have you taken away from the Music Data Science Hackathon?
It turned out to be an even more rewarding experience than I had expected. Since I attended the event in the London venue, I was not only able to receive a local prize, but also to meet other local data scientists and establish industrial collaborations for the cloud computing course that I’ll teach. There was a burst in my LinkedIn connections after the hackathon. It’s thrilling to see such a vibrant data science community in London, which includes software engineers, business analysts, market researchers, entrepreneurs, students and professors.
What did you think of the 24 hour hackathon format?
The 24 hour hackathon format opens opportunities to those who are more time-constrained, such as busy academics, and therefore helps to widen participation in the sport of data science. It is indeed meaningful to see how much knowledge can be dug out from a new dataset within a day, but it is not necessary to stop there. How about starting a data mining competition with a hackathon, and then continuing to run it over a few months? I hope to see more and more data mining competitions adopt a hackathon as their first phase.
Here is my approach
And here is my code in github
https://github.com/dell-zhang/zmusic_code
Posted: August 4th, 2012 | Author: | Filed under: Data Science, Data Scientist | No Comments »

What was your background prior to entering this competition?
I’ve just graduated from college with a bachelor’s degree in statistics. I’m also a 3rd prize winner in a previous competition, “bioresponse”, held on Kaggle. I’m interested in many fields of machine learning and artificial intelligence.
What made you decide to enter?
I thought this would be a quick game and wouldn’t take too much time. Also the data is relatively small compared to KDD Cup, so I could play with it a little bit more and try more models. And yes, I enjoy games.
What preprocessing and machine learning methods did you use?
When I got the data files, the first thing I did was look through them and try to figure out what they were. Then I tried to put them in a form that machines could understand, for example replacing ‘Fair’, ‘Good’, ‘Excellent’ with 1, 2, 3, or making dummy variables representing different artists, and so on. After preprocessing, the machine-ready data files were fed to the models. This time I used libFM and gbm.
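The two encodings mentioned in this answer can be illustrated with a short Python sketch. The column values, the artist IDs, the helper names and the choice of 0 for unrecognised answers are assumptions for the example, not details of the competitor's actual scripts.

```python
# Ordered survey answers mapped to integers, as described in the answer above.
ORDINAL = {"Fair": 1, "Good": 2, "Excellent": 3}

def encode_ordinal(value, mapping=ORDINAL, missing=0):
    """Replace an ordered category with its rank; unknown values get `missing` (assumed 0)."""
    return mapping.get(value, missing)

def dummy_columns(values):
    """One 0/1 column per distinct level, with levels in sorted order."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical anonymised artist IDs.
artists = ["a7", "a2", "a7"]
levels, dummies = dummy_columns(artists)
scores = [encode_ordinal(v) for v in ["Good", "Excellent", "n/a"]]
```

The ordinal mapping keeps the ordering information in a single numeric column, while the dummy columns avoid implying any order between artists.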
What was your most important insight into the data?
From the discussion in the forum after the competition, I found that many people achieved very good scores without using any ordinary CF models: no matrix factorization, nearest-neighbour or factor model methods, just plain random forest or other general-purpose models. The difference between formal CF models and the others is not as big as I thought it would be. I’m a little surprised by this.
Were you surprised by any of your insights?
Yes. I still haven’t really understood the phenomenon I just mentioned. Maybe it’s because sufficient data were provided, so that we did not need to find those latent factors. The values of the factors may already be provided in the tags of the music. That may be an explanation.
Which tools and programming language did you use?
I used R and Python. R was used mainly for computing and statistics, while Python was mainly for quick programming and scripting.
What have you taken away from the Music Data Science Hackathon?
It’s a very good experience to participate in a competition like this. You can learn many things that you would never have learnt from textbooks or papers. You can also get a real feeling for how data fit the model, and at the same time it’s a good chance to benchmark your skills.
What did you think of the 24 hour hackathon format?
Personally I think it’s more exciting and thrilling than a 2 month ‘marathon’.
People may feel tired during a long competition, although a short one is not an easy job either, since you have to dash from the start line to the finish line as in a 100m race, with all your focus, and try not to make mistakes if you are eager to win. However, it’s also more enjoyable if you have an athletic spirit, aiming to be Faster, Higher and Stronger.
Here is my approach
And here is my code in github
https://github.com/lns/music-hackathon
Posted: August 4th, 2012 | Author: | Filed under: Data Science, Data Scientist | No Comments »

What was your background prior to entering this competition?
I’m a Research Fellow in glaciology at the University of Michigan. Prior to this I’ve been involved in a number of Kaggle competitions, including mapping dark matter, automated essay scoring, and predicting shopper behaviour.
What made you decide to enter?
As I said, I’m a regular Kaggle competitor, and I enjoyed the opportunity to do something quick and fun, rather than getting bogged down in a multi-month project.
What preprocessing and machine learning methods did you use?
I did some very minimal pre-processing to get the data into a usable form, then threw a wide variety of machine learning algorithms at it. I had the most success with random forests.
What was your most important insight into the data?
It’s really important to look at individual user biases. A 50 from one person could be the equivalent of a 70 from someone else. It’s also really effective to divide things up by artist. You can get away with very simple models if you have a separate one for each artist.
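The point about user biases can be made concrete with a minimal Python sketch. It assumes ratings arrive as (user, artist, score) tuples; the toy numbers and function names are hypothetical, and a per-artist average offset stands in for the "separate model per artist" idea in its simplest possible form, not the competitor's actual R code.

```python
from collections import defaultdict
from statistics import mean

def user_means(ratings):
    """Average score each user gives, i.e. that user's personal baseline."""
    by_user = defaultdict(list)
    for user, _artist, score in ratings:
        by_user[user].append(score)
    return {u: mean(s) for u, s in by_user.items()}

def center_by_user(ratings, means):
    """Subtract each user's baseline so scores are comparable across users."""
    return [(u, a, s - means[u]) for u, a, s in ratings]

def artist_offsets(centered):
    """Per-artist average of the centred scores (a very simple per-artist model)."""
    by_artist = defaultdict(list)
    for _user, artist, residual in centered:
        by_artist[artist].append(residual)
    return {a: mean(r) for a, r in by_artist.items()}

def predict(user, artist, means, offsets, global_mean):
    """User baseline plus artist offset; unseen users fall back to the global mean."""
    return means.get(user, global_mean) + offsets.get(artist, 0.0)

# Hypothetical toy ratings: u1 scores low overall, u2 scores high overall.
ratings = [("u1", "a1", 50), ("u1", "a2", 30), ("u2", "a1", 70), ("u2", "a2", 50)]
means = user_means(ratings)
centered = center_by_user(ratings, means)
offsets = artist_offsets(centered)
global_mean = mean(s for _u, _a, s in ratings)
```

In this toy data a 50 from u1 and a 70 from u2 both sit 10 points above the user's own average, so after centring they carry the same signal about artist a1.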
Were you surprised by any of your insights?
I was really surprised how little demographic factors mattered. I assumed that age and gender would be really good predictors of musical taste, but they turned out to be not so great.
Which tools and programming language did you use?
I wrote all my code in R, except for some C extensions.
What have you taken away from the Music Data Science Hackathon?
I really need to read up on modern collaborative filtering techniques. I managed to do okay with just generic machine learning algorithms, and some basic stuff I picked up from the Netflix Prize papers, but there’s a lot more stuff out there nowadays which would have made things easier.
What did you think of the 24 hour hackathon format?
I really enjoyed it. Looking forward to the next one!
Here is my approach
And here is my code in github
https://github.com/mewo2/musichackathon
Posted: August 3rd, 2012 | Author: | Filed under: Data Science, Data Scientist | No Comments »
I had a lot of fun working as the Data Team Leader in this hackathon, and I really enjoyed being part of this project since day one. Here is a brief report and some of my notes.
The Hackathon. The Music Data Science Hackathon has been one of the most successful Data Science London events to date; it attracted over 175 data scientists across the world organised in 138 teams. The contestants posted 1,399 submissions. The task involved taking a large sample of EMI’s market research data and using it to develop models that can predict how much customers are going to like a particular artist’s track, by accurately estimating the ratings that they give to them.
There were two main benefits of structuring the hackathon’s task this way. First, it gave us a very quick glimpse into the goldmine of data that EMI has collected through their market research: this task actually used a small sample of all the data they have, and many participants have already asked whether they can use the data, outside of the competition, in their research. Second, it was structured around a well-known and fun machine-learning problem (rating prediction), which allowed participants to quickly apply their expertise to the problem at hand.
The participants included a range of academic staff, research students, machine learning enthusiasts, and employees of tech start-ups; a quick look at the top-10 of the leader board also shows a number of different countries.
I’m particularly happy to have seen so many familiar names on the leader board, including the great folk from MusicMetric, Dell Zhang (who won the London on-site competition), Tamas and a team from UCL Computer Science, and friends from Germany, such as Zeno Gantner, who has developed the MyMediaLite tool.
The machine learning approaches that were used in the competition cover a number of state-of-the-art algorithms. Some of them are becoming familiar names in the context of Kaggle competitions; others gained notoriety and research attention throughout the $1 million Netflix competition. The most actively discussed approaches included:
- Matrix Factorization (& other collaborative filtering approaches)
- Random Forests
- Gradient Boosted Trees (GBTs)
- Restricted Boltzmann Machines (RBMs)
Twenty-four hours is a very short time to tackle data mining problems; the hackers used a mixture of their own coding skills (using languages like R and Python) and freely accessible machine learning tools. Many of the participants are also actively sharing their approaches to the problem on the Kaggle forum. Two tools that were most discussed in the forum were:
MyMediaLite: http://www.ismll.uni-hildesheim.de/mymedialite/
LibFM: http://www.libfm.org/
Although participants were mostly focused on improving their models’ accuracy, a number of insights about the data and its predictability emerged. Some of them reflect well-known aspects of rating-style data (e.g. respondents tend to anchor toward particular values). However, one of the biggest highlights relates to the novelty of the dataset that was released: How much does demographic data count? There were mixed comments about the utility of using demographic and user profile data in order to improve the accuracy of rating prediction. For example, one participant found that a person’s working status is not indicative of their musical preference. Other participants used all the data as features of their models, so finding the answer to this question would still require a bit of research!
Dr. Neal Lathia is a Research Associate in the Networks and Operating Systems Group of Cambridge University’s Computer Laboratory.