What was your background prior to entering this competition?
I am the CEO of Musicmetric, a music analytics company I founded along with two others in 2008. We supply data and insights to the music industry via dashboards and APIs. We track consumer activity on social networks, peer-to-peer networks, blogs and news sources, and combine this with sales, radio airplay and streaming data. My academic background is computational physics.
What made you decide to enter the Music Data Science Hackathon?
I was originally going to enter the recommendation system part of the competition, but I realised I don’t know enough about that to have much chance against the experts that entered! The data visualisation part gave a nice opportunity to explore the consumer data in a different way and to make it more visually accessible, so I went for that.
What approach did you follow to generate the visualization?
The dataset was an anonymised set of questionnaire responses about different artists, such as “do you like this artist” or “use a word to describe the artist”.
The first thing was sorting out the data. The artists were anonymised, so an artist-centric visualisation wouldn't have been very interesting (plotting artist IDs with no context). So I wrote some scripts to transform the data to be word and demographic centric ('words' are the words used by respondents to describe the artists). Once that pre-processing was complete I ran some analysis:
1. Working out the words most used by females compared to males and vice versa: This was a pretty simple script to compare the number of times each word was used by each gender, and rank based on the difference. The results were quite interesting – males tend to use negative words like "boring" and "unoriginal" much more, and females use descriptive words like "beautiful" and "distinctive".
2. Comparing artist rating to the sentiment of the words used to describe it: The respondents (users) to the questionnaire were asked to give the artist a rating between 0 and 100 on how much they liked them, along with a list of words to describe them. I used the Musicmetric Sentiment Analysis API to analyse the sentiment of the words used and plotted these against the user rating of the artist in question. The results are quite a linear fit, which shows that people use negative words to describe artists they don’t like. Pretty obvious – but it helps validate the usefulness of the dataset.
3. Looking at average artist rating based on users' job category: I wanted to see if job type influenced music taste at all, by plotting the average artist rating based on the job category of the person rating the artist. The results show a distribution, with retired people liking the set of artists the most. Since the data was a subset of the full dataset, I'd need to test this on more data to fully validate the significance. But it helped fill some space on the infographic…
4. Clustering the words based on artists linked to them: Because I had transformed the data to be word centric, I created a vector for each word that used the artists as dimensions, with the number of times that word was linked to the artist as the value for each dimension. I then used K-means clustering to see if there were similarities between the words based on their overlapping use to describe artists. It turns out there are, with words such as Beautiful, Sensitive, Passionate and Inspiring forming one cluster and Superficial, Unattractive, Cheap and Unoriginal forming another. I think the hierarchical dendrogram below the cluster plot shows the similarity between the words much better (the closer two words are to each other, the more similar they are), so I put that in too.
5. Finally, because everyone likes a map, I plotted the percentage of respondents who own all the music on EMI based on UK region.
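Step 1 above boils down to counting word use per gender and ranking by the difference. A minimal sketch of that idea in plain Python, using hypothetical data and column layout (the actual competition files and field names may differ):

```python
# Sketch of step 1: ranking descriptive words by gender skew.
# Assumes the questionnaire rows have already been reduced to
# (gender, word) pairs; the data below is a made-up illustration.
from collections import Counter

def gender_word_skew(pairs):
    """Rank words by how much more males used them than females.

    pairs: iterable of (gender, word) tuples, gender in {"M", "F"}.
    Returns a list of (word, male_count - female_count), most
    male-skewed first.
    """
    male = Counter(w for g, w in pairs if g == "M")
    female = Counter(w for g, w in pairs if g == "F")
    words = set(male) | set(female)
    diffs = {w: male[w] - female[w] for w in words}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)

# Toy illustration (not the competition data):
responses = [("M", "boring"), ("M", "boring"), ("M", "unoriginal"),
             ("F", "beautiful"), ("F", "beautiful"), ("F", "boring")]
ranking = gender_word_skew(responses)
```

The head of the ranking gives the most male-skewed words and the tail the most female-skewed ones, which is what the infographic panel plots.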
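The word-vector construction behind step 4 can be sketched as follows. This is a minimal illustration of the representation only: the actual clustering used K-means (and a hierarchical dendrogram), e.g. via scikit-learn or Orange, while here a cosine-similarity check stands in to show why words used for the same artists end up together. All names and data are hypothetical.

```python
# Sketch of step 4: each word becomes a vector with one dimension per
# artist, holding the number of times the word was linked to that artist.
from collections import defaultdict
from math import sqrt

def word_vectors(links, artists):
    """links: iterable of (word, artist_id) pairs.
    Returns {word: [count per artist]}, one dimension per artist."""
    index = {a: i for i, a in enumerate(artists)}
    vecs = defaultdict(lambda: [0] * len(artists))
    for word, artist in links:
        vecs[word][index[artist]] += 1
    return dict(vecs)

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy illustration (not the competition data):
artists = [1, 2, 3]
links = [("beautiful", 1), ("inspiring", 1), ("beautiful", 2),
         ("inspiring", 2), ("cheap", 3)]
vecs = word_vectors(links, artists)
# "beautiful" and "inspiring" describe the same artists, so their
# vectors point the same way; "cheap" describes a disjoint artist.
```

Feeding vectors like these into K-means (or a hierarchical linkage for the dendrogram) groups words that co-occur on the same artists, which is exactly the clustering described above.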
What was your most important insight into the data?
The fact that a lot of the results made sense by inspection helps validate the usefulness of the answers supplied by the users. If they are consistently using negative sentiment words to describe artists they don't like, then it shows they are not giving fake answers. Likewise, the fact that similar words actually do cluster based on the artists they were used to describe shows that the data is useful for building a recommendation system.
The fact that females and males use different words to describe artists could also be taken into account when designing a recommendation system.
Were you surprised by any of your insights?
A lot of the insights look fairly obvious once you see them, but I was surprised that the data came out so cleanly. A lot of datasets don't produce such nice results!
Which data visualization tools and programming language did you use?
Everything was done in Python, but I used Orange Visualisation to explore some of the clustering parameters more efficiently.
What have you taken away from the Music Data Science Hackathon?
From the results of the rec-sys part of the competition, it seemed that simple is better. Segmenting by loads of different parameters, or trying to craft a perfect rule-based model, didn't work as well as throwing the data into an ensemble machine-learning model and letting it run (the method the winner used). That being said, I don't know a huge amount about recommendation systems, so with more time and research that probably isn't the case, plus those models may not be scalable. But 24 hours is quite a short time to build such complexity from scratch…
What did you think of the 24 hour hackathon format?
It was good, and the venue was great too. I’m looking forward to the next one.
Here is my Data Visualization: