By Ferenc Huszár, Member of Data Science London. Sr. Data Scientist @PeerIndex, PhD in Machine Learning @Cambridge_Uni
The other day I had the chance to chat with Martin Squires, Head of Customer Insight at Boots Ltd., the biggest pharmacy, healthcare and beauty chain in the UK.
Boots has one of the most active loyalty schemes in the UK with over 17 million active and engaged members. Through clever analysis of purchasing behaviour, testing and experimentation, Martin’s group gain very valuable insights into their customers: Who they are; What they buy; Where they buy it; Why they’re buying it, etc.
In one of his presentations to the Demographics User Group I found a clever summary of data analysis as three questions you should ask to make data work: What?, So What? and Now What? I like this approach so much that I decided to write it up for others to use.
- What? (the data)
To derive value from data analysis, the first step is to consider what data you have. In our company, PeerIndex, the primary datasets we analyse are social media interaction data: likes, tweets, retweets, URLs shared and clicked on, mentions, etc. This data is vast and diverse, several terabytes already, but we don’t stop here. There is often value in smaller datasets: custom experiments and tests we run, visitor and search behaviour on our website, user feedback, data from partners, etc. Almost everything that is stored in a database somewhere is potentially useful data, so you should never ignore it. Remember that data does not have to be big to be useful; I am a big fan of small data, actually.
- So What? (insight and analysis)
A lot has been said and written recently about how big data is nearly useless unless you have the methods and expertise to extract useful information from it. Answering this second question requires the strongest analytical skills. Counting your data with Apache Hadoop does not cut it anymore. To gain real insight into your data you have to know methods to visualise it, often via dimensionality reduction or clustering. You have to find the most useful transformations and views of your data and apply appropriate machine learning methods to make predictions.
Insight often boils down to identifying relevant latent variables: variables that cannot be directly measured, but whose value can be inferred from data. In our case these hidden variables are summarised by what we call the influence graph: a graph whose nodes are individuals, and whose weighted edges represent the extent of influence one person exerts on another in the context of a particular topic. The essence of our analysis is figuring out how this unobservable latent structure is related to observable facts in our data: likes, retweets and so on.
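To make the idea of an influence graph a little more concrete, here is a toy sketch in Python. It is not how PeerIndex computes influence: the action weights, the `build_influence_graph` function and the normalisation rule are all assumptions made up for illustration. It only shows the shape of the structure described above, a per-topic directed graph whose edge weights are inferred from observable interactions.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each observed action is taken as
# evidence of influence. These numbers are illustrative assumptions only.
ACTION_WEIGHTS = {"like": 1.0, "mention": 2.0, "retweet": 3.0}


def build_influence_graph(interactions):
    """Build a per-topic weighted directed graph from observed interactions.

    Each interaction is a tuple (actor, author, action, topic), meaning
    `actor` reacted to content written by `author`. The result is
    graph[topic][author][actor]: the share of the actor's weighted activity
    on that topic that was a reaction to this author, used here as a crude
    stand-in for the latent "influence of author on actor".
    """
    raw = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    totals = defaultdict(lambda: defaultdict(float))  # totals[topic][actor]

    for actor, author, action, topic in interactions:
        weight = ACTION_WEIGHTS.get(action, 0.0)
        if weight == 0.0:
            continue  # ignore actions we have no weight for
        raw[topic][author][actor] += weight
        totals[topic][actor] += weight

    # Normalise each edge by the actor's total weighted activity on the topic.
    graph = {}
    for topic, authors in raw.items():
        graph[topic] = {
            author: {
                actor: weight / totals[topic][actor]
                for actor, weight in actors.items()
            }
            for author, actors in authors.items()
        }
    return graph


# Made-up sample data: (actor, author, action, topic)
interactions = [
    ("bob", "alice", "retweet", "ml"),
    ("bob", "alice", "like", "ml"),
    ("bob", "carol", "like", "ml"),
    ("carol", "alice", "mention", "ml"),
]
graph = build_influence_graph(interactions)
print(graph["ml"]["alice"])
```

In a real system the edge weights would come from a proper statistical model rather than hand-picked constants, but the data structure, nodes for people and topic-specific weighted edges between them, would look similar.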
- Now What? (actions that should follow)
Now that we have come up with insight based on data, identified the relevant observable and latent variables, and built predictive models, what can we use them for? Often, this step requires the most creativity. Our influence graph enables optimisations in various contexts. It can power content discovery applications, help you identify people similar to those you follow, enable influencer marketing and word-of-mouth marketing at scale, and find the most influential advocates in a community. Figuring out how this rich structure can best be utilised in various contexts to make critical decisions is an interesting challenge.
I find these three questions very useful to summarise how one makes data work at a company. When designing a new data strategy, you often want to ask these questions in reverse order and work your way backwards: Which of the decisions my company has to make are currently made inefficiently? What insights or analysis would allow us to make those decisions more effectively? What data are we likely to need to complete this analysis? How can we get hold of such data?