Word bursts within BBC search logs
I have been thinking a lot about how Jon Kleinberg's word bursts might be applied to my work with the search logs on the BBC website. I thought, and David Sifry confirms, that the maths to do this on large data sets is mind-boggling.
However, we are already very good at picking out bursts of search activity on the BBC site based around search terms, like on the homepage - but search terms are not the smallest discrete unit. It is the words within the search terms that are the atoms I want to get at. This is really highlighted when I look at the number of unique words used on BBCi Search on any given day.
On Friday March 7th, of the 100,000 or so unique words used, the 10th most common was 'war', and the 12th 'iraq'. However, when I looked at the overall list of complete search terms I found that the most popular search term to make reference to war was 'robot wars' at #45. [I know the U.S. military is proud of its high-tech, but I don't believe they've got to the point yet where robots can do it all for them]. The first explicit mention of the impending war on Iraq in the search logs was at #97 with the search term 'anti war protests', and the one-word search term 'war' itself came in at #175 on the day. Clearly BBC users are looking for information about 'war' and 'iraq', but they are doing it with such a variety of phrases that we can't detect them with our current reporting tools.
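To make the difference between the two kinds of ranking concrete, here is a minimal Python sketch of counting complete search terms alongside the individual words inside them. It is not the code running on the BBC logs: the function name, the lower-casing and whitespace normalisation, and the sample queries with their frequencies are all assumptions made up for illustration.

```python
from collections import Counter

def count_terms_and_words(queries):
    """Return (term_counts, word_counts) for a list of raw search queries."""
    term_counts = Counter()
    word_counts = Counter()
    for query in queries:
        # Assumed normalisation: lower-case and collapse runs of whitespace.
        term = " ".join(query.lower().split())
        if not term:
            continue
        term_counts[term] += 1
        # Every word in the term counts once per search that used it.
        word_counts.update(term.split())
    return term_counts, word_counts

# Invented sample; the real logs hold vastly more queries.
sample = [
    "robot wars", "anti war protests", "war on iraq",
    "iraq war", "war", "eastenders", "eastenders",
]
terms, words = count_terms_and_words(sample)
print(terms.most_common(3))
print(words.most_common(3))
```

In this toy sample 'war' is the most common word even though no single search term containing it is the most common term, which is the same effect described above.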
So what I set out to do was to capture these 'bursts' of words from day to day on the service. That in itself isn't hard: my principle of comparing snapshots of the service usage and calculating the differences works fine for this. But there is no context to the individual words. Because it is nearly Comic Relief day, there were bursts in the use of 'red', 'nose' and 'day' as individual words within search - but on their own, without a human eye over them, they don't logically group themselves together. I wanted to find a way to put them into their context automatically for our editorial team - so they can concentrate on finding the best sites for our users, and not have to second-guess how they are going to look for them.
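The snapshot comparison can be sketched roughly as follows. This is only one possible way of "calculating the differences": the ratio test, the add-one smoothing, the `min_count` and `min_ratio` thresholds and the counts in the example are all my assumptions for illustration, not the values used on the real service.

```python
from collections import Counter

def burst_words(yesterday, today, min_count=3, min_ratio=2.0):
    """Return words whose use has jumped between two daily snapshots.

    A word qualifies when today's count is at least min_count and the
    ratio of today's count to (yesterday's count + 1) is at least
    min_ratio. Results are ordered with the biggest jump first.
    """
    bursts = {}
    for word, count in today.items():
        if count < min_count:
            continue  # ignore words too rare to be meaningful
        previous = yesterday.get(word, 0)
        # Adding 1 avoids dividing by zero for words that are brand new.
        ratio = count / (previous + 1)
        if ratio >= min_ratio:
            bursts[word] = ratio
    return sorted(bursts, key=bursts.get, reverse=True)

# Invented counts for two consecutive days.
yesterday = Counter({"news": 50, "red": 2, "nose": 1, "day": 10})
today = Counter({"news": 52, "red": 30, "nose": 28, "day": 45})
bursting = burst_words(yesterday, today)
print(bursting)
```

A steady word like 'news' drops out because its count barely moves, while 'red', 'nose' and 'day' all come through, though still as three unrelated entries with nothing to tie them together.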
And that started me thinking again about Maciej Ceglowski and Latent Semantic Indexing, and I then considered what would happen if I treated the search queries in the logfiles not as a set of text strings attempting to access the index, but as a discrete document set in their own right. I'm not yet at the stage of wanting to build a huge multi-dimensional vector model of the search terms, but it pointed me towards what I think will prove to be the key in extracting the context from these words.
Having established the individual words that have gained in popularity, I can then re-examine the search logs, pull out the search terms that contain these words, and arrange them into a hierarchy in, I believe, two ways. Firstly, I can simply list the most common searches containing each word that appears on the 'burst' list. So far so boring.
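The first, boring approach might look something like this in Python. The function name, the `limit` parameter and the sample counts are hypothetical, and matching is on whole words only, so 'robot wars' is not picked up for 'war'; whether that is the right behaviour is one of the things still to decide.

```python
from collections import Counter, defaultdict

def top_searches_per_burst_word(term_counts, bursts, limit=3):
    """Map each burst word to its most common containing search terms."""
    burst_set = set(bursts)
    by_word = defaultdict(Counter)
    for term, count in term_counts.items():
        # Whole-word match: only exact words in the term are considered.
        for word in set(term.split()) & burst_set:
            by_word[word][term] += count
    return {word: by_word[word].most_common(limit) for word in bursts}

# Invented search-term counts for one day.
term_counts = {
    "war on iraq": 40,
    "iraq war": 25,
    "robot wars": 60,
    "war": 30,
    "iraq map": 10,
}
listing = top_searches_per_burst_word(term_counts, ["war", "iraq"])
for word, searches in listing.items():
    print(word, searches)
```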
What I think is more promising is that by scoring a search term each time it contains one of the 'burst' words, the search phrases that have more of these words in them will naturally float to the top. For example, 'war on iraq', 'iraq war' and 'saddam iraq war bush' would all score twice as many 'burst points' as a search for simply 'war' or 'iraq'. This should help us identify the phrases that people are using around popular topics which in their own right would not necessarily appear on our statistical reports. And I believe the next dimension is then to score and extract all the words from these searches and repeat the process.
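As a rough sketch of the 'burst points' idea in Python: each search term earns one point for every distinct burst word it contains. Breaking ties by how often the term was searched is my own assumption, as are the function name and the sample counts; the second pass of extracting words from the top phrases is not shown.

```python
def score_terms(term_counts, bursts):
    """Rank search terms by how many distinct burst words they contain.

    Returns (term, points) pairs, highest points first. Ties are broken
    by search frequency, which is an assumption made for this sketch.
    """
    burst_set = set(bursts)
    scores = {}
    for term in term_counts:
        # One point per distinct burst word appearing in the term.
        points = len(set(term.split()) & burst_set)
        if points:
            scores[term] = points
    return sorted(
        scores.items(),
        key=lambda item: (-item[1], -term_counts[item[0]]),
    )

# Invented search-term counts for one day.
term_counts = {
    "war on iraq": 40,
    "iraq war": 25,
    "saddam iraq war bush": 5,
    "war": 30,
    "iraq": 20,
    "robot wars": 60,
}
ranked = score_terms(term_counts, ["war", "iraq"])
for term, points in ranked:
    print(points, term)
```

Note that 'saddam iraq war bush' only scores two points here because 'saddam' and 'bush' are not on the burst list in this example; if they were, it would score four and rise above the others.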
So I've got a rough proof-of-concept chunk of code, and I've got a lot of variables to play with in the algorithm. It could take some time to get right, and I am concerned that neither my Perl nor my mathematics will scale up to meet the challenge, but the early results are promising - it seems to be identifying the major trends in the use of search over this weekend, from the high-profile death of Adam Faith to the lower-profile, but probably longer-lasting, impact of the Maltese referendum on EU entry.