Word bursts within BBC search logs
I have been doing a lot of thinking around how Jon Kleinberg's word bursts might be applied to my work with the search logs on the BBC website. I thought, and David Sifry confirms, the maths to do this on large data sets is mind-boggling.
However, we are already very good at picking out bursts of search activity on the BBC site based around search terms, like on the homepage - but they are not the smallest discrete unit. It is the words within the search terms that are the atoms I want to get at. This is really highlighted when I look at the unique number of words used on BBCi Search on any given day.
On Friday March 7th of the 100,000 or so unique words used, the 10th most common was 'war', and the 12th 'iraq'. However when I looked at the overall list of complete search terms I found the most popular search term to make reference to war was robot wars at #45. [I know the U.S. military is proud of its high-tech, but I don't believe they've got to the point yet where robots can do it all for them]. The first explicit mention of the impending war on Iraq in the search logs was at #97 with the search term 'anti war protests', and the one word search term 'war' itself came in at #175 on the day. Clearly BBC users are looking for information about 'war' and 'iraq', but they are also doing it with such a variety of phrases that we can't detect them with our current reporting tools
So what I set out to do was to capture these 'bursts' of words from day-to-day on the service. That in itself isn't hard, my principle of comparing snapshots of the service usage and calculating the differences works fine for this. But there is no context to the individual words. Because it is nearly comic relief day, there were bursts in the use of 'red', 'nose' and 'day' as individual words within search - but on their own, without a human eye over them, they don't logically group themselves together. I wanted to find a way to put them into their context automatically for our editorial team - so they can concentrate on finding the best sites for our users, and not have to second-guess how they are going to look for them.
And that started me thinking again about Maciej Ceglowski and Latent Semantic Indexing, and I then considered what would happen if I treated the search queries in the logfiles not as a set of text strings attempting to access the index, but as a discrete document set in their own right. I'm not yet at the stage of wanting to build a huge multi-dimensional vector model of the search terms, but it pointed me towards what I think will prove to be the key in extracting the context from these words.
Having established the individual words that have gained in popularity, I can then re-examine the searchlogs and pull out the search terms that contain these words, and arrange these into a hierarchy in, I believe, two ways. Firstly I can simply list the most common searches containing each word that appears on the 'burst' list. So far so boring.
What I think is more promising is that by scoring a search term each time it contains one of the 'burst' words, the search phrases that have more of these words in them will naturally float to the top, for example 'war on iraq', 'iraq war', 'saddam iraq war bush' would all score twice as many 'burst points' as simply a search for 'war' and 'iraq'. This should help us identify the phrases that people are using around popular topics which in their own right would not necessarily appear on our statistical reports. And I believe the next dimension is then to score and extract all the words from these searches and repeat the process.
So I've got a rough proof-of-concept chunk of code, and I've got a lot of variables to play with in the algorithm. It could take sometime to get right, and I am concerned that neither my Perl or my mathematics will scale up to meet the challenge, but the early results are promising - it seemed to be identifying the major trends in the use of search over this weekend, from the high profile death of Adam Faith, to the lower profile, but probably longer lasting impact of the Maltese referendum on EU entry.

excellent stuff...
publish that code... after all, we've already paid for it...
-tomski
Interesting idea. One thing you need to be clear about is that a word isn't necessarily a useful semantic unit. In fact, you need to look for complex or multi-word phrases as well, to avoid "britney" being one bursty word and "spears" being another.
This means doing a cleanse on the search queries in advance and working out (from the document base and from the queries users entered) what multi-word terms are relevant for your queries.
I know my old colleagues at Albert did a lot of work in this area; and we investigated this last year for a blog search engine. The chap to speak to is Mike Alderton, mike dot alderton at albert dot com
I should have said that latent semantic is not ideal in this situation as it suffers semantic lag. You'd need to recompute (i.e do the SVD very regularly) too often. you really want something that picks up changes more quickly: full-blown LSI is probably sledgehammer.
Reading this, the first thing it reminded me of was two-pass MPEG encoding. First of all you analyse the video signal to work out what to prioritise in it, and how best to set the quantisation/displacement factors, then you encode based on that. As a result you get a much more efficient, and more attractive looking re-encoded signal.
There may be some philosophy worth borrowing on from that.
(The most readable explanation of this I came across was in 'Interactive Television Demystified' - they should have copies on the 6th floor)
Thanks for the input guys - I aim to make a portable version that I could use on this site which i will certainly publish when done - and comparing it to image compression has sparked off in my head how I can visually represent what I am doing to people who otherwise are looking blankly at me.
However, as you may know, a not unexpected, but certainly disappointing development within BBC New Media - media.guardian.co.uk/newmedia/story/0,7496,912760,00.html - means I rather have my hands full at the moment, and haven't had any time to work on this further. Or anything else much for that matter.
I will get back to this when I get a chance.
*sigh*
Interesting stuff.
I would have thought a data-structure similar to a suffix tree or other pattern matching tree would be appropriate for what you want to do.
Combine that with score weighting on each atom and I think you've got what you want.
The more I think about this, the more I'm convinced that the solution lies in subverting data-compression technology. Finding those patterns is the key to crunching them.