Identifying real names within BBC Search terms
...or why Martin Luther King is my nemesis
I've been working on a way to identify the people or 'characters' that BBCi Search users are looking for. It is in rough beta at the moment, but I am hoping we can eventually turn it into a 'People in the news' or 'People you are looking for' feature on the BBCi site.
It works by first identifying all of the searches on a given day that consist of exactly two words. If the first of those two words matches an entry in a list of around 5,000 first names*, the script treats that search as a person / character.
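To make that first pass concrete, here is a minimal Python sketch of the two-word test. It is an illustration rather than the actual script: the function name `find_candidate_names` and the tiny `FIRST_NAMES` set are assumptions standing in for the real list of around 5,000 first names.

```python
# Hypothetical stand-in for the real list of around 5,000 first names.
FIRST_NAMES = {"robin", "clare", "tony", "kylie"}


def find_candidate_names(searches, first_names=FIRST_NAMES):
    """Return the two-word searches whose first word is a known first name."""
    candidates = set()
    for search in searches:
        # Lower-case and split on whitespace so "Robin  Cook" and "robin cook" match.
        words = search.lower().split()
        if len(words) == 2 and words[0] in first_names:
            candidates.add(" ".join(words))
    return candidates


sample = [
    "Robin Cook",
    "clare short",
    "weather",
    "robin cook and clare short oppose tony blair",
]
print(sorted(find_candidate_names(sample)))
```

Note that the longer search string is ignored at this stage, because only two-word searches are used to discover names.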
Once it has identified that "robin cook" and "clare short" are names, it goes back through the entire set of search logfiles for that day and looks for any instance where "robin cook" or "clare short" appear as patterns. This works great because it means that if the search string is "robin cook and clare short oppose tony blair" the script will identify that this was an enquiry about both "robin cook" and "clare short".
And for that matter "tony blair".
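The second pass could be sketched roughly as follows. This assumes the names are matched as whole-word phrases with a regular expression and that each search counts at most once per name; the post does not say exactly how the pattern matching is done, so both of those choices, along with the name `count_name_mentions`, are assumptions.

```python
import re
from collections import Counter


def count_name_mentions(searches, names):
    """Count, for each name, how many searches contain it as a whole-word phrase."""
    # \b keeps "robin cook" from matching inside "robin cookery".
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b") for name in names
    }
    counts = Counter()
    for search in searches:
        # Normalise case and whitespace before matching.
        text = " ".join(search.lower().split())
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


names = {"robin cook", "clare short", "tony blair"}
log = [
    "robin cook",
    "robin cook and clare short oppose tony blair",
    "robin cookery",
]
print(count_name_mentions(log, names))
```

Here "tony blair" only gets counted in the long search string because it is assumed to have turned up as a two-word search elsewhere in the day's logs.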
But it falls down badly in a few ways.
Firstly, it incorrectly identifies as human some names which I know are companies, not people or characters - "john lewis" & "thomas cook". It also pairs legitimate first names with a word that is not a surname - "kylie pictures" or "britney videos". Or it identifies something like "rose bush", which is made up of a legitimate forename and a legitimate surname, but is not a person.
I'm trying to combat this by having a 'blocked' list, which tells the script that "davis cup" and "daisy cutter" might meet the criteria for a name, but are not the real thing. But this relies on me checking every day. And I've been editing the allowed list of forenames whenever I have seen stuff that is *so* unlikely - (anyone reading this with the first names "chancellor" or "valencia"?).
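Both of those manual fixes amount to simple set filtering. A minimal sketch, using only the examples mentioned above as data; the names `BLOCKED_NAMES`, `REJECTED_FORENAMES`, `clean_forenames` and `remove_blocked` are hypothetical:

```python
# Two-word strings that pass the name test but are not people.
BLOCKED_NAMES = {"davis cup", "daisy cutter"}

# Entries pruned from the forename list because they are so unlikely.
REJECTED_FORENAMES = {"chancellor", "valencia"}


def clean_forenames(first_names, rejected=REJECTED_FORENAMES):
    """Return the forename list with the hand-rejected entries removed."""
    return set(first_names) - set(rejected)


def remove_blocked(candidates, blocked=BLOCKED_NAMES):
    """Drop candidate 'names' that appear on the hand-maintained blocked list."""
    return {name for name in candidates if name not in blocked}


forenames = clean_forenames({"robin", "davis", "chancellor", "valencia"})
print(sorted(forenames))
print(sorted(remove_blocked({"robin cook", "davis cup", "daisy cutter"})))
```

The weakness is the one already described: both sets only grow when somebody checks the output by hand.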
Having identified this list of 'names' it is pretty simple to produce a chart of popular people**. Currently the report displays a top ten 'rising stars' and a top ten 'fading fast', by comparing the 'names' yesterday, with those of the day before. It also publishes a top 100, but behind the scenes I get to see a list of 250 each day, so hopefully I can intercept the worst examples before they hit the internal BBC consciousness.
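As a rough illustration of the two charts, the sketch below ranks names by the raw change in search count between the two days. That metric is an assumption, since the comparison could just as easily be based on rank or percentage change, and the function name `movers` and the sample figures are made up for the example.

```python
def movers(yesterday, day_before, top_n=10):
    """Return (rising, fading) lists of names ranked by change in search count."""
    names = set(yesterday) | set(day_before)
    # A name missing from one day's counts is treated as zero searches.
    change = {
        name: yesterday.get(name, 0) - day_before.get(name, 0) for name in names
    }
    # Ties are broken alphabetically so the output is stable.
    rising = sorted(
        (n for n in names if change[n] > 0), key=lambda n: (-change[n], n)
    )[:top_n]
    fading = sorted(
        (n for n in names if change[n] < 0), key=lambda n: (change[n], n)
    )[:top_n]
    return rising, fading


# Invented sample counts, purely for illustration.
yesterday = {"robin cook": 50, "clare short": 30, "tony blair": 80}
day_before = {"robin cook": 10, "clare short": 45, "tony blair": 80, "george best": 5}
rising, fading = movers(yesterday, day_before)
print("rising stars:", rising)
print("fading fast:", fading)
```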
I have some known issues with names that spread across three words - "osama bin laden", "george w bush", and "denise van outen" all fail the two-word name test. And "kylie" and "madonna" fail because, as far as the script is concerned, they don't have a consistent surname. I have a patch in mind for this.
But nothing has vexed me as much as Martin Luther, German Protestant, and Martin Luther King, American civil rights activist.
Myself, as the bastard offspring of a Catholic/Protestant upbringing, and a believer in direct civil action, I have a great deal of respect for both of them. But they break my script. It identifies "martin luther" as a legitimate name, and then it identifies "luther king" as a legitimate name. So a search for "martin luther king jr" counts for both of them. But there is no score for "martin luther king" as an entity in his own right. This has given me a dilemma. I can easily put in something that sniffs out this problem and corrects it - but is it an isolated case? If I start adjusting the script for this one problem, how many more are there waiting to turn up and skew the results?
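One way to gauge whether this is an isolated case would be to scan the day's list of names for other pairs that chain together in the same way, where the second word of one name is the first word of another. The sketch below is only a suggestion, not part of the script; `find_chained_names` is a hypothetical helper and it assumes every name in the list is exactly two words.

```python
def find_chained_names(names):
    """Return three-word strings formed when one name's second word starts another name."""
    # Index the names by their first word for quick lookup.
    by_first_word = {}
    for name in names:
        first, _ = name.split()
        by_first_word.setdefault(first, []).append(name)

    chains = set()
    for name in names:
        first, last = name.split()
        # e.g. "martin luther" + "luther king" gives "martin luther king".
        for follower in by_first_word.get(last, []):
            chains.add(first + " " + follower)
    return chains


names = {"martin luther", "luther king", "robin cook", "clare short"}
print(sorted(find_chained_names(names)))
```

Any three-word string this turns up could then be checked against the raw logs to see whether people really search for it as a single phrase.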
My only option so far has been to expunge them both from the list of names - which is deeply unsatisfying. Any thoughts welcome.
*One unexpected side effect of this was I had to look at a lot of baby naming sites - and I was pretty knocked out by some of the names that were suggested for children...
**People searching for your name and being popular are not necessarily the same thing!