Identifying real names within BBC Search terms

by Martin Belam, 18 March 2003

...or why Martin Luther King is my nemesis

I've been working on a way to identify the people or 'characters' that BBCi Search users are looking for. It is in rough beta at the moment, but I am hoping we can eventually turn it into a 'People in the news' or 'People you are looking for' feature on the BBCi site.

Screenshot of the BBCi Search names report for 17th March 2003

It works by first identifying all of the searches on a given day that consist of exactly two words. If the first of those two words matches a list of around 5,000 first names*, the script treats that search as a person / character.
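As a rough illustration of that first pass, here is a minimal Python sketch. It is not the actual script: the tiny `FIRST_NAMES` set stands in for the real list of around 5,000 first names, and the function name `candidate_names` is made up for this example.

```python
from collections import Counter

# Hypothetical stand-in for the real list of around 5,000 first names.
FIRST_NAMES = {"robin", "clare", "tony", "john", "rose"}


def candidate_names(queries, first_names=FIRST_NAMES):
    """Count two-word queries whose first word is a known first name."""
    candidates = Counter()
    for query in queries:
        words = query.lower().split()
        if len(words) == 2 and words[0] in first_names:
            candidates[" ".join(words)] += 1
    return candidates


sample = ["Robin Cook", "clare short", "weather london", "robin cook", "eastenders"]
found = candidate_names(sample)
print(found.most_common())  # [('robin cook', 2), ('clare short', 1)]
```

Lower-casing and re-joining the words means "Robin Cook" and "robin  cook" are counted as the same candidate.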

Once it has identified that "robin cook" and "clare short" are names, it goes back through the entire set of search logfiles for that day and looks for any instance where "robin cook" or "clare short" appear as patterns. This works great because it means that if the search string is "robin cook and clare short oppose tony blair" the script will identify that this was an enquiry about both "robin cook" and "clare short".

And for that matter "tony blair".
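The second pass over the logs might look something like the sketch below. It assumes the day's searches are already available as a list of strings and that the names come from the two-word pass; `count_mentions` is a hypothetical name, and matching on word boundaries is my assumption about what "appear as patterns" should mean.

```python
import re


def count_mentions(queries, names):
    """For each name, count the queries that contain it as a whole phrase."""
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b") for name in names
    }
    counts = {name: 0 for name in names}
    for query in queries:
        text = " ".join(query.lower().split())  # normalise case and spacing
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


queries = [
    "robin cook and clare short oppose tony blair",
    "Robin Cook resigns",
    "tony blair",
    "robin cooking tips",
]
names = ["robin cook", "clare short", "tony blair"]
print(count_mentions(queries, names))
# {'robin cook': 2, 'clare short': 1, 'tony blair': 2}
```

The word-boundary check is what stops "robin cooking tips" from counting as a search for "robin cook".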

But it falls down badly in a few ways.

Firstly, it incorrectly identifies as human some names which I know are companies, not people or characters - "john lewis" and "thomas cook". It also pairs a legitimate first name with a word that is not a surname - "kylie pictures" or "britney videos". Or it identifies something like "rose bush", which combines a legitimate forename with a legitimate surname, but is not a person.

I'm trying to combat this by having a 'blocked' list, which tells the script that "davis cup" and "daisy cutter" might meet the criteria for a name, but are not the real thing. But this relies on me checking every day. I've also been editing the allowed list of forenames whenever I have seen stuff that is *so* unlikely (anyone reading this with the first names "chancellor" or "valencia"?).
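In code terms the 'blocked' list is just a filter applied to the candidates before anything is counted or published. A minimal sketch, assuming the candidates are held as a name-to-count mapping; the `BLOCKED` set and `drop_blocked` are illustrative names, not the real ones.

```python
# Hand-maintained list of phrases that pass the name test but are not people.
BLOCKED = {"davis cup", "daisy cutter", "john lewis", "thomas cook", "rose bush"}


def drop_blocked(candidates, blocked=BLOCKED):
    """Return the candidates with every blocked phrase removed."""
    return {name: count for name, count in candidates.items() if name not in blocked}


candidates = {"robin cook": 40, "davis cup": 25, "clare short": 31, "john lewis": 18}
print(drop_blocked(candidates))  # {'robin cook': 40, 'clare short': 31}
```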

Having identified this list of 'names' it is pretty simple to produce a chart of popular people**. Currently the report displays a top ten 'rising stars' and a top ten 'fading fast', by comparing the 'names' yesterday, with those of the day before. It also publishes a top 100, but behind the scenes I get to see a list of 250 each day, so hopefully I can intercept the worst examples before they hit the internal BBC consciousness.
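One possible way to produce the 'rising stars' and 'fading fast' lists is sketched below. The post does not say how the two days are compared, so ranking by the raw change in search count is an assumption here, as are the function name `movers` and the sample figures.

```python
def movers(yesterday, day_before, top_n=10):
    """Rank names by change in search count between two days.

    Returns (rising, fading): names with the biggest gains and the
    biggest losses. Ties are broken alphabetically.
    """
    names = set(yesterday) | set(day_before)
    change = {n: yesterday.get(n, 0) - day_before.get(n, 0) for n in names}
    rising = sorted((n for n in names if change[n] > 0),
                    key=lambda n: (-change[n], n))[:top_n]
    fading = sorted((n for n in names if change[n] < 0),
                    key=lambda n: (change[n], n))[:top_n]
    return rising, fading


# Made-up counts, purely for illustration.
yesterday = {"robin cook": 120, "clare short": 80, "tony blair": 200}
day_before = {"robin cook": 20, "clare short": 90, "tony blair": 210,
              "david beckham": 50}
rising, fading = movers(yesterday, day_before)
print(rising)  # ['robin cook']
print(fading)  # ['david beckham', 'clare short', 'tony blair']
```

A name that was searched for the day before but not at all yesterday counts as a fall to zero, which is why "david beckham" tops the fading list in this example.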

I have some known issues with names that spread across three words - "osama bin laden", "george w bush", and "denise van outen" - all fail the two-word name test. And "kylie" and "madonna" fail because as far as the script is concerned they don't have a consistent surname. I have a patch in mind for this.

But nothing has vexed me as much as Martin Luther, German Protestant, and Martin Luther King, American civil rights activist.

Myself, as the bastard offspring of a Catholic/Protestant upbringing, and a believer in direct civil action, I have a great deal of respect for both of them. But they break my script. It identifies "martin luther" as a legitimate name, and then it identifies "luther king" as a legitimate name. So a search for "martin luther king jr" counts for both of them. But there is no score for "martin luther king" as an entity in his own right. This has given me a dilemma. I can easily put in something that sniffs out this problem and corrects it - but is it an isolated case? If I start adjusting the script for this one problem, how many more are there waiting to turn up that will skew the results?
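For what it's worth, one way to 'sniff out' this kind of overlap is a longest-match rule: if several known names match a search, keep only those that are not contained inside a longer matching name. The sketch below assumes "martin luther king" has been added to the name list by hand; `longest_matches` is a hypothetical helper, not part of the existing script.

```python
def longest_matches(query, names):
    """Return the matching names that are not part of a longer matching name."""
    text = " " + " ".join(query.lower().split()) + " "
    hits = [name for name in names if " " + name + " " in text]
    return [
        hit for hit in hits
        if not any(hit != other and " " + hit + " " in " " + other + " "
                   for other in hits)
    ]


names = ["martin luther", "luther king", "martin luther king"]
print(longest_matches("martin luther king jr", names))      # ['martin luther king']
print(longest_matches("martin luther reformation", names))  # ['martin luther']
```

It has an obvious blind spot: a search that mentions both men, such as "martin luther and martin luther king", would only score the longer name.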

My only option so far has been to expunge them both from the list of names - which is deeply unsatisfying. Any thoughts welcome.

*One unexpected side effect of this was I had to look at a lot of baby naming sites - and I was pretty knocked out by some of the names that were suggested for children...

**People searching for your name and being popular are not necessarily the same thing!


What about Mr Clutch?

Some slightly more serious thoughts.

1) A controlled vocabulary of company names would be no bad thing in any case, and probably better acquired wholesale than by you editing by hand

2) re: kylie pictures and rose bush - shouldn't you try to find some way to use the actual data being searched, rather than the query strings, to answer these contextual questions for you? can just see the problem growing and growing otherwise.

3) surely you shouldn't be stopping at two - shouldn't it be a recursive/iterative (dunno, I did maths, me) process that will work for any length of pattern? That way you can pick up james t kirk and norman st john stevas

4) you should probably add a list of honorifics to the first names - that way you can pick up lord reith and mr burns
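Points 3 and 4 could be sketched roughly as follows, in Python. This is only one reading of the suggestions: the word sets are tiny made-up samples, the cap of four words is an arbitrary assumption, and `looks_like_name` is a hypothetical function.

```python
# Tiny illustrative samples; real lists would be far longer.
FIRST_NAMES = {"james", "norman", "robin"}
HONORIFICS = {"mr", "mrs", "lord", "sir", "dr"}
KNOWN_OPENERS = FIRST_NAMES | HONORIFICS


def looks_like_name(query, max_words=4):
    """True if the query has 2..max_words words and opens with a first name or title."""
    words = query.lower().split()
    return 2 <= len(words) <= max_words and words[0] in KNOWN_OPENERS


for q in ["lord reith", "mr burns", "james t kirk", "norman st john stevas",
          "weather london", "robin"]:
    print(q, looks_like_name(q))
```

Loosening the word count this way would also let through more false positives of the "kylie pictures" kind, so it would lean even harder on a blocked list.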

It pretty soon became apparent that the way I have gone about this is not scalable or practicable to maintain. However, for the purposes of what I want to get, a distributable list of maybe 5 or 10 names that have pricked the public consciousness in the last 24 hours, it is probably good enough.

I've been working on the three name list and, to be honest, the number of three-word names that are ever likely to generate enough searches to appear on this list comes to about 25.

The 'power curve' of searches [something Clay Shirky inadvertently got me obsessed with] is such that by the time I am looking at the 100th most popular 'name', there have only been around 30 searches for them - and by the time I get to #250 on the list there will be barely 10 people searching for them on a given day. So if it starts making mistakes in that area it is really going to have no impact.

Interesting thought about honorific titles. I'm currently regretting the fact that my old 'Ladybird Book of Kings and Queens of England (and Scotland)' isn't sitting on our work bookshelf next to the O'Reilly publications...

Lingua::EN::NamedEntity may have been helpful here - maybe I'll throw that at Chipwrapper
