A day in the life of BBCi Search - part 6

by Martin Belam, 27 March 2003

The value of a Taxonomy

As I mentioned, BBCi Search has a team constantly monitoring the search activity on the site, and attempting to match the searches being made with the best possible content available, both on the BBCi site and on the web as a whole.

This role is crucial for a site the size of bbc.co.uk, whose index contains more than 2,500,000 documents even before the BBC News and BBC Sport content is counted. It is the only way that the language of the users can be mapped to the taxonomical conventions of the organisation.

One example of this is users searching the BBCi Science site for information on 'planets'. An examination of the search terms used on the site shows that 'planets' is consistently one of the most used. However, the BBCi Science homepage does not feature the word 'planets' at all. The site has plenty of content about the solar system, but it is described as 'solar system', and branded "Space", to tie in with a television programme broadcast some 18 months ago.

The consequence of this is that a search for 'planets' that relies purely on a technological word-matching solution returns as its top results information about "The Blue Planet" television programme - ironically, probably the one planet in the solar system the user was least likely to want information about.

In the absence of search technology with a better semantic understanding of the English language, the only way to align the vocabulary of the site with the vocabulary of users is to intervene, by providing 'best bet' results that originate from a taxonomical mapping of the content of the BBC site. Only a human can look at that search, within that context, and decide that it equates to an individual piece of web content that the search technology would otherwise fail to return.
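The 'best bet' idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how BBCi Search was built: the BEST_BETS table, the example.org URLs and the keyword_search callable are all hypothetical stand-ins.

```python
# Hypothetical editor-maintained table mapping user vocabulary to content.
# The entry and URL are illustrative placeholders, not real BBC data.
BEST_BETS = {
    "planets": {
        "title": "Solar system",
        "url": "https://example.org/science/space/solarsystem/",
    },
}


def normalise(query):
    """Lower-case the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def search(query, keyword_search):
    """Return the editorial best bet (if any) ahead of the keyword results.

    keyword_search is an assumed callable standing in for the
    word-matching search engine; it takes the query and returns a list.
    """
    results = []
    best_bet = BEST_BETS.get(normalise(query))
    if best_bet is not None:
        results.append(best_bet)
    results.extend(keyword_search(query))
    return results
```

With a table like this, a search for 'planets' would lead with the solar system page chosen by an editor, and the word-matched results, "The Blue Planet" among them, would follow it.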

Another strong advantage of this system is the ability of the editorial team and taxonomists to assign new synonyms, add best bet URLs, or change descriptions in real time, in response to the actual recorded user behaviour.

A recent example of this was the loss of the NASA Space Shuttle Columbia. The BBCi Search results pages include a news headline feed whenever the query produces results from the BBC News or BBC Sport site that have been published within the last three days and cross a specific relevancy threshold. This worked fine if users were searching for "space shuttle" or "Columbia".
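The rule for the headline feed can be sketched as a simple filter. The three-day window comes from the description above; the threshold value, the shape of each hit and the newest-first ordering are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=3)   # "within the last three days", as described
RELEVANCY_THRESHOLD = 0.8     # assumed value; the real threshold is not stated


def headline_feed(hits, now):
    """Keep news hits that are recent and relevant enough, newest first.

    Each hit is assumed to be a dict with a 'published' datetime and a
    numeric 'score'; both field names are hypothetical.
    """
    eligible = [
        hit for hit in hits
        if now - hit["published"] <= MAX_AGE
        and hit["score"] >= RELEVANCY_THRESHOLD
    ]
    # Ordering by publication time is an assumption, not stated in the text.
    return sorted(eligible, key=lambda hit: hit["published"], reverse=True)
```

If the filter returns an empty list, the results page would simply show no headline feed for that query.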

However, we also saw, within three hours of the accident, that there had been a considerable rise in searches for the country "Colombia". Whilst it was conceivable that there was a simultaneous breaking news story about Colombia, it was obvious that these were searches aimed at finding information about the space shuttle from users who were unaware of the correct name.

The result set they received was about the country, and did not produce any headlines about the space shuttle. Through the use of synonyming we were able to provide a result set that contained links to the latest news stories about the shuttle, even when people were unintentionally searching for the country. Again this is something that would be impossible with a reliance on technology alone.
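Synonyming of this kind might be sketched as an editable lookup that widens a query before it reaches the news index. The SYNONYMS table and both function names are hypothetical; the sketch only shows how an editor could add a mapping while the site is live.

```python
# Hypothetical synonym table; editors can change it while the site is live.
SYNONYMS = {}


def add_synonym(term, synonym):
    """Record that searches for `term` should also match `synonym`."""
    entries = SYNONYMS.setdefault(term.lower(), [])
    if synonym.lower() not in entries:
        entries.append(synonym.lower())


def expand_query(query):
    """Return the normalised query followed by any editor-assigned synonyms."""
    key = " ".join(query.lower().split())
    terms = [key]
    for synonym in SYNONYMS.get(key, []):
        if synonym not in terms:
            terms.append(synonym)
    return terms


# An editor reacting to the rise in searches for the country:
add_synonym("Colombia", "Columbia")
add_synonym("Colombia", "space shuttle")
```

After those two calls, a search for "Colombia" would also be run against "columbia" and "space shuttle", so the shuttle headlines could appear alongside the results about the country.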
