“Search Within A Search” article in ei magazine

 by Martin Belam, 21 July 2004

I have an article in print in this month’s ei magazine - “Search within a search”. I didn’t write my own blurb, which makes the whole thing sound like an exciting adventure with the BBC’s only producer....

“It was what the brains at the BBC’s new media and technology department least expected when they started a routine statistical measure. While measuring search usage, they discovered a way to decipher search queries from a two-million-page website to help provide better, more relevant answers. Martin Belam, departmental development producer, explains the process.”

Here is the full text of the article...

“Search Within A Search”

About two years ago, I found myself in an enviable position while tackling a project in the central new media department at the BBC. We wanted to integrate existing systems into a new global search that could be launched from any page on the site. We planned to measure how people were using the search service on the corporation’s website by examining log files.

But while examining those files, I realised we were collecting a wealth of data about how the search service was being used in real life. This was of great value because of the editorial proposition behind the BBC’s search service – to recommend family-friendly and UK-focussed links. For this purpose, we built a directory of over 10,000 recommended links relevant to the query. These links were returned at the top of the results set.

So, throwing aside our best-laid plans, we decided to exploit this information to streamline the search function and make it more intuitive.

Hidden information

The data in our search logs carries a lot more information than just the search term used. It includes the timestamp of when a search was performed, the number of ‘best links’ we returned from our directory for that term, and an indication of which ‘scope’ and which ‘tab’ were selected at the time of the search. We also pass through a unique ID number for each user with a cookie on their machine. Additionally, there is a ‘go’ value that allows us to tag specific instances of the search box. A typical line of log file would look something like this:

cgiperl3 23498 101. 4, 0:27:13 GMT "train tickets to york" r=2 t=0.02 u=d367aea9abeccaf55c595f3c1e3e8355 s= t=www g=homepage f=p

This log line indicates that a user with the unique cookie ID string "d367aea9abeccaf55c595f3c1e3e8355" searched for train tickets to York at 12:27am. Their search was processed on the machine "cgiperl3", which took 0.02 seconds to respond to the request (t=0.02). We returned two best links from our directory (r=2); the request was made via the bbc.co.uk homepage (g=homepage), and the user was searching using the web tab (t=www). As the s= value is null, we know the search was carried out over the default scope, which is an index of all bbc.co.uk content excluding news articles.
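Parsing lines in this format is a small exercise in pattern matching. The sketch below, in Python, pulls the quoted search term and the key=value fields out of the example line above; the parsing rules are my own illustrative assumptions, not the production code. Note that ‘t’ occurs twice in the line (response time, then tab), so each key keeps a list of values.

```python
import re

def parse_log_line(line):
    """Parse one search log line into its fields (illustrative sketch)."""
    # Pull out the quoted search term first.
    term_match = re.search(r'"([^"]*)"', line)
    fields = {"term": term_match.group(1) if term_match else ""}
    # The rest of the line carries key=value pairs; 't' appears twice
    # (response time, then tab), so each key maps to a list of values.
    for key, value in re.findall(r"(\w+)=(\S*)", line):
        fields.setdefault(key, []).append(value)
    return fields

line = ('cgiperl3 23498 101. 4, 0:27:13 GMT "train tickets to york" '
        'r=2 t=0.02 u=d367aea9abeccaf55c595f3c1e3e8355 s= t=www g=homepage f=p')
record = parse_log_line(line)
# record["term"] -> "train tickets to york", record["r"] -> ["2"]
```

Once each line is reduced to a dictionary like this, every report described below becomes a simple filter-and-count over the records.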

Within the systems I developed, we apply three pre-processing steps to the log files before extracting any data from them.

  • We reduce the white space within the search queries by trimming any leading or trailing white space from the search terms. We also compress runs of multiple white-space characters into a single space. Within a search term, multiple white spaces carry no semantic meaning and are usually present due to erroneous typing by the audience;
  • We usually remove any Boolean operators from the search strings. For example, a search for ‘top of the pops fearne’ has the same meaning as ‘top of the pops + fearne’. (Naturally, if we were actually attempting to examine the usage of these advanced types of search features we would omit this step);
  • We adjust all of the search terms to consist of all lower-case characters. For the purposes of understanding the intentions of a search-engine user, the search strings ‘eastenders’, ‘Eastenders’, ‘EASTENDERS’ and the correctly branded ‘EastEnders’ are all equivalent. The important factor is finding the best way to aggregate the data meaningfully. Forcing all of the search terms to be lower case groups together terms where the only difference is case sensitivity.
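The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming ‘+’ stands in for the Boolean operators to be stripped; a fuller version would handle other operators too:

```python
import re

def normalise(term):
    """Apply the three pre-processing steps to a raw search term."""
    term = term.strip()                    # 1a. trim leading/trailing white space
    term = re.sub(r"\s+", " ", term)       # 1b. collapse runs of white space
    term = re.sub(r"\s*\+\s*", " ", term)  # 2.  drop the '+' Boolean operator
    return term.lower()                    # 3.  force lower case

print(normalise("  top of the pops + Fearne "))  # top of the pops fearne
print(normalise("EastEnders"))                   # eastenders
```

After normalisation, ‘top of the pops + fearne’, ‘Top of the Pops Fearne’ and their messier variants all aggregate under a single key.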

Search across bbc.co.uk

Considerable resources have gone into developing a unified search across bbc.co.uk in recent years. It’s also been a difficult proposition because it needs to deal intelligently with the 43 languages of the BBC’s World Service output, and a large part of the site is made up of news articles from BBC News Online. These need careful handling, as a search for a topical news story needs to return the most recently filed story on a topic, not necessarily the one a computer algorithm thinks is most relevant. Additionally, because the breadth of content is so great, a word can have many different contexts. A search for ‘china’ on the BBC News site has a very different semantic intention to a search for ‘china’ when a user is visiting bbc.co.uk/antiques.

To provide a consistent user experience, every page on the bbc.co.uk website features a global navigation toolbar at the top, which includes a search input box. Normally, a search performed from this box will search over a narrow ‘scoped’ subset of the available index data. The user can then activate tabs that select a wider dataset to perform the search. Generally, there will be a site-specific scope tab on the left related to the top-level domain of the page. For example, if the URL starts www.bbc.co.uk/gardening, then ‘Results from gardening’ will appear first, or ‘Results from Radio 3’ will appear if the URL starts www.bbc.co.uk/radio3. The pages on these types of results will be taken from these ‘scoped’ indexes.

Selecting the next tab across gives the user ‘Results from the BBC’. This takes as its dataset all of the pages indexed within the bbc.co.uk site, excluding BBC News Online content. The next tab along, ‘Results from BBC News’, allows the user to select their results from stories published by BBC News Online. The final tab on the right, ‘Results from the Web’, allows the user access to the results from the filtered web results provided to the BBC by a third party.

For example, at the time of writing, a search for ‘Iraq’ from the bbc.co.uk homepage would combine results from four datasets. In the main body of results are the three ‘Best links’ selected by our editorial team from the BBC site – an article on ‘After Saddam’, a country profile and a feature on ‘The Lost Palaces of Iraq’. Below this, the most recent relevant news story is inserted. Underneath this comes the regular run of results from our main index – in this case an article from the BBC Schools site and another from Radio 4’s Today programme. Offset to the right, we display some links editorially selected from around the web – here we link to ArabNet and to the British Government’s Foreign & Commonwealth Office information on relations with Iraq.

Making data usable

Just examining the data, however interesting, is not a goal in itself. The goal was to provide a meaningful series of views of the data that could be used by a wide range of people within the BBC to understand how our audience interacted with our search tool. After all, typing free text into a box is one of the most open-ended interactive actions a user can take on a website or intranet site, and it is important to capture this.

I first began providing a daily report of the top 500 searches that had been carried out across the whole site. We discovered our top searches were reasonably static, featuring popular programmes and personalities, as well as wide genres like weather.

My aim was to provide tools that would be useful to editorial teams creating content and to the taxonomy team building our directory of recommended sites. For this reason, it was important to be responsive to their needs and requests. By observing how they worked with the reports, I was able to make improvements.

After examining the report of the most popular keywords used by the audience, the editorial teams would frequently perform a search to check the results the audience would have seen. I therefore turned the keywords appearing on the report into links. When clicked, they open the search results for that term in a new window, saving the editorial team time.

The team also pointed out that it was difficult to keep track of where they had reached when checking reports. Using cascading style sheets, I ensured links were displayed in distinct colours, clearly indicating each visited link.

Measuring specialised areas of content

Log files contain information on which ‘scope’ a search is performed from. And while we require BBC Search to be a global solution to queries, it’s also important to recognise that once a user has navigated to our Music site to hunt for something, the chances are they will be interested in results about music. On the majority of pages of the website, a search will return content specific to that area of the site.

You can demonstrate this very easily on the site. Go to bbc.co.uk/music and search for ‘jam’. Then go to bbc.co.uk/food and search for ‘jam’. You will see the majority of results from each query will be from an area of the site with a particular interpretation of ‘jam’ as a concept: fruit preserve or Paul Weller’s first band.

Since we pass the value of this ‘scope’ through to the log files, we can examine only the searches that have taken place in these areas of the site. This means the editorial team creating the content of the ‘Science & nature’ area of the site can view the search activity that has taken place on their pages. This may inspire them to make new content, to find relevant links on the rest of the internet they can recommend, or make them think carefully about their content labelling schema and branding.
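Filtering and counting by scope is a simple aggregation once the log lines are parsed. A minimal sketch, assuming each parsed record is a dictionary with ‘term’ and ‘s’ (scope) keys as in the log format shown earlier; the scope names here are invented for illustration:

```python
from collections import Counter

def top_terms_for_scope(records, scope, n=30):
    """Most popular search terms within one scope ('' is the default scope)."""
    counts = Counter(r["term"] for r in records if r.get("s") == scope)
    return counts.most_common(n)

records = [
    {"term": "volcanoes", "s": "sciencenature"},  # scope names invented
    {"term": "volcanoes", "s": "sciencenature"},
    {"term": "jam", "s": "music"},
]
print(top_terms_for_scope(records, "sciencenature"))  # [('volcanoes', 2)]
```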

The reports are simultaneously generated as HTML output and a CSV file, allowing search activities to be easily imported into Microsoft Excel for statistical manipulation. They are then sent out as e-mails to relevant team members.
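Generating both outputs from the same report rows might look like the following sketch. The row data is invented for illustration, and the search URL in the links is a hypothetical path rather than the real bbc.co.uk endpoint; the links open the live results for each term in a new window, as described earlier:

```python
import csv
import io

# Invented sample rows: (search term, count).
rows = [("eastenders", 1530), ("weather", 1204)]

# CSV output, ready for import into Excel.
csv_buf = io.StringIO()
writer = csv.writer(csv_buf)
writer.writerow(["term", "count"])
writer.writerows(rows)

# HTML output: each keyword becomes a link that opens the search
# results for that term in a new window. The /search path is a
# hypothetical example.
html = "<table>" + "".join(
    f'<tr><td><a href="/search?q={term}" target="_blank">{term}</a></td>'
    f"<td>{count}</td></tr>"
    for term, count in rows) + "</table>"
```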

In any business, whether publicly or privately funded, it’s important not to waste processing resource on reports that aren’t useful or used. By providing the data in the ‘push’ format of sending an e-mail, I found there was a higher interest in the data, since people got a reminder every single week that it was available for them to look at.

Measuring specific input points

We can tag selected search input boxes across the site by including a special piece of information called a ‘go tag’. It acts as a hidden value in the HTML form that submits the search query. We can subsequently filter our reports to only look at searches from these particular input points. We do this in three places in particular: on the bbc.co.uk homepage, on the recently launched CBBC Search service, and on the pan-BBC feedback pages.

The data is used in different ways. For example, with CBBC Search we can ensure the editorial focus of the service is specifically targeted to the way this particular subset of our audience searches for material – providing specific links and synonyms tailored around children’s interests. On the pan-BBC feedback page we can identify the areas where we are generating the most requests for information. Consequently, we can tailor the text of the page and the options available to focus on these persistent requests. On the bbc.co.uk homepage, the tag on the box allows us to identify the proportion of our users who came to that page to perform a search rather than to browse and navigate, bringing us a better understanding of our audience’s needs.

URLs and spelling

At the BBC, a significant proportion of search terms include URLs. For example, it’s not uncommon to see searches like ‘www.eastenders.co.uk’ or ‘www.bbc.co.uk/smile’. In response, I devised a report that looked for fragments of search terms that might represent URLs – ‘www’, ‘.co’, ‘.uk’, ‘.org’, ‘.net’, ‘.com’, ‘.info’, and ‘http’. By looking at this report our taxonomy team is able to see which URLs our audience regularly searches for. They can then set the URLs as synonyms within our directory.
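The fragment test itself is a simple substring check against the list above. A minimal sketch:

```python
# The fragment list from the report described above.
URL_FRAGMENTS = ("www", ".co", ".uk", ".org", ".net", ".com", ".info", "http")

def looks_like_url(term):
    """True if a search term contains any of the URL fragments."""
    return any(fragment in term for fragment in URL_FRAGMENTS)

print(looks_like_url("www.eastenders.co.uk"))   # True
print(looks_like_url("train tickets to york"))  # False
```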

Typing ‘www.eastenders.co.uk’ into the search box on bbc.co.uk will instruct the search technology to look for pages containing the text string ‘www.eastenders.co.uk’. Using our best links system, we can intercept this search and ensure the first link returned to the user is a link to the bbc.co.uk EastEnders homepage. We have used our understanding of user behaviour to lead them to the right result, even if they have used the search incorrectly.

We can also do this when we see regularly occurring misspellings. One BBC brand that is notoriously difficult to spell is our digital channel for pre-school children, BBC CBeebies. We have a variety of synonyms that will ensure our audience reach their desired destination.

I have ample evidence this approach is beneficial to users. Because Google doesn’t make a similar correction in its search facility, I’ve discovered a number of their lost users end up at my website reading an article about spelling correction using cbbies and ceebeebies as examples. This happened to such an extent, I eventually altered my page to include a prominent link to CBeebies to help direct these lost users on their way.

Alphabetical view

A different view of the data we utilise is an alphabetical listing. Every day a report is produced listing the thirty most popular search terms that start with each letter. I had to consider specific rules for two letters in particular. ‘T’ generates a lot of search terms starting with the word ‘The’. A decision had to be made whether to ignore these, and we found on the whole it was beneficial to do so.

The letter ‘W’ also had to be treated as a special case. A lot of inexperienced users type URLs into the search box. For example, ‘www.bbc.co.uk/cbbc’. We wanted to see search terms that genuinely began with a ‘W’ - not those beginning with the World Wide Web acronym. So in this data view, I discounted any search term beginning with ‘www’.
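A sketch of this alphabetical view, applying the two special cases: terms beginning ‘www’ are discounted altogether, and, on one possible reading of the rule for ‘T’, a leading ‘the ’ is ignored so the term is filed under its next word. The exact handling in the production report may have differed:

```python
from collections import Counter, defaultdict

def alphabetical_view(terms, n=30):
    """Top n search terms per initial letter, with the T and W rules."""
    by_letter = defaultdict(Counter)
    for term in terms:
        if term.startswith("www"):
            continue  # URLs, not genuine 'W' terms
        # File "the ..." terms under the next word (one reading of the rule).
        key = term[4:] if term.startswith("the ") else term
        if key and key[0].isalpha():
            by_letter[key[0]][term] += 1
    return {letter: counts.most_common(n)
            for letter, counts in by_letter.items()}

view = alphabetical_view(["the archers", "weather", "weather", "www.bbc.co.uk/cbbc"])
# view["w"] -> [('weather', 2)]; 'the archers' is filed under 'a'
```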

Search terms that naturally gravitate to the top of these letter-specific lists usually revolve around BBC brands and programmes. So this alphabetical data view is particularly useful for compiling the BBC’s A-Z Index, which is accessible from every page on the bbc.co.uk site by using the A-Z Index link on the global toolbar.

The A-Z index pages consist of one page per letter, containing a categorised list of links to content within the website. Additionally, to reduce the amount of scrolling necessary by the user, the right hand column of each page features a selected list of ‘Popular Links’ for that letter. On the ‘O’ page they include ‘Olympics 2004’, ‘On This Day’ and ‘Only Fools and Horses’.

We also measure the behaviour of users who have navigated to the A-Z index pages and have then performed a search. This suggests they were unable to find what they were looking for in the index. An example of this was a dramatic rise in the number of searches for the word ‘iraq’ made from the A-Z pages themselves in the run-up to the conflict in that country in 2003. The team maintaining the A-Z index was able to see and act upon the demand, adding a link to the latest news about Iraq to the ‘I’ page.

Questions

Each day, a report is compiled on questions that have been typed into the search box. Inexperienced and young users are inclined to enter into an overt dialogue with the search box, so we often see search terms starting with words like ‘How can’ and ‘What is’.

I devised a script that would trawl through the day’s search logs and would extract from them search terms beginning with a list of ‘question’ words. The script identifies and outputs search terms starting with: ‘how can’, ‘how do’, ‘how does’, ‘who is’, ‘who are’, ‘what was’, ‘what do’, ‘what will’, ‘what is’, ‘what are’, ‘when is’, ‘when will’, ‘when was’, and ‘tell me’.
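The core of such a script is a prefix match against that list of question words. A minimal sketch, assuming the search terms have already been lower-cased in pre-processing:

```python
# The question phrases listed above.
QUESTION_STARTERS = (
    "how can", "how do", "how does", "who is", "who are", "what was",
    "what do", "what will", "what is", "what are", "when is",
    "when will", "when was", "tell me",
)

def questions(terms):
    """Keep only search terms that open with one of the question phrases."""
    return [t for t in terms if t.startswith(QUESTION_STARTERS)]

print(questions(["what is dna", "eastenders", "how do planes fly"]))
# ['what is dna', 'how do planes fly']
```

Python’s str.startswith conveniently accepts a tuple of prefixes, so the whole list is tested in one call.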

Examining search terms can be very useful in building the contents of an FAQ page. Instead of having to imagine what the most likely questions from users are going to be, search logs allow you to gather real evidence of the questions actually being asked.

Null results view

Another data view that proved useful was the ‘Null Results’ view. This focussed our attention on where we were missing content. Encoded in our search logs is the number of ‘Best links’ returned from our directory for that particular search term. On a daily basis, the report displays the most popular terms we returned no best links for. This allows our teams to focus on where we are failing to provide a link, without having to wade through lists of words we are already recommending sites for.
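Producing this view is a matter of counting only the terms whose best-link count is zero. A sketch, assuming records parsed into dictionaries with ‘term’ and ‘r’ keys as in the log format shown earlier; the sample data is invented:

```python
from collections import Counter

def null_results(records, n=200):
    """Most popular terms for which no best links were returned (r=0)."""
    counts = Counter(r["term"] for r in records if int(r["r"]) == 0)
    return counts.most_common(n)

records = [                      # invented sample records
    {"term": "glastonbury tickets", "r": "0"},
    {"term": "glastonbury tickets", "r": "0"},
    {"term": "eastenders", "r": "2"},
]
print(null_results(records))     # [('glastonbury tickets', 2)]
```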

Since the aim of the ‘Best links’ directory is to provide editorially recommended links to the largest reach of our audience possible, it is important for us to track new terms that rise in popularity. The theory behind this report is that by iteratively examining the 200 search terms that have become prominent in the last 24 hours, we can better provide relevant links to a larger proportion of our audience.

It is particularly good for noting topical keywords for news stories, or people featuring in the news, on TV or radio. It also gives us an instant view on how users are performing searches for these new concepts, giving us a ready-made set of real-life synonyms for any new topic of interest to our audience.

Conclusions and recommendations

The BBC website is one of the largest and most popular in the world. It gives us an incredible dataset of the actions of our audience to work with. However, the lessons extracted from examining the search logs are lessons that can be applied to any subsection of that audience, and to smaller websites and intranet sites. We consulted regularly with the team working on the search application for the BBC’s own intranet, Gateway, and this allowed them to incorporate much of our learning into their processes. Within any enterprise, you are likely to face the same mixed ability to spell and type, the same human traits of finding many different ways of asking the same question, and the same varied approach to using a search input box as a tool.

If you are undertaking a similar project to analyse your search log files, I recommend you do the following:

  1. Pre-process log files to make sure you can aggregate your data into meaningful results;
  2. Build reports that show data views people can actually use. And if necessary, rebuild them to reflect that use. At the BBC, inserting links into the reports increased their usefulness;
  3. Find ways to ‘push’ the data at people via e-mail and with reminders.

Search-log analysis is an excellent tool for the following purposes:

  1. Identifying gaps in the content you have and that your users are actively searching for;
  2. Identifying areas of content you do have but which are poorly sign-posted, forcing users to use the search function;
  3. Identifying synonyms and common spelling variants for your key concepts and content areas.

Despite the success of our use of search logs, you need to be wary of using them to extrapolate broad generalisations about your entire user base. A significant proportion of your visitors may never use, or very rarely use, your search facility. Equally, judging that your search is a success based on search logs alone is inadvisable. You cannot measure whether your users were satisfied or not with the search results they received. If they don’t come back and refine their search again it may be because they found what they wanted. Or it may be they were so disillusioned with the results they got the first time, they gave up altogether.

The information gleaned from examining this data isn’t going to solve all of the information retrieval issues your internet or intranet site will face. But it is an excellent starting point. It gives a valuable insight into what your users are looking for, what they can’t find and the vocabulary they use to locate the concepts you may describe in a different way.

A search box is often the first port of call for a user when they wish to initiate a dialogue with your site. Make sure that you are listening to what they’re asking for.
