“Searching 1,000 years of history at the National Archives” - Tim Gollins at Enterprise Search Europe

by Martin Belam, 1 November 2011

At the recent Enterprise Search Europe conference where I was talking about search on Guardian Books and the future of search, the most entertaining and illuminating talk I saw was by Tim Gollins, Head of Digital Preservation at the National Archives.

He described the National Archives as having 1,000 years of history literally sitting in cupboards. Although the technology had changed, the purpose of the collection, to provide the right piece of information to the right people at the right time, had not. Back in the day, he said, you had medieval clerks with complicated information structures written on parchment. The massive bound records they store are not just the records and documents themselves; they are the search engine. And every era developed new ways of indexing information because there were different user needs. There is nothing new in what we are talking about with search engines, he said; the idea of business and user needs is literally thousands of years old. The red lines etched into the Domesday Book are actually a system of in-line references and a thesaurus.

Search and the web have introduced a new tier of users to the archives, and a new way of delivering information. Like most sites, they started by having a little box in the top right-hand corner, and users generously typed in 2.5 words, expecting the machine to sift through those thousands of years of history. Clearly, they didn’t get great results.

Tim and his colleagues reassessed the problem, and asked themselves “What is going wrong? What are we trying to do?”

They realised that they were trying to integrate huge rafts of different types of data through the search engine, and that the commercial technology they were using couldn’t achieve this integration. They were also trying to serve enormously different audiences simultaneously: a member of the public searching for granddad’s war record, a civil servant who wants advice on the best records management process, and a schoolkid trying to complete a homework project. “Search simply hasn’t a hope in dealing with that variety,” said Tim.

People constantly said to them “but if you go onto Google and type ‘national archive’ plus the thing you want, you get it straight away.” I think that remark is the bane of anybody who runs any kind of site search engine. The National Archives team put it to an empirical test, and found, lo and behold, that Google was better than their site search.

Sort of.

They discovered, of course, that Google doesn’t do what everybody thinks it does. It doesn’t index every single possible thing on the internet. It indexes and ranks popular things on the internet. The National Archives found that only about 40% of their site was indexed by Google. And, naturally, Google could only possibly learn to reference something that had been published on the web and linked to.

They’ve now started building their own search, which you can test during development at discovery.nationalarchives.gov.uk. Whilst some of their stack is .NET, the search uses MongoDB and Solr - a couple of technologies we increasingly use at the Guardian.

The new search uses facets to allow you to refine the results in several dimensions, including time and the government departments that generated the documentation. It also includes some rather user-unfriendly squiggly references to document collections and locations - but then the librarians need these. A key thing to understand about searching the National Archives is that digitising the whole collection is an impossible task. Tim pulled a figure from the air and reckoned they had 1 billion pages of information. They have something like 20 million items in the index at the moment, but he anticipates that will easily increase to 100 million in the next few years.
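To make the idea of facets concrete, here is a minimal sketch in Python. It is not how the National Archives built their search: Solr provides faceting itself, and this sketch does not use Solr at all. The records, field names, and the `era` grouping are all invented for illustration. It simply shows the two halves of faceted search: counting how many hits fall under each facet value, and narrowing the hits when a value is picked.

```python
from collections import Counter

# Hypothetical catalogue entries; titles, departments and years are invented.
RECORDS = [
    {"title": "Prison hulk muster roll", "department": "Admiralty", "year": 1848},
    {"title": "War service record", "department": "War Office", "year": 1916},
    {"title": "Convict transport ledger", "department": "Home Office", "year": 1852},
    {"title": "Hulk victualling accounts", "department": "Admiralty", "year": 1851},
]


def era(year):
    """Group a year into a coarse time facet, e.g. 1848 -> '1800s'."""
    return f"{year // 100 * 100}s"


def facet_value(record, facet):
    """Read a facet from a record; 'era' is derived from the year."""
    return era(record["year"]) if facet == "era" else record[facet]


def search(records, term, filters=None):
    """Return records whose title contains term and match every chosen facet."""
    filters = filters or {}
    hits = [r for r in records if term.lower() in r["title"].lower()]
    return [
        r for r in hits
        if all(facet_value(r, f) == v for f, v in filters.items())
    ]


def facet_counts(hits, facets=("department", "era")):
    """Count how many hits fall under each value of each facet."""
    return {f: Counter(facet_value(r, f) for r in hits) for f in facets}


if __name__ == "__main__":
    everything = search(RECORDS, "")
    print(facet_counts(everything))
    print(search(RECORDS, "", {"department": "Admiralty", "era": "1800s"}))
```

The counts are what a results page would show next to each refinement link; choosing one simply re-runs the search with an extra filter.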

A search for ‘Belam’ in the archives reveals amongst other things a potential distant relative who commanded a prison hulk in Bermuda in the 1800s, but I can’t actually read the documents. The search engine is looking at the index, and letting me know that there is a reference to a ‘Belam’ in a box somewhere that will be crammed full of papers from the time.

Tim said that the system had in part been designed to reduce footfall at the archives, because it would make it easier for people to research remotely. In fact, the opposite seems to be the case, as it has opened up to public scrutiny the copious indexes of non-digitised content.

At the end of his talk, Tim Gollins explained how the National Archives had gone on a journey that many businesses and organisations have yet to make when they think about search. Five years ago, he said, they thought they could solve the problem by “chucking technology at it.”

They learnt that this could sometimes make things worse, as it gives an illusion of precision that doesn’t exist. He now says: “Technology isn’t the problem with search. Think about your information architecture model and your business problem first, and then choose which bit of the business problem to solve.”
