"Understand your data!" - Iain Fletcher on optimising search technology at Online Information 2010
Last week I was presenting at the Online Information conference in London with my colleague Peter Martin, where we gave a talk on "Mapping the Guardian's tags to the web of data". As well as moderating that session, I got to attend a few other talks during the course of the event.
Yesterday I blogged about a session I attended about social media in the enterprise featuring Gordon Vala-Webb, Helen Clegg, Hugo Evans & Angela Ashenden. Today I wanted to share my notes from a talk given by Iain Fletcher, who works at Search Technologies, and who was standing in for Paul Nelson. He started by describing what the company does in quite bullish fashion:
"We install search engines properly"
His point was that they don't resell a specific technology, and can therefore approach implementing search 'the right way'. Instead of getting carried away with whizzy new features, they stress to their clients that the key to decent search engine results is the dull work of getting the underlying data structure right. Iain also stated that, when approaching the problem of search, 90% of their clients end up at some point in the process saying "But why can't it work just like Google does?".
I've spoken before about the difference between site search and web search being domain expertise and knowledge. So, for example, if you are searching for a document you know exists on an enterprise or site search engine, you put in a query that you expect to retrieve the document you already have in mind. If the search engine fails to return it, you blame the search technology.
By contrast, if you search Google for topic x on the Internet, it will return you some documents on topic x. You don't have comparable domain expertise across the whole web of documents to judge whether these are truly the best ones, and you seldom have one specific document in mind when you search Google.
In his talk, Iain made a couple of interesting points about the difficulty of replicating the Google experience on site search, both related to the way that Google effectively "crowd-sources" some of the factors in their ranking algorithm.
The thing that set Google apart from other engines like AltaVista or HotBot when it arrived on the scene was the genius of the PageRank algorithm: Google has a map of how the whole of the internet regards the value of any individual document. In an enterprise or site search situation, whilst you may be able to use your own internal linking structure as a ranking factor, you do not normally have any data on how the rest of the world rates the documents in your index.
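The talk didn't go into the mechanics of PageRank, but the core idea - that a page's importance is crowd-sourced from the pages linking to it - can be sketched in a few lines. This is a minimal illustrative power-iteration, run on a made-up toy link graph rather than anything from the talk:

```python
# A minimal sketch of the PageRank idea: iterate until each page's score
# settles on the chance a "random surfer" lands there. The link graph
# below is a hypothetical toy example, not data from the talk.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links out to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Each page keeps a small baseline, plus a share of the rank
        # of every page that links to it.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly across all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

graph = {
    "home": ["about", "docs"],
    "about": ["home"],
    "docs": ["home", "about"],
}
scores = pagerank(graph)
# Pages with more inbound link weight end up with higher scores.
```

The point of the sketch is the one Iain was making: the scores come entirely from the link graph, which is exactly the signal a site search engine usually lacks beyond its own internal links.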
He also pointed out that the net effect of the $1bn+ SEO industry is to improve Google's results for Google. Of course, whilst some SEO activity can be directed to shady dealings, he used the example of his own company targeting the specific phrase "fast ESP services": it had taken them 12 iterations of content to get exactly the right document length, keyword density, inbound links and metadata to take the #1 spot. When was the last time you heard of anyone at a publisher or within an enterprise iterating their content 12 times just to make sure they got the top spot on their internal search technology? It just doesn't happen.
Iain had observed an increasing trend of companies giving up on training their users to write better search queries - the most you can expect to get out of people is a query of around 2.7 words. Instead, guided or faceted navigation was becoming very important for site search. Techniques that let users click links and make choices which increase the complexity of the query behind the scenes, whilst still leaving them free to type in a couple of words, seemed to be the way to go.
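To make that concrete, here is a hypothetical sketch of the faceted-navigation idea: the user types their couple of words, and each facet they click quietly adds a filter clause to the real query. The field names and the Lucene-style fielded syntax are my assumptions for illustration, not anything Iain specified:

```python
# A made-up sketch of faceted navigation: facet clicks grow the query
# behind the scenes while the user only ever typed two words.

def build_query(user_text, selected_facets):
    """Combine free text with facet selections into one query string
    (Lucene-style field:"value" syntax, purely for illustration)."""
    clauses = [user_text.strip()] if user_text.strip() else []
    for field, value in selected_facets.items():
        clauses.append(f'{field}:"{value}"')
    return " AND ".join(clauses)

# The user typed two words; two facet clicks did the rest.
query = build_query("budget report", {"year": "2010", "department": "finance"})
# → 'budget report AND year:"2010" AND department:"finance"'
```

The design point is that the complexity lives in the clicked facets, so the typed part of the query can stay at those 2.7 words.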
I really enjoyed Iain's presentation - and given that he was using someone else's slides, I was impressed with the way he delivered it. He talked about "giving your data the respect it deserves", and how keeping an enterprise or site search well-tuned was an ongoing process of data management, rather than a case of setting it up once and leaving it to run.
In the next post based on my notes from Online Information 2010, I'll be blogging about John Sheridan's enlightening talk on 'parsing data from legislation'.