Newspaper "Site Search Smackdown": Round 5 - The Newspapers vs Google
Over the last four days I've been pitting British newspaper site search engines against each other in a fight to the death to see which had the freshest index - the Newspaper Site Search Smackdown. The Independent triumphed, followed closely by the Daily Express and the Daily Mail.
The exercise was prompted by the fact that Google has launched "Search in search" boxes for several of those papers. Kevin Anderson argued that this was an improved user experience for most newspapers, as their own search facilities were poor. As well as comparing the newspapers with each other, I wanted to compare them with three Google-powered search services - Google Web Search, Google News, and Chipwrapper, my Google Custom Search engine that concentrates on British newspapers.
First thing in the morning on Saturday April 5th, I made a note of the ten most prominent headlines on each of eight major British newspaper sites. I then searched for each of those exact headlines on the newspaper's own search service, and also on Google, Google News and Chipwrapper. Services were scored on whether or not they had managed to index the day's news quickly.
As we saw from the first four rounds of the Newspaper Site Search Smackdown, the results for the newspapers' own search services were a mixed bag. The Independent and the Daily Express were the only two newspapers that had managed to index every one of their major stories by 9am on the morning they had been published. The Daily Mail may also have indexed all ten, but a bug in their search system prevented me from testing one headline.
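As a rough illustration of that scoring procedure, here is a minimal Python sketch. It is not necessarily how these tests were run: `score_service`, `normalise`, the `top_n` cut-off and the stand-in `fake_search` function are all hypothetical names and assumptions, and a real test would plug in whatever returns the result titles from an actual search box.

```python
from typing import Callable, Iterable, List


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def score_service(
    headlines: Iterable[str],
    search: Callable[[str], List[str]],
    top_n: int = 10,  # assumption: only the first page of results counts
) -> int:
    """Count the headlines whose exact text appears in the top result titles."""
    found = 0
    for headline in headlines:
        wanted = normalise(headline)
        titles = [normalise(title) for title in search(headline)[:top_n]]
        if any(wanted in title for title in titles):
            found += 1
    return found


# Hypothetical stand-in for a real search service: a tiny in-memory index.
FAKE_INDEX = ["Budget row deepens", "Cup final preview"]


def fake_search(query: str) -> List[str]:
    wanted = normalise(query)
    return [title for title in FAKE_INDEX if wanted in normalise(title)]


headlines = ["Budget row deepens", "Cup final preview", "Storm warning issued"]
score = score_service(headlines, fake_search)
print(score)  # prints 2: the third headline is not in the toy index
```

Running the same list of headlines through one `search` function per service would give a score out of ten for each paper and each service.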
The Guardian and The Mirror put in a so-so performance, with only 7 of their top 10 stories available via search results when you searched for the associated headline.
The really poor results came from The Sun, The Telegraph and The Times. The Sun and The Telegraph had only managed to include half of their major stories of the day in their search index, whilst The Times couldn't even manage that. Only 4 stories from the front page had made it into the search results.
Google Web Search
Newspaper search indexes ought to be built with access to their CMS, allowing them to index stories faster than a regular search engine spider crawl could manage, so the test for Google was to see if it could crawl, index and rank stories as fast as these closed web infrastructures.
There used to be a time when search engines were not so fussy about crawling news content - but one of the unexpected impacts of the terrorist attacks on the USA in September 2001 was a change in the way news was handled on the web. All the search engines realised it wasn't good enough that when people were searching for 'world trade center' on 9/11, they were returning results offering webcam views from a building that was collapsing, with no reference to the momentous events of the day. Since then, news content has been crawled and indexed with increasing frequency, over and above the regular crawls and re-indexing of the web that the search engines carry out.
Google demonstrated a ruthless efficiency in crawling newspaper sites during my tests. In all cases except one (the Daily Express), Google matched or exceeded a newspaper's ability to index and recall its own content within a few hours of it being published.
Across the whole of my tests on eighty stories, only three stories could be found by a newspaper site search, but weren't in the top results on Google after a direct search for their headline. "Victoria Beckham snapped posing in a shop mirror" from The Mirror, "British Muslims in airliner terror plot 'wanted to take wives and children on their suicide missions'" from the Mail, and the rather curiously headlined "Stardate, today: Kiely's log...stop Kranjear and Pompey" from the Express were the elusive 3 stories.
By contrast, there were 21 stories that could be found via Google, but not by searching the newspaper that published them. That represents 26% of the 80 stories in the test. It was quite clear from the figures that Google has the edge over newspaper search engines - but would Google's specific news product do even better?
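The comparison between site search and Google comes down to two set differences and a percentage. The Python sketch below shows the idea on made-up story identifiers (the per-story data is not reproduced here), then checks the arithmetic using only the counts quoted in the text. The function names are hypothetical.

```python
from typing import Dict, Set


def compare_coverage(site_found: Set[str], google_found: Set[str]) -> Dict[str, Set[str]]:
    """Split stories into those found by one service only, or by both."""
    return {
        "google_only": google_found - site_found,
        "site_only": site_found - google_found,
        "both": site_found & google_found,
    }


def share_of_sample(count: int, sample_size: int) -> int:
    """Whole-number percentage of the sample."""
    return round(100 * count / sample_size)


# Toy identifiers, purely illustrative.
toy = compare_coverage({"a", "b"}, {"b", "c", "d"})
print(sorted(toy["google_only"]))  # ['c', 'd']
print(sorted(toy["site_only"]))    # ['a']

# Counts taken from the text: 80 stories, 21 Google-only, 3 site-only.
print(share_of_sample(21, 80))  # 26
print(share_of_sample(3, 80))   # 4
```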
Google News
Google's News service was designed not just to bring a personalised, algorithmically determined view of the world to users, but also to allow them to search through the stories from those news sources.
Interestingly enough, it turned out that Google News was less effective than the general Google web search at turning up recently published stories from British newspapers. There were 22 stories in my sample of 80 that were indexed in Google's general web search, but which could not be found in Google News.
However, the picture isn't as simple as just being an indexing issue. Google has a deal in place with Associated Press whereby the search engine hosts copies of AP stories on Google's servers, and displays them directly in Google News results. That means that newspapers which simply re-publish AP content find their version of the story de-duped from the Google News results in favour of the original copy.
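To make the de-duplication effect concrete, here is a minimal Python sketch. It is emphatically not Google's algorithm, which is not described here: it simply assumes that syndicated copies share identical body text once whitespace and case are normalised, and that one source is treated as the preferred original. `Story`, `fingerprint` and `dedupe` are hypothetical names, and the sample stories are invented.

```python
import hashlib
from typing import Dict, List, NamedTuple


class Story(NamedTuple):
    source: str
    headline: str
    body: str


def fingerprint(body: str) -> str:
    """Hash of the body text after lower-casing and collapsing whitespace."""
    normalised = " ".join(body.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def dedupe(stories: List[Story], preferred_source: str) -> List[Story]:
    """Keep one story per distinct body, favouring preferred_source if present."""
    chosen: Dict[str, Story] = {}
    for story in stories:
        key = fingerprint(story.body)
        current = chosen.get(key)
        replaces_copy = (
            current is not None
            and story.source == preferred_source
            and current.source != preferred_source
        )
        if current is None or replaces_copy:
            chosen[key] = story
    return list(chosen.values())


stories = [
    Story("Daily Example", "Storm hits coast", "A storm  hit the coast."),
    Story("AP", "Storm hits coast", "a storm hit the coast."),
    Story("Daily Example", "Local fete opens", "The fete opened today."),
]
kept = dedupe(stories, preferred_source="AP")
print([story.source for story in kept])  # ['AP', 'Daily Example']
```

In this toy run the newspaper's copy of the wire story drops out in favour of the agency version, while its own original story survives.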
Chipwrapper
Chipwrapper is a search engine for British newspapers that I put together last year using free and open source tools. The main component is a Google Custom Search Engine, which only returns results from the websites of the major papers.
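A Custom Search Engine is configured through Google's own tools rather than in code, but a similar restriction can be approximated in an ordinary web search with `site:` operators. The Python sketch below builds such a query string. It is a swapped-in technique for illustration, not how Chipwrapper works, and the function name and the list of domains are assumptions.

```python
from typing import Iterable


def site_restricted_query(headline: str, domains: Iterable[str]) -> str:
    """Build an exact-phrase query limited to the given domains."""
    # Drop double quotes from the headline so the phrase quoting stays balanced.
    phrase = headline.replace('"', "")
    sites = " OR ".join(f"site:{domain}" for domain in domains)
    return f'"{phrase}" {sites}'


# Illustrative domain list only; a real list would cover all the major papers.
PAPERS = ["independent.co.uk", "express.co.uk"]

query = site_restricted_query("Budget row deepens", PAPERS)
print(query)
# "Budget row deepens" site:independent.co.uk OR site:express.co.uk
```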
Although I expected the results to be very similar to Google's general web results, I wanted to check to see if there was any latency in Google's indexing between the two services. It seemed that there wasn't. Across the 80 searches I only found 3 cases where the results significantly differed. One Daily Express story could be found in Google, and not in Chipwrapper, and a Mirror and a Mail story could be found on the first page of Chipwrapper results, but not on Google. Otherwise the set of data was identical.
The overall conclusion has to be that Google absolutely rules the roost on these tests. Only the Daily Mail, Daily Express and The Independent got close to the indexing speed of the search engine giant. When I re-tested some selected content later the same day, I found there were some stories, from The Times for example, which had been indexed by Google 14 hours previously, but which still could not be retrieved by search on The Times site itself.
If the question posed by Google's "Search in search" feature is 'Can Google do a better job of indexing and retrieving newspaper content than newspapers themselves?', then the answer seems to be very much yes, it can, and does.
Table: stories indexed by each service (columns: Newspaper, Site search, Google Web, Google News, Chipwrapper)
In Round 6
So we've seen that Google can index content faster than most newspapers can, but how does it stack up against competitor search engines? In the final round of the Newspaper Site Search Smackdown, I'll be testing the speed of Google's indexing against that of The Search Engine All-Stars - Yahoo!, Ask and MSN Live.
The BBC is another one - I rarely use their site search because it's so bad.
nb Martin - the digg this! link from your feed is chopping your headlines up for some reason.