British newspaper robots.txt files
When The Times announced that they would be supporting the ACAP protocol, they did so by re-formatting their robots.txt file to include the new-style directives for search engine spiders. It made me curious about what might be in the robots.txt files of the other major newspapers in the UK, and so I thought I'd have a peek.
The Times appears to be the only major newspaper site where a concerted effort has been made to keep out particular nuisance spiders. Their robots.txt file explicitly bans 124 robots from spidering the site.
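Banning that many robots by name means one User-agent group per spider, each denied the whole site. A sketch of the pattern (the bot names here are invented, not taken from The Times's actual list):

```
# One group per unwanted spider; "/" denies the entire site
User-agent: ExampleScraperBot
Disallow: /

User-agent: AnotherNuisanceBot
Disallow: /
```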
By contrast, in his piece about ACAP, Ian Douglas at The Telegraph mentioned that they tried to keep their robots.txt file as small as possible. This is true: it consists of just five lines, allowing all robots in and closing off only two directories.
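That "allow everyone, fence off a couple of directories" shape is about as minimal as a useful robots.txt gets. As a sketch only - the directory names below are illustrative, not The Telegraph's actual paths - it looks like this:

```
# Applies to every robot
User-agent: *
# Keep spiders out of two directories (paths made up for illustration)
Disallow: /example-private/
Disallow: /example-search/
```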
The Daily Mail use their robots.txt instructions to forbid their site from being harvested by one particular spider - Alexa's ia_archiver. They also have three very specific disallow instructions.
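Shutting a single named spider out of the whole site takes the standard robots.txt form - a one-character path, `/`, matches everything. This is a sketch of the technique, not a copy of the Daily Mail's file:

```
# Ban one specific robot from the entire site
User-agent: ia_archiver
Disallow: /
```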
The Guardian's robots.txt file is also lightweight, and notably does one thing that I'd recommend all sites with a similar function do: they bar spiders from crawling any of the 'print' layout versions of their pages.
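A directive of that kind is just a path prefix under the catch-all user agent; the `/print/` prefix below illustrates the technique rather than quoting The Guardian's actual file:

```
User-agent: *
# Keep printer-friendly duplicates out of search indexes
# (the path prefix is illustrative)
Disallow: /print/
```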
This ensures that search engines only grab the main version of the page, and don't clog up their indexes with duplicate copies of the same material in different layouts.
The Sun's version of robots.txt is interesting chiefly for the fact that it gives away the address of The Sun's Polish edition. The edition was apparently mocked up as part of an internal staff competition, and listing its address as a disallow is intended to keep out the prying eyes of search engines - but at the same time it lets us humans know exactly where to find it. Or at least where to find it as of yesterday afternoon, as it appears to be 404ing right now.
I wonder if their Polish language RSS feeds work - because their English language ones certainly don't at the moment.
Three of the newspapers I looked at didn't appear to have a robots.txt file at all - The Daily Express, Daily Star and The Independent. Of course, it isn't mandatory to have one, and they aren't doing anything wrong if they have no specific instructions for search engines, but having one is generally held to be good website administration.
Not having a robots.txt also rather implies that nobody checks the server error logs for those three sites very regularly. The robots.txt file is requested time and time again, and if you don't have one, your error logs get clogged up with entirely unnecessary 404 entries, generated as polite spiders check to see whether there are any specific instructions they should obey.
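Even a file that restricts nothing will silence those 404s. The minimal "no instructions" robots.txt is two lines - an empty Disallow value means nothing is off limits:

```
# Allow every robot everywhere; exists only so requests return 200, not 404
User-agent: *
Disallow:
```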