British newspaper robots.txt files

Written by Martin Belam
Published 6 December, 2007

When The Times announced that they would be supporting the ACAP protocol, they did so by re-formatting their robots.txt file to include the new-style directives for search engine spiders. It made me curious about what might be in the robots.txt files of the other major newspapers in the UK, so I thought I'd have a peek.

A robot of death, not a search engine robot

The Times appears to be the only major newspaper site where a concerted effort has been made to keep out particular nuisance spiders. Their robots.txt file explicitly bans 124 robots from spidering the site.
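For anyone wanting to produce a similar count for a robots.txt file, here is a minimal sketch in Python. It groups rules by user-agent and lists the agents that are shut out completely with `Disallow: /`. The function name, the sample file and the robot names in it are all made up for illustration; none of it is taken from The Times' actual file.

```python
def fully_banned_agents(robots_txt):
    """Return the user-agents whose record contains a blanket 'Disallow: /'."""
    banned = []
    current_agents = []   # user-agents named in the record being read
    in_rules = False      # True once the record's rule lines have started
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # blank or malformed line
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rule lines starts a new record.
                current_agents = []
                in_rules = False
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                for agent in current_agents:
                    if agent not in banned:
                        banned.append(agent)
    return banned


# Hypothetical sample, not any newspaper's real file.
SAMPLE = """\
User-agent: BadBot
Disallow: /

User-agent: GreedySpider
User-agent: EmailHarvester
Disallow: /

User-agent: *
Disallow: /private/
"""

print(len(fully_banned_agents(SAMPLE)))  # 3
```

Run against a saved copy of a real file, the length of the returned list gives the number of robots banned outright.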

By contrast, in his piece about ACAP, Ian Douglas at The Telegraph mentioned that they tried to keep their robots.txt file as small as possible. That is true: the file consists of just five lines, allowing all robots and closing off only two directories.

The Daily Mail use their robots.txt instructions to forbid their site being harvested by one particular spider - Alexa's ia_archiver. They also have three very specific disallow instructions:

Disallow: /pages/dmstandard/frame.html?in_bottom=http://www.motors.co.uk/
Disallow: http://img.dailymail.co.uk/i/sponsors/chevrolet
Disallow: /pages/galleries/index.html?in_gallery_id=10837&in_page_id=1055
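To see how rules like these behave, here is a rough sketch using Python's standard `urllib.robotparser`. The `User-agent` lines and the way the rules are grouped are my assumptions, as is the `/pages/live/index.html` test path; only the three `Disallow` values are the ones quoted above.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed for illustration: the User-agent lines and the grouping are
# assumptions. Only the three Disallow values are quoted from the real file.
RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /pages/dmstandard/frame.html?in_bottom=http://www.motors.co.uk/
Disallow: http://img.dailymail.co.uk/i/sponsors/chevrolet
Disallow: /pages/galleries/index.html?in_gallery_id=10837&in_page_id=1055
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

site = "http://www.dailymail.co.uk"
ordinary_page = site + "/pages/live/index.html"  # hypothetical path
gallery_page = site + "/pages/galleries/index.html?in_gallery_id=10837&in_page_id=1055"

# Alexa's spider is refused everywhere.
print(parser.can_fetch("ia_archiver", ordinary_page))

# Other spiders may fetch ordinary pages, but not the listed gallery.
print(parser.can_fetch("Googlebot", ordinary_page))
print(parser.can_fetch("Googlebot", gallery_page))
```

The second quoted rule is written as a full URL rather than a path, so parsers may differ in how they treat it. The sketch deliberately does not test that rule.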

The Guardian's robots.txt file is also lightweight, and notably does one thing that I'd recommend all sites with a similar function do. They disallow the spidering of any of the 'print' layout versions of their pages:

Disallow: /*/print$

This ensures that search engines only grab the main version of the page, and don't clog up their indexes with duplicate copies of the same material in different layouts.
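The rule relies on two wildcard extensions that the major search engines understand: `*` stands for any run of characters, and a trailing `$` pins the match to the end of the URL. Here is a minimal sketch of that matching logic in Python. The `rule_matches` helper and the example paths are hypothetical, and this is not a description of how any particular spider implements it.

```python
import re


def rule_matches(pattern, path):
    """Check a URL path against a robots.txt rule with '*' and '$' wildcards.

    '*' matches any run of characters. A trailing '$' requires the rule to
    match the whole path; without it the rule is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


rule = "/*/print$"

# Hypothetical paths, for illustration only.
print(rule_matches(rule, "/media/2007/dec/06/story/print"))  # True
print(rule_matches(rule, "/media/2007/dec/06/story"))        # False
print(rule_matches(rule, "/media/print/edition"))            # False
```

The third example shows why the `$` matters: without it, any URL that merely contains a `/print` segment would be blocked as well.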

The Sun's version of robots.txt is interesting chiefly for the fact that it gives away the address of The Sun's Polish edition, which was apparently mocked up as part of an internal staff competition. Listing the address as a disallow is intended to keep out the prying eyes of search engines, but at the same time it lets us humans know exactly where to find it. Or at least where to find it yesterday afternoon: it appears to be 404ing right now.

The Sun in Polish

I wonder if their Polish language RSS feeds work - because their English language ones certainly don't at the moment.

Three of the newspapers I looked at didn't appear to have a robots.txt file at all - The Daily Express, Daily Star and The Independent. Of course, it isn't mandatory to have one, and they aren't doing anything wrong if they don't have any specific instructions for search engines, but it is generally held to be good website admin to have one.

Not having a robots.txt also rather implies that nobody checks the server error logs for those three sites very regularly. The robots.txt file is one that is called time and time again, and if you don't have one, your error logs are clogged up with entirely unnecessary lines of 404 errors, generated as polite spiders check to see if they should obey any specific instructions.
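As a rough way of measuring that clutter, the sketch below counts requests for /robots.txt that were answered with a 404. It assumes log lines in the common or combined log format used by many web servers; the format, the `count_robots_404s` name and the sample lines are all assumptions for illustration, not anything taken from the newspapers' servers.

```python
import re

# Picks the requested path and the status code out of a log line such as:
#   192.0.2.1 - - [06/Dec/2007:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 404 209
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})')


def count_robots_404s(log_lines):
    """Count requests for /robots.txt that returned a 404 status."""
    count = 0
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path, status = match.group(1), match.group(2)
        # Ignore any query string when comparing the path.
        if path.split("?")[0] == "/robots.txt" and status == "404":
            count += 1
    return count


# Made-up sample lines in the assumed log format.
SAMPLE_LOG = [
    '192.0.2.1 - - [06/Dec/2007:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 404 209',
    '192.0.2.2 - - [06/Dec/2007:10:00:05 +0000] "GET /news/index.html HTTP/1.1" 200 5120',
    '192.0.2.3 - - [06/Dec/2007:10:01:00 +0000] "GET /robots.txt HTTP/1.0" 404 209',
    '192.0.2.4 - - [06/Dec/2007:10:02:00 +0000] "GET /missing.html HTTP/1.1" 404 209',
]

print(count_robots_404s(SAMPLE_LOG))  # 2
```

On a real server you would pass in the lines of the log file itself, for example by opening it and iterating over the file object.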
