Richard Pope talks about ScraperWiki at The Guardian

by Martin Belam, 20 October 2010

One of the ways that our developers keep abreast of what is happening outside Guardian.co.uk is by regularly inviting people in to give "Tech talks". Last week we were visited by Richard Pope of ScraperWiki, a service which has been attracting a great deal of interest in the data journalism community. Here are three key points that I took from the talk:

It is sometimes easier to learn to program than to learn tools

Richard made the point that learning some of the tools to capture and structure data is often much harder than learning the few lines of code you'd need to achieve the same effect. The problem, of course, if you know nothing about programming, is finding out exactly which seven lines of code you need out of the millions you could potentially learn. If you are at all technically minded, it is easy to forget just how intimidating the command line or a terminal window is for people used to working in a GUI. That is one of the reasons people tolerate bad software user experiences.
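To give a flavour of quite how few lines a basic scrape can take, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. This is my illustration rather than anything shown in the talk, and the URL and table structure are assumptions:

```python
# A minimal scraping sketch. The URL and page structure are
# illustrative assumptions, not an example from the talk.
import requests
from bs4 import BeautifulSoup

# Fetch a page; example.com stands in for any data-bearing page.
html = requests.get("https://example.com/planning-applications").text

# Parse it and print every populated row of any table on the page.
soup = BeautifulSoup(html, "html.parser")
for row in soup.select("table tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:
        print(cells)
```

Stripping out the comments, that really is about seven lines, which is Richard's point: the code is small, but knowing that these are the seven lines you need is the hard part.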

Our developers have dark thoughts...

...not in an evil stalking serial killer kind of way, but in a Dr Pepper "what's the worst that could happen?" way of imagining everything that might go wrong. Quite a few of the questions from our software team at the Guardian focused on failure modes. How did ScraperWiki cope with people maliciously altering datasets, with lots of tasks being scheduled against a single URL as a form of denial of service attack, or with the service being used to violate copyright and IP?
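The denial of service worry is easy to make concrete: a scheduler that doesn't throttle per-host requests can hammer a target site. Here is a hedged sketch of the sort of per-domain rate limiting a service like this would need; it is my illustration, not ScraperWiki's actual scheduler, and the interval is an arbitrary assumption:

```python
# Sketch of per-domain rate limiting (my illustration, not
# ScraperWiki's actual implementation).
import time
from urllib.parse import urlparse

import requests

last_fetch = {}      # domain -> timestamp of the last request
MIN_INTERVAL = 5.0   # assumed minimum seconds between hits on one domain

def polite_get(url):
    """Fetch a URL, but never hit the same domain more than once
    every MIN_INTERVAL seconds."""
    domain = urlparse(url).netloc
    wait = MIN_INTERVAL - (time.time() - last_fetch.get(domain, 0))
    if wait > 0:
        time.sleep(wait)
    last_fetch[domain] = time.time()
    return requests.get(url)
```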

Scraping is hard, but a scraping tool is valuable

Richard Pope showed a graph of how, over time, the number of planning applications being measured by the site Planning Alerts decayed. He said that whilst it could simply be an indication of the recession, it was more likely an indication of how scrapers degrade over time if they are not maintained. By their very nature, scrapers are brittle: they depend on the exact structure of someone else's HTML, and tend to break silently when that structure changes.
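That silent breakage is exactly what produces the slow decay in a graph like the one Pope showed: a scraper that quietly returns zero rows when a site's markup changes just logs empty days. A hedged sketch of the kind of sanity check that turns decay into a fixable alert, with an assumed selector and threshold:

```python
# Sketch of a drift check for a scraper. The CSS selector and
# threshold are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

EXPECTED_MIN_ROWS = 10  # below this, assume the markup has changed

def scrape_applications(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    rows = soup.select("table.applications tr")
    if len(rows) < EXPECTED_MIN_ROWS:
        # Fail loudly instead of silently recording an empty day.
        raise RuntimeError(
            f"Only {len(rows)} rows scraped from {url}; "
            "the page structure may have changed."
        )
    return [row.get_text(" ", strip=True) for row in rows]
```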

Given that in the UK the state is increasingly releasing reliable datasets, there is an open question about how necessary a tool like this will remain. He pointed out that in the UK we have Tim Berners-Lee knocking on the government's door to get everything published in formats that are nice-to-play-with. But this impetus for data release is not yet global. Pope argued that to really monitor, for example, the global footprint of a company like Tesco, there will for the foreseeable future be plenty of data for which scraping websites into a database is the only option.
