“To clean or not to clean” - Placr’s Jonathan Raper talks open data at FutureEverything

 by Martin Belam, 10 June 2011

At FutureEverything I took part in a panel session about data journalism, and also saw a couple of presentations that touched on how the use and reuse of data was making a difference to 21st century services.

One of the most passionate of those talks was by Jonathan Raper of Placr.

He was presenting a manifesto for open transport data - with the proviso that “manifestos are not finished, or necessarily right”.

Raper said that the principle behind open transport data is the assumption that releasing it will help us save money, create jobs, increase transparency, and apply scrutiny and competition to areas of transport infrastructure that are currently closed. The challenge, he said, is to reorganise both central and local government to get the data out there. The consequence of delay is that fragile start-ups relying on re-purposing that data may not succeed.

Data must be free as in “free beer” as well as in “free speech”, because businesses need to have low costs. He derided the use of complex licences - a start-up doesn’t have time to spend two days reading the documents, he said, and then three weeks teasing out the real meaning with the lawyers.

Another issue he identified was that of privileged operators gaining an unfair advantage. As an example, he cited businesses that had built apps around the data produced by the London bicycle hire scheme, only to see sponsor Barclays release an app that was able to access additional data not available through the public API. Had they known this was going to happen, those businesses might have chosen to invest in different apps.

Raper was passionate and persuasive with his message:

“Please let me still be here in a year’s time, and not bankrupt and a footnote in history because we didn’t move fast enough in freeing data.”

There was one area of his talk where I don’t think there was universal consensus. He argued that the public sector is wasting a lot of time and money in “cleaning” data before it is made available for re-use by the public.

In response to a question from someone at the British Museum about best practice, Raper argued that lots of the tidying up of data involves doing things that Placr wouldn’t necessarily do - i.e. the removal of database fields that the public body doesn’t foresee a use for. “The more raw the data is”, he said, “the better it is for us”. He claimed that if the point of data release is to save money, then “any time spent cleaning it up is wasted public money”.

I’m unconvinced by his argument myself.

Firstly, as we know at the Guardian, a lot of the work involved in publishing datasets on the Datablog is down to data being issued in poor formats. It isn’t clear to me that Raper’s approach - issuing data full of errors and letting the private sector tidy it up individually, potentially many times over - is better than encouraging the public sector to adopt better data practice.
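To give a flavour of what that tidying typically involves - and this is purely an illustrative sketch, with made-up file and column names rather than anything Placr or the Datablog actually runs - here is the sort of thing a developer ends up writing to normalise a spending spreadsheet that arrives with mixed date formats, currency symbols and blank padding rows:

```python
import csv
from datetime import datetime

# Hypothetical input: a published spreadsheet export with inconsistent
# date formats, thousands separators and stray blank rows.
RAW_FILE = "spending_raw.csv"      # assumed filename
CLEAN_FILE = "spending_clean.csv"  # assumed filename

DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")

def parse_date(value):
    """Try each known date format and return an ISO 8601 string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return ""  # leave unparseable dates blank rather than guessing

def parse_amount(value):
    """Strip currency symbols and thousands separators, e.g. '£1,234.50'."""
    cleaned = value.replace("£", "").replace(",", "").strip()
    try:
        return f"{float(cleaned):.2f}"
    except ValueError:
        return ""

with open(RAW_FILE, newline="", encoding="utf-8") as src, \
     open(CLEAN_FILE, "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["date", "supplier", "amount"])
    writer.writeheader()
    for row in reader:
        # Skip the blank padding rows that often appear in spreadsheet exports.
        if not any(row.values()):
            continue
        writer.writerow({
            "date": parse_date(row.get("Date", "")),
            "supplier": row.get("Supplier", "").strip(),
            "amount": parse_amount(row.get("Amount", "")),
        })
```

If the raw data goes out as-is, every re-user writes their own version of this script; if the publishing body fixes its formats at source, nobody has to.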

Secondly, there are scenarios where the public sector absolutely has a duty to pre-process datasets before giving them public release. At the BBC, for example, a prototype project to release the decades’ worth of programme data stored in the INFAX database eventually succumbed to a host of legal issues. Had the BBC just turned the data over to a third party without doing due diligence on it, it could have been very costly. Likewise you might recall the furore when AOL provided a dump of “anonymised” search log data for research purposes, but didn’t do enough work to fully anonymise it. Any public body in the UK offering user-submitted data for analysis would need to ensure strict privacy standards were adhered to.
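To make the AOL point concrete, here is a minimal sketch - using a hypothetical search_log.csv and column names of my own invention - of the kind of naive pseudonymisation that often gets mistaken for anonymisation. The user identifier is replaced with a salted hash, but the query text, which is what actually gave people away in the AOL release, is left untouched:

```python
import csv
import hashlib
import secrets

# Random salt, discarded after the run, so the hashes cannot be reversed
# by simply re-hashing a list of known user IDs.
SALT = secrets.token_hex(16)

def pseudonymise(user_id: str) -> str:
    """Replace a raw user ID with a truncated salted hash."""
    return hashlib.sha256((SALT + user_id).encode("utf-8")).hexdigest()[:12]

with open("search_log.csv", newline="", encoding="utf-8") as src, \
     open("search_log_pseudo.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)   # assumed columns: user_id, query
    writer = csv.DictWriter(dst, fieldnames=["user", "query"])
    writer.writeheader()
    for row in reader:
        writer.writerow({
            "user": pseudonymise(row["user_id"]),
            # The query text is copied verbatim: names, addresses and other
            # personal details inside it survive, so the dataset can still
            # re-identify individuals despite the hashed IDs.
            "query": row["query"],
        })
```

That extra work - reviewing what the free-text fields actually contain - is exactly the kind of pre-release processing a public body can’t simply skip in the name of speed.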

Overall though I really enjoyed Jonathan’s talk, and he makes a powerful advocate for opening data. You can see some of the work his company have done at Placr.co.uk/gallery.

3 Comments

Martin,

Thanks very much for getting stuck into this debate about cleaning up public data. I formed my view that data releases should be raw not because I want to see poor quality out there, but because the greater dangers are 1. delay, and 2. the wrong kind of cleaning. An ideal mid point would be a little local spending by local authorities on an 'invest to save' basis to fund local SMEs to clean up the data so that developers can both understand it and expose it as-a-service.

We were promised real time London iBus data in 2009... it's on many of London's bus stops in plain view... but they are still polishing it 2 years later. They are insisting on following through on a project that sees people SMS'ing a code off the bus stop and getting back a message with the next buses. There has been so much cleaning that the data has vanished.

We are also waiting and hoping for raw data from Network Rail... the alternative - cleaned data - has come with the restrictive licensing practised by National Rail Enquiries under a very one-sided Code of Practice (Details here).

So in the best of all worlds... good quality data from enlightened public services is what we want released as open data. But to get the momentum... to build the evidence base for the success of open data... we need, to quote Tim B-L, "Raw Data Now".

Jonathan

Let's say you want data set X.

Do you want it:

a) Now, but in a slightly mucky state? Or

b) At some other random date in the future no earlier than too late for this project, and no later than the total heat death of the universe?

If the originating body is concerned about data quality (all of a sudden, for the first time ever, and I'm sure it's only a coincidence that they were just fine with the 'uncleaned' data right up until the nanosecond that I asked for it, and that pulling this kind of crap allows them to stall the request), then it can fix the data quality on its own time, not mine.

And why is the data crap in the first place?

Because we're not looking at it and shaming the morons generating it in public *already*.

I have been involved in working with public bodies to release data through the Open Data Cities and DataGM project in Manchester, and for the most part agree with Jonathan Raper's position.

"The data isn't in a fit state for release" is not an uncommon argument. Which raises the question: why are the public paying for authorities to create crap data? And why is this being sustained?

Surely if the raw data was released, it would be in the best interests of developers and the public to make the data useable. When bus schedule data was released through DataGM, a developer found a number of anomalies whilst making the data useable for their purposes. The transport authority was notified, and hopefully the anomalies, if verified, have been corrected. These errors wouldn't have been identified if the data had not been released, and would have continued to feed into national datasets without anyone, apart from an end user, noticing.

The standardisation and cleaning approach has slowed things down and given people another excuse to delay release or maintain their own positions.

Swirrl, a company that creates linked data out of public datasets, recently created linked data from the bus schedules, allowing a number of developers to create applications off the back of the enhanced data. This was done by a commercial organisation because they saw a need and an opportunity.

There are numerous GIS systems in operation in public bodies - some have as many as 30 - and each one is seen as the best by its respective department. This way of working is untenable, and is also used as a means of holding up the release of data. You hear stories of public bodies having ten-year convergence plans, but surely this is madness. Through discussions with a number of people in local authorities, it is starting to be recognised that making data available now could create a change in the system that would not be achievable if the process was done internally.
