“To clean or not to clean” - Placr’s Jonathan Raper talks open data at FutureEverything
At FutureEverything I took part in a panel session about data journalism, and also saw a couple of presentations that touched on how the use and reuse of data was making a difference to 21st century services.
One of those presentations came from Placr’s Jonathan Raper, who set out a manifesto for open transport data - with the proviso that “manifestos are not finished, or necessarily right”.
Raper said that the principle behind open transport data is the assumption that releasing it will help us save money, create jobs, increase transparency, and apply scrutiny and competition to areas of transport infrastructure that are currently closed. The challenge, he said, is to reorganise both central and local government to get the data out there. The consequence of delay is that fragile start-ups relying on re-purposing that data may not succeed.
Data must be free as in “free beer” as well as “free speech”, he argued, because start-ups need to keep their costs low. He derided the use of complex licences - a start-up doesn’t have time to spend two days reading the documents, he said, and then three weeks teasing out their real meaning with the lawyers.
Another issue he identified was that of privileged operators gaining an unfair advantage. As an example, he cited businesses that had built apps around the data produced by the London bicycle hire scheme, only to see scheme sponsor Barclays release an app that could access additional data not available through the public API. Had they known in advance that this was happening, those businesses might have chosen to invest in different apps.
Raper was passionate and persuasive with his message:
“Please let me still be here in a year’s time, and not bankrupt and a footnote in history because we didn’t move fast enough in freeing data.”
There was one area of his talk where I don’t think there was universal consensus. He argued that the public sector is wasting a lot of time and money in “cleaning” data before making it available for public re-use.
In response to a question from someone at the British Museum about best practice, Raper argued that a lot of the tidying up of data involves doing things that Placr wouldn’t necessarily do - i.e. the removal of database fields that the public body doesn’t foresee a use for. “The more raw the data is”, he said, “the better it is for us”. He claimed that if the point of data release is to save money, then “any time spent cleaning it up is wasted public money”.
I’m unconvinced by his argument myself.
Firstly, as we know at the Guardian, a lot of the work involved in publishing datasets on the Datablog comes from data being issued in poor formats. It isn’t clear to me that Raper’s approach - issuing data full of errors and letting the private sector tidy it up individually, potentially many times over - is better than encouraging the public sector to adopt better data practice.
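To make that concrete, here is a minimal sketch of the kind of cleanup that otherwise gets repeated by every re-user of a dataset. The file name, column names and date formats here are all hypothetical, but a mix of date styles, currency symbols and stray whitespace is typical of what “poor formats” means in practice.

```python
import csv
from datetime import datetime

# Hypothetical source file with mixed date formats, currency
# symbols and stray whitespace - the column names are invented.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d %b %Y")

def parse_date(raw):
    """Try each known date format; return ISO 8601 or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def parse_amount(raw):
    """Strip currency symbols and thousands separators."""
    try:
        return float(raw.strip().lstrip("£$").replace(",", ""))
    except ValueError:
        return None

with open("spending.csv", newline="") as src, \
        open("spending_clean.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["date", "supplier", "amount"])
    writer.writeheader()
    for row in reader:
        writer.writerow({
            "date": parse_date(row["date"]),
            "supplier": row["supplier"].strip(),
            "amount": parse_amount(row["amount"]),
        })
```

If every app developer has to write and maintain something like this independently, the duplicated effort across the private sector can easily exceed the cost of a single cleanup done once at source.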
Secondly, there are scenarios where the public sector absolutely has a duty to pre-process datasets before giving them public release. At the BBC, for example, a prototype for releasing the decades’ worth of programme data stored in the INFAX database eventually succumbed to a host of legal issues. Had the BBC simply turned the data over to a third party without doing due diligence on it, it could have been very costly. Likewise you might recall the furore when AOL provided a dump of “anonymised” search log data for research purposes, but didn’t do enough work to fully anonymise it. Any public body in the UK offering user-submitted data for analysis would need to ensure strict privacy standards were adhered to.
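As an illustration of the AOL problem, here is a tiny sketch with invented log rows. Simply hashing the user ID (pseudonymisation) is not anonymisation: all of one person’s queries still share a key, and the query text itself can identify them, which is how journalists managed to re-identify individual AOL users.

```python
import hashlib

# Invented example rows - not real AOL data.
logs = [
    ("user-1832", "plumbers near elm street, smalltown"),
    ("user-1832", "divorce solicitors smalltown"),
    ("user-1832", "j. bloggs & sons family business"),
]

def pseudonymise(user_id):
    """Hash the ID. This is deterministic, so the same user always
    maps to the same key, keeping all their queries linkable."""
    return hashlib.sha256(user_id.encode()).hexdigest()[:12]

for user_id, query in logs:
    print(pseudonymise(user_id), query)

# Taken together, the linked queries still point to a specific
# person and place - removing the raw ID was not enough.
```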
Overall though I really enjoyed Jonathan’s talk, and he makes a powerful advocate for opening data. You can see some of the work his company have done at Placr.co.uk/gallery.