Martin Belam - "Linked data / Linked stories" at FutureEverything

by Martin Belam, 12 May 2011

A brief return from my self-imposed blogging exile this week - today I am in Manchester talking on a panel about data journalism at the FutureEverything festival and conference. This is the script that I’ve written for my five minute opening monologue. It seems way too long to fill just five minutes...


I often feel a bit of a fraud talking at these kinds of events. Just as at Hacks/Hackers events where I am neither a hack nor a hacker, I’m here on a panel about data journalism whilst being neither a journalist nor a day-to-day programmer using data. When I get invited to something like this, it usually means I’m acting as a proxy for my colleague, Datablog editor Simon Rogers.

However, my role as Lead User Experience & Information Architect at The Guardian places me in the central intersection of a Venn diagram that features journalism, technology and “the audience formerly known as the readers”. Chris Heathcote describes web design in 2011 as “wrangling an invisible data stream”, and certainly on a news site, whether it is data-driven journalism, the attention data that shapes navigation, or the raw server logs, data is increasingly important.

There are four main points I think worth raising today.

Data quality is paramount. And variable.

There is an increasing volume of data being released as the exhaust fumes of the digital revolution, whether these are official state datasets, or the data trails left by users interacting with the web. The quality of data is variable though. A considerable amount of the Guardian Datablog team’s time is taken up with cleaning data up into a publishable state.

And even when it is clean, it isn’t always reusable.

The Guardian has a “Rosetta Stone” spreadsheet which tries to unify the identifiers across datasets, but even with relatively small datasets like the names of nation states, you’ll find one source referring to Burma, and another to Myanmar.

This is problematic for people trying to process or re-combine the information. In fact, the heaviest users of the Datablog can be the most vocal critics of the shortcomings - something we try to take as a positive sign of active engagement in what we are trying to do.
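As a rough illustration of what an identifier "Rosetta Stone" does, here is a minimal Python sketch. It is not the Guardian's actual spreadsheet: the alias table, the choice of ISO 3166-1 alpha-3 codes as the shared identifier, and the function names are all assumptions made for the example.

```python
# Hypothetical alias table: each key is a name variant seen in some dataset,
# each value is the one identifier chosen to stand for that country.
# Using ISO 3166-1 alpha-3 codes here is an assumption for illustration.
ALIASES = {
    "burma": "MMR",
    "myanmar": "MMR",
    "uk": "GBR",
    "united kingdom": "GBR",
    "great britain": "GBR",
}


def canonical_id(name):
    """Return the shared identifier for a country name, or None if unknown."""
    key = " ".join(name.lower().split())  # normalise case and stray whitespace
    return ALIASES.get(key)


def merge_by_country(*datasets):
    """Combine {name: value} datasets under shared identifiers.

    Names with no known alias are collected separately so that a person
    can review them, rather than being silently dropped.
    """
    merged = {}
    unmatched = []
    for dataset in datasets:
        for name, value in dataset.items():
            cid = canonical_id(name)
            if cid is None:
                unmatched.append(name)
                continue
            merged.setdefault(cid, []).append(value)
    return merged, unmatched
```

With this table in place, one source listing "Burma" and another listing "Myanmar" end up under the same key, and any name the table does not recognise is flagged for manual cleaning instead of breaking the merge.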

Some of the WikiLeaks data showed an incredible difference between civil and military data. Just as with the UK council spending datasets, the leaked US diplomatic cables featured a lot of inconsistency and idiosyncratic phrasing of information. The War Logs from Iraq and Afghanistan, however, showed such precision in recording and applying metadata to events, that one of our developers, Daithí Ó Crualaoich, has suggested that retiring US servicemen might find gainful civilian employment as sub-editors.

Data is softer than it looks

I’ve seen Michael Blastland speak twice this year, and his statistical approach really makes you question the way that mainstream media and politicians frequently use numbers in a naive way to “prove” arguments.

He identifies three key ways in which this happens:

Small sample sizes lead to wide fluctuations. This generates reports like the FT claiming that a “manufacturing bounce” was happening in small towns, based on some high growth figures in small samples.
Acting when numbers are at a peak. Road casualties, for example, involve an element of chance. Probability alone suggests that an abnormally high number in one location in one year is likely to be followed the next year by a lower one, closer to the mean, whether or not a camera or traffic calming measures have been installed.
Failing to show error margins. Data in the media is often presented in the form of league tables - particularly crime rates and school performance. Very often no margin of error or “confidence” level is given with the figures. Tables give an illusion of “absolute” ranking that the underlying methodology often doesn’t support.
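These effects are easy to reproduce in a few lines of Python. The sketch below is illustrative only: the function names, the 10% risk figure and the number of sites are assumptions invented for the demonstration, not anything taken from Michael Blastland's talks. The margin of error uses the simple normal approximation, which is itself rough for very small samples.

```python
import math
import random


def margin_of_error(successes, n, z=1.96):
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n)


def regression_to_mean_demo(sites=1000, trials=50, risk=0.1, top=50, seed=42):
    """Simulate two years of casualty counts at sites with IDENTICAL risk.

    Nothing changes between the years (no cameras, no traffic calming).
    We pick the sites that looked worst in year one and compare their
    average count in year one with their average count in year two.
    """
    rng = random.Random(seed)

    def one_year():
        # Each site's count is the number of "bad luck" events out of `trials`.
        return [
            sum(rng.random() < risk for _ in range(trials))
            for _ in range(sites)
        ]

    year1 = one_year()
    year2 = one_year()
    worst = sorted(range(sites), key=lambda i: year1[i], reverse=True)[:top]
    avg1 = sum(year1[i] for i in worst) / top
    avg2 = sum(year2[i] for i in worst) / top
    return avg1, avg2
```

A result of 5 out of 10 carries a margin of roughly plus or minus 31 percentage points, while 500 out of 1,000 carries roughly plus or minus 3, which is why small samples swing so widely and why a league table without margins can mislead. In the simulation, the "worst" sites improve in the second year even though nothing was done to them.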

Using search to explore data

Another significant change with the use of data for journalism is searchability. Or rather, we hope, findability.

It used to be that journalists sifted and curated data to determine a story, but now news organisations provide the raw tools to enable story-telling. At The Guardian, the Datablog publishes spreadsheets crammed with the information that powers our journalism. As well as this, we have a World Government Data search engine, which provides a one-stop location to find and compare state data from around the globe. We also produce specific tools like the COINS database explorer looking at government spending figures.

Care must be taken when constructing these tools in order to shape something useful for the audience. For the World Government Data search I used Guardian site search logs to make sure the subject areas we highlighted matched the interests of our users. As I’ve written before, Guardian Lead Interactive Technologist Alastair Dant cautions that:

“You can have wonderful visions of the audience putting on their magic Matrix datagloves and freely exploring reams of data, but without proper guidance and editorial framing, they may well find themselves exploring for a long time and discovering nothing.”

I’ve also seen Scott Byrne-Fraser from the BBC interactives team talking at a Hacks/Hackers meeting, warning that:

“Most people don’t want to fine tune a chart and play with the variables. Their attitude is ‘You guys get paid to work out what’s important here - just tell me’”

Lie dream of linked data soul*

For me the really exciting potential for saving journalists time and helping them in uncovering stories is in the sphere of linked data. I dream of a world where everything has a permanent interoperable unique identifier and URI.

Answering questions with data can be a slog, but effectively linked and cross-referenced data could transform painful research tasks into simple database queries. Zach Beauvais of Talis recently gave this example of the potential of linked data in an article I commissioned for FUMSI magazine.

Say you wanted to investigate whether University funding cuts were falling disproportionately on those that didn’t have a Conservative as their local MP. If you had a spreadsheet of every spending increase and decrease from the universities, you could also lookup the postcode of each university. With that postcode, now matched to the institution, you could find the constituency it resides in and identify the sitting MP. With the MP known, you just need to match them with their party. Finally, it would only take a quick comparison between parties and you’ve got your answer.

Each of those steps is a relatively simple piece of data processing, but it is linking them up in a chain that has the potential to save hours and hours of research time. We are some way off that scenario yet, but the direction of travel seems set. Governments around the world have been moving towards more open data, and more linked data with the launches of data.gov.uk and .gov and .govt.nz and others. Volunteers and developers have also given us a glimpse of the possibilities by building services like Open Corporates.
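To make the chain in Zach Beauvais's example concrete, here is a minimal Python sketch that treats each step as a lookup in a dictionary. Every record in it is an invented placeholder (the universities, postcodes, constituencies, MPs, parties and funding figures are not real data), and the function names are hypothetical. In a genuinely linked-data world the lookups would be queries against shared identifiers rather than hand-built tables.

```python
from collections import defaultdict

# All records below are invented placeholders, not real figures.
funding_change = {"University A": -4.0, "University B": 1.5, "University C": -2.0}
university_postcode = {
    "University A": "AA1 1AA",
    "University B": "BB2 2BB",
    "University C": "CC3 3CC",
}
postcode_constituency = {
    "AA1 1AA": "Northtown",
    "BB2 2BB": "Southtown",
    "CC3 3CC": "Eastville",
}
constituency_mp = {
    "Northtown": "MP One",
    "Southtown": "MP Two",
    "Eastville": "MP Three",
}
mp_party = {"MP One": "Party X", "MP Two": "Party Y", "MP Three": "Party X"}


def party_for_university(name):
    """Follow the chain: university -> postcode -> constituency -> MP -> party."""
    postcode = university_postcode[name]
    constituency = postcode_constituency[postcode]
    mp = constituency_mp[constituency]
    return mp_party[mp]


def average_change_by_party(changes):
    """Group funding changes by the party of the local MP and average them."""
    grouped = defaultdict(list)
    for university, change in changes.items():
        grouped[party_for_university(university)].append(change)
    return {party: sum(values) / len(values) for party, values in grouped.items()}
```

Calling average_change_by_party(funding_change) gives one average per party, which is the "quick comparison" at the end of the chain. The hard part in practice is not this code but getting each table to agree on its identifiers.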

The Guardian is trying to link up our content with the wider web using persistent external references. For example, using our API you can enter an ISBN to see if we have reviewed a book, or use a MusicBrainz ID to see if we have a keyword tag for a particular artist.

The challenges for linking up news stories are considerable of course. Will it be possible in future to have an interoperable news taxonomy that allows the user to “follow” stories like the death of Osama Bin Laden across a variety of news sources, because it has been defined as “a story” with a unique identifiable ID? Or is the best that we can hope for some interoperable index of broader topics?

Next...

The panel in Manchester today consists of Paul Bradshaw, Chris Taggart and David Higgerson, and is being chaired by Sarah Hartley. I’m sure after our opening salvos we will have a really good debate about the good, bad and ugly of data journalism. I expect I’ll have some notes from the rest of FutureEverything when I resume this blog later in the year.

* FutureEverything is in Manchester, so it seemed only right to cram an oblique Mark E. Smith reference into this piece somehow...



3 Comments

while that was definitely informative and interesting, i'd like to wish you luck on fitting that speech into a 5 minute time frame LOL.... you just better hope that no one interrupts and asks a question.

i must say i admire how you are a guest speaker on panels which you even admit you don't belong. it shows how much respect others in the industry have for you.

Tyler | 13 May 2011

@Martin Belam: your article is good but please next time write in more clear words,so that people like me who dont understand english very well could understand your articles.

Anonymous | 13 May 2011

Hope you managed to squeeze it in to your five minutes opening. Interesting opening. Make me want to hear more!

Svend Hansen | 15 May 2011
