Does 'Linked Data' need human readable URIs?

by Martin Belam, 1 March 2010

Last week I went to the 2nd London Linked Data meet-up, and one of the topics that came up was whether open linked data should have human readable URIs.

Now, a few days ago I was pointing out how The Guardian's URL combiner structure made it easy to add /football/ashley-cole to /culture/cheryl-cole to get a landing page joining the stories about the two together at guardian.co.uk/culture/cheryl-cole+football/ashley-cole. It is the human readability that makes our URLs easily hackable. [1]
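To make the "hackable" point concrete, here is a minimal sketch of that combining step. The function name `combine` is made up for illustration, and the only thing it assumes is what the example above shows: two tag paths joined with a `+`. It says nothing about how The Guardian's site actually implements the feature.

```python
def combine(*paths: str) -> str:
    """Join tag paths with '+', mirroring the combined URL quoted above.

    Hypothetical helper: it only reproduces the visible URL pattern.
    """
    # Drop leading/trailing slashes so the pieces can be joined cleanly.
    parts = [p.strip("/") for p in paths if p.strip("/")]
    if not parts:
        raise ValueError("need at least one tag path")
    return "/" + "+".join(parts)


combined = combine("/culture/cheryl-cole", "/football/ashley-cole")
print("guardian.co.uk" + combined)
```

A person can do the same thing by hand in the address bar, which is really the whole argument for readable paths.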

In one of the panel sessions at the Linked Data meet-up, there was some debate about whether open linked data URIs should be human-readable. The implementation of persistent human readable URIs is tricky, but personally I favour the human readable every time.

During the meet-up session Jeni Tennison stressed that she was also in favour of the human readable. A question from the floor said they were concerned that by using the English language and imposing data structure in URIs we were limiting persistence and reusability. Jeni thought it was right that the UK Government should 'mint' URIs in English, and that the Chinese could 'mint' URIs in Chinese. She argued that 'machines don't need structure or readability, but humans do, and it is humans that write the programs that have to process the data'.

Tom Scott from the BBC took the opposite view. For areas where there would be lots and lots of individual bits of data, like programmes or music artists, he was sure that machine readable URIs were the right approach. Indeed, he said that he regretted the areas where the BBC had based their URL structure on copying Wikipedia's URL structure, because it wasn't stable enough. He said he would 'choose persistence over readability every time'. [In the comments below Tom suggests I've misquoted him]

Let's be honest though, this is all just about picking the right level of abstraction.

At some point, every URI or URL is a representation of some binary numbers in a database. Even if you have to maintain a complex matrix of redirect instructions, it is an awful lot quicker for a machine to recognise that /culture/cheryl-cole needs to point to /culture/cheryl-tweedy than it is for a person to parse her MusicBrainz ID of 2d499150-1c42-4ffb-a90c-1cc635519d33.
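As a rough illustration of that "matrix of redirect instructions", here is a sketch of a redirect table and a resolver that follows it. The table contents come from the example above; the names `REDIRECTS` and `resolve`, and the hop limit, are assumptions made for the sketch rather than a description of any real system.

```python
# Old path -> new path. In practice this would live in a database;
# a dict is enough to show the idea.
REDIRECTS = {
    "/culture/cheryl-cole": "/culture/cheryl-tweedy",
}


def resolve(path, redirects=REDIRECTS, max_hops=10):
    """Follow redirects until reaching a path that no longer moves."""
    seen = set()
    while path in redirects:
        # Guard against loops (a -> b -> a) and runaway chains.
        if path in seen or len(seen) >= max_hops:
            raise ValueError("redirect loop or chain too long at " + path)
        seen.add(path)
        path = redirects[path]
    return path


print(resolve("/culture/cheryl-cole"))
print(resolve("/football/ashley-cole"))
```

The lookup is trivial for the machine. The cost is in keeping the table complete every time a human-readable label changes.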

Next...

Tomorrow I'll have a round up of the remaining notes I made on the day.



[1] As asked on currybetdotnet the other day in a comment by Andrea Chellin: 'Am I the only one who finds it somewhat ironic that the tabloid Cheryl/Ashley Cole story should be filed under 'culture' on the Guardian site?'

17 Comments

Thanks very much for this, Martin!

The, um, discussion over whether identifiers in the digital, networked environment must be (or even SHOULD be) "human readable" has ranged since the mid-1990s; for example, the publishing industry fretted for years over whether the DOI should enforce opacity or readability in its syntax.

As you aptly put it, this is all just about picking the right level of abstraction... Another way to express this is, what matters is how we want our identifiers to be used. In the linked data world we need to maximize this reach. I would argue that not only do machines not care, but as with the data they point to, implicit claims that URI generators make within their syntax can't be trusted!

Every HTTP URI, whether (apparently) human readable or opaque, must be treated as merely a link. What really matters is the useful information that has been published using the standards and verified using similar and consistent methods, using that link.

thanks for posting on this subject, martin. well said.

i believe the question is simple: how much does it cost the Guardian editorially and operationally to "mint" those human-readable URLs? it's an amount of money above "none". if the cost is "worth it" to the org and product, then human-readable is the way to go. but how can this be objectively determined?

i've been thinking about this since the debate last week, and i reckon it'd cost the BBC: ~£300,000-400,000 as a one-off to take care of all the programmes from the archive of approx 85 years, then ~£200,000 per year going forward... the cost is mainly for metadata-editor staff to type in unique, human-meaningful stubs, and for the tools they'd need to use.

this would give a unique, human-readable identifier/URL stub for each and every programme (as in Waking the Dead), tho NOT for each and every episode (as in episode 4, series 5 of Waking the Dead).

so, to ensure the BBC had unique, human-readable identifiers/URL stubs for each and every programme, the cost would be significant. (and note that this finger-in-the-air estimate is just for programmes data, not Music or, golly, News...)

but is the cost of human-readable URLs for vast datasets prohibitive? that's a subjective business call -- & i don't think we know enough about how to quantify the perceived benefit to the user of human-readable URLs yet. nor, for that matter, the benefit of persistent IDs.

which is one reason why this debate is somewhat religious in tone, i guess.

I wonder if the debate over human readable URIs is a result of the predominant web browser UI design. That is to say, we directly expose the URI to the user, hence we feel we need to allow them to parse some meaning from it.

At its heart, however, the URI is meant to uniquely identify concepts, for the benefit of the machine.

I know it's a big 'if', but if we moved to structure our information around unique concepts, then we wouldn't have to directly expose the raw URI to the general audience - and thus human-readability would be less of an issue...

Some response to this on Twitter as well...

@olyerickson: Persistent: +1 Human readable: -1

@fantasticlife: hhmm. is not that they're not desirable. just whether they're worth the human effort it takes to create and maintain?!?

@gkob: I agree with @derivadow's argument that stability of URIs beats readability. still, I can see the desire for readability.

@frankieroberto: hmm, not sure it's as simple as "picking the right level of abstraction". I must write up my notes on that debate too...

@Paul people have been suggesting for years that URLs might cease to become part of the user interface (see Jakob Nielsen giving them a 3-5 year shelf-life in 1999: http://www.useit.com/alertbox/990321.html). However, I don't believe that this will ever happen.

Mainly because, in a self-reinforcing way, URLs DO currently contain information that's really valuable to the users. It's for this reason that Google shows the URLs so prominently in search results, for instance - they give the user a way of quickly verifying both who the site is from (the domain name part) and what the key subject of the page is (the path).

I also believe that they're equally important as a user interface for developers. Because even though you're not, in theory, supposed to construct and deconstruct URLs based on their paths, in reality, this is often quicker and easier than fetching the data. The URL design also gives a hint as to the structure of the site and its underlying data, which helps frame the mental model of developers building on top of it.

These two reasons are why I believe it's better to be pragmatic, and design your URLs to be as understandable and hackable as possible (within the constraints of practicality and cost), rather than to be dogmatic and insist they don't matter.

@Frankie - agreed, better to be pragmatic in the short term and cater for what's out there at the moment - but I'd hope that in the long run there was serious investigation of alternative UIs.

We need (in so many areas!) to look beyond the immediate and think of alternatives, rather than accepting what's current as being always true...

(much easier said than done, and probably not all true!)

Karen Loasby, who works at the RNIB, threw this into the mix on Twitter: "I know I'm getting a bit single issue but I also have to think about it as human-listenable URIs"

I think this one is gonna haunt us forever :-)

Guess the first point is this isn't a LOD issue (although the general web of data makes it more important). Even if you're not publishing RDF, don't care about information resources vs non-information resources and aren't too interested in the finer points of content negotiation, persistence is obviously important. Even if you're just including microformats, adding rel=me and rel=contact links, you run the risk of publishing some pretty dubious claims to the web if your URIs move / get overwritten.

Anyway, persistence is important to users who bookmark things and link to things. (During the recent bad weather my daughter's school updated their website, kept the old news page up and didn't redirect to the new one, which resulted in a wasted and wet walk to a closed school for me. It's not the worst thing in the world, but it's easy to imagine situations where the consequences could be more severe. Seem to half remember there was a story that wiped 1/4 the share value off an airline cos it turned up in the wrong place when someone refactored the Reuters API URIs.)

Obviously persistence is also important to search engines. If your pages move and you don't 301 your google juice disappears. but i realise i'm preaching to the converted on that one :-)

The trouble is whilst paul's probably correct in saying uris weren't really intended to be part of the UI (the earliest web browser kept them well hidden), through a combination of marketing and immersion people have got used to seeing them and typing them.

So how do you get persistence and human readability when human labels are ambiguous? Usual BBC example is there have been 7 BBC programmes called "the office" before the one that got famous. Which one gets /theoffice and which gets /theoffice2? At some stage you need human intervention, and that costs money.
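A naive sketch shows why the machine can't settle this on its own. The helpers below (`slugify` and `mint` are invented names, and the numbering rule is just one possible convention) hand out slugs first come, first served, so whichever programme happens to be registered first gets /theoffice, whether or not it is the one people are looking for.

```python
import re


def slugify(title):
    """Lower-case the title and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def mint(title, taken):
    """Return an unused slug for title, appending 2, 3, ... on collision.

    `taken` is the set of slugs already handed out; it is updated in place.
    """
    base = slugify(title)
    slug, n = base, 1
    while slug in taken:
        n += 1
        slug = base + str(n)
    taken.add(slug)
    return slug


taken = set()
print(mint("The Office", taken))  # first programme registered
print(mint("The Office", taken))  # second programme with the same title
```

Deciding that the famous one deserves the unnumbered slug is an editorial judgement, which is exactly the human intervention being described.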

As Chris says generating unique human readable identifiers has an associated cost and it's not just a one-off cos once they're readable people will want to change them so you have the added cost of maintaining redirects as this happens.

Guess it's just a call about how much you value persistence and whether the money you spend on matching human-readable to persistent would be better spent elsewhere (like on content)

(NB There's also the issue of "Cool" URIs http://www.w3.org/Provider/Style/URI)

I'm of the opinion that URIs should be human readable BUT that doesn't preclude having a "simple" machine readable URI as well.

All BBC programmes seem to have an ID.
b00phwkz is Russell Howard's Good News.
So we have
http://www.bbc.co.uk/programmes/b00phwkz
and
http://www.bbc.co.uk/iplayer/episode/b00p8h5p/

Great for computers, but not for us pesky humans.

You could have
www.bbc.co.uk/programmes/Russell_Howards_Good_News_Series_1
and
www.bbc.co.uk/iplayer/Russell_Howards_Good_News_Series_1_Episode_8/
Which could 302 it to the "machine" readable URI.

T

@terence We kinda do. For top level objects only but try:

http://www.bbc.co.uk/programmes/Russell_Howards_Good_News

In this case there's 2 items. if there were only one it'd 302 to the pid based uri...
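The behaviour described in this exchange can be sketched roughly as follows. This is only an illustration of the pattern, not the BBC's code: the `INDEX` contents, the `lookup` function and the placeholder PIDs (`xxxxxxx1`, `xxxxxxx2`, and the `Example_Programme` title) are all made up, with only b00phwkz taken from the comment above.

```python
# Title slug -> list of matching PIDs. Placeholder PIDs are invented
# for illustration; only b00phwkz appears in the discussion.
INDEX = {
    "Russell_Howards_Good_News": ["b00phwkz", "xxxxxxx1"],
    "Example_Programme": ["xxxxxxx2"],
}


def lookup(slug, index=INDEX):
    """Return (status, target) for a human-readable programme URL."""
    pids = index.get(slug, [])
    if len(pids) == 1:
        # Exactly one match: redirect to the persistent PID-based URI.
        return 302, "/programmes/" + pids[0]
    if pids:
        # Several matches: serve a page listing the candidates.
        return 200, ["/programmes/" + p for p in pids]
    return 404, None


print(lookup("Example_Programme"))
print(lookup("Russell_Howards_Good_News"))
```

The readable URL acts as a convenience on top, while the PID-based URI stays the thing that other people link to.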

Just to add quickly, I think I agree with the judgement that having human-readable URIs for programme episodes isn't worth the extra effort, given the huge number and complexities of them (eg not all programmes are called "Series X, Episode Y" - many simply have numbers, or titles, or dates, and so on).

However, for 'animal' type pages, I think it's worth re-using the Wikipedia slugs, rather than the BBC minting new ones - even though simply re-using Wikipedia slugs also has a cost & effort implication (given that the URLs do change).

So judgements will differ in different circumstances, but I do think that it is a genuine trade-off, rather than persistence (which non-human-readable URIs helps with) ALWAYS beating human-readability.

BTW Even if you're using non-human-readable URIs, human usability still comes into play. The PIDS used by BBC programmes are much shorter (and hence usable) than Music Brainz ids. ISBNs even have a form of usability built in through the use of a 'check digit'.
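For anyone curious how a check digit helps, here is a small sketch using the ISBN-13 scheme, where the first twelve digits are weighted alternately by 1 and 3 and the final digit brings the total up to a multiple of 10. The function names are my own, and the sample number is a commonly quoted example ISBN.

```python
def isbn13_check_digit(first12):
    """Compute the ISBN-13 check digit from the first twelve digits."""
    if len(first12) != 12 or not first12.isdigit():
        raise ValueError("expected exactly twelve digits")
    # Weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_isbn13(isbn):
    """True if isbn (hyphens and spaces allowed) has a correct check digit."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    return isbn13_check_digit(digits[:12]) == int(digits[12])


print(is_valid_isbn13("978-0-306-40615-7"))
print(is_valid_isbn13("978-0-306-40615-8"))  # last digit mistyped
```

A single mistyped digit changes the weighted sum, so the identifier can be rejected before anyone goes looking for the wrong book.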

I think we need to take a step back for a moment and look at what people are subliminally trying to do using human readable URIs. Often the conversation revolves around their predictive or hackable nature. So what are people trying to do? Search, of course. And that's the behaviour we build into the /music and /programmes URIs. You're invoking a GET request on a search URI and getting back a resource that corresponds to the results of that search. You can 302 if the result set is no greater than one, but that just seems like an implementation detail.

@Anthony That's a nice feature.

As we start doing more and more of this (publishing data as linked data at well-designed URLs), I think some of the different options, and the best design patterns, will start to emerge through experience.

I think I've been a little misquoted here :)

I would choose persistence over human readability (but not for linked data reasons).

The instability of Wikipedia URLs and therefore BBC Wildlife Finder URLs is a bit of a pain but I certainly don't regret using Wikipedia URL slugs. What I regret is including .../species/..., .../genus/... etc. in the URL. I regret it because it means that you need to know that, for example, a gorilla is a genus if you want to guess the URL, i.e. they aren't human guessable unless you know a bit of biology.

Hi Tom, I've added a link in the body of the article to your comment - apologies if I misinterpreted what you said - I was taking notes using predictive text on my N95 for that session. That's my excuse and I'm sticking to it...

Incidentally, I've published a follow up blog post to this discussion.

@Martin I don't think you were the only one to have misunderstood Tom's comment... Sorry Tom!

The point about the /:genus/ part being a mistake is much more understandable! (My knowledge of the Linnean taxonomy is pretty poor too)

Hi Martin, I believe Normally-generated URLs are barely human readable. I agree I'm in favor of human readable urls too for they are easy to memorize and familiarize rather than the clunky ones. It's more of putting up a nice URL made up of meaningful elements.
