Wikipedia deletions and #linkeddata implications

By Martin Belam, 9 June 2010

If you missed the update to my post on Tuesday about the deletion of First Aid Kit's Wikipedia page, I'm pleased to be able to point out that within a couple of hours of blogging about it, a Wikipedia administrator saw the post and went and restored the page. I now, of course, feel somewhat honour-bound to contribute and improve the quality of it. My post made an appearance in the Guardian Tech newsbucket, and also on the Y Combinator Hacker News feed, which between them sent a deluge of traffic that the site wasn't able to cope with very well - sorry about that.

The opening comment on the Y Combinator thread was well reasoned but critical of my piece:

"This guy is upset that Wikipedia deleted an article about a Swedish indie-rock band that appears to have been covered in depth by one alt-weekly in Vancouver and nobody else --- an article, incidentally, that was posted and deleted 3 times before that alt-weekly one-pager was published.
Wikipedia is not an effort to organize all the world's information. That's Google. Wikipedia is an effort to build the world's best encyclopedia. The difference between an 'encyclopedia' and 'all the world's information' is that the information in an encyclopedia needs to be reliable. To ensure that the information in the encyclopedia has a chance at being reliable, the encyclopedia is constrained to information that can be written about notable topics cover[ed] in reliable secondary sources.
Virtually everybody who writes an article about a non-notable topic ends up objecting, often loudly, when the article is deleted. That's understandable. Wikipedia could do a better job of warning people of the bar their topic needs to clear. But they can't make resurrection of deleted articles trivial to anybody, or they will spend all their time re-litigating deletions.
The likelihood that the particular 'speedy deletion' policy this article complains about will ever be resolved is epsilon. Speedy deletion, particularly of no-name bands, vanity books, websites, and tiny companies is almost the first line of defense against article-creep. Changing the policy would be an existential change to the way WP is managed.
Which doesn't matter, because you can resurrect speedy'd articles already; you just need to take the article to Deletion Review and make a case for it. Maybe WP needs an article on First Aid Kit. I like Fleet Foxes, too! (WP has excellent coverage of Fleet Foxes). But WP is run by human beings donating their time, and people make mistakes, and it is utterly disingenuous to pretend like First Aid Kit is an obvious 'keep'."

I just want to stress again that it wasn't my article, and although I like the band, I'm not some kind of super-fanboy.

What worried me about it - and still does - is that away from the Wikipedia community, we are building a whole linked data ecosystem which relies, in one way or another, on Wikipedia. It is no accident that dbpedia - a project to extract structured reliable data from the wiki - is one of the biggest and most central nodes on the #linkeddata diagram.

Linked Data Universe July 2009

Sites are increasingly relying on constructions like a MusicBrainz ID pointing at a Wikipedia entry which in turn has a dbpedia equivalent full of lovely, lovely structured data about the subject. An inclusive Wikipedia that accepts that it has no paper limit and therefore can be utterly comprehensive (if harder to maintain) is central to the premise.
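To make that chain concrete, here is a minimal sketch in Python. It relies on the convention that an English Wikipedia article at /wiki/Title has a DBpedia resource at http://dbpedia.org/resource/Title. Everything else is an assumption made for illustration: the function names, the lookup table standing in for a MusicBrainz-to-Wikipedia link, the placeholder ID (not a real MBID) and the article title used. DBpedia's exact URI-encoding rules are also simplified here.

```python
from urllib.parse import quote, unquote, urlparse

DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(wikipedia_url):
    """Map an English Wikipedia article URL to a DBpedia resource URI.

    Simplified: real DBpedia encoding rules have more edge cases.
    """
    parsed = urlparse(wikipedia_url)
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        raise ValueError("not an English Wikipedia article URL: " + wikipedia_url)
    # Decode the title, normalise spaces to underscores, then re-encode it.
    title = unquote(parsed.path[len("/wiki/"):]).replace(" ", "_")
    return DBPEDIA_RESOURCE_PREFIX + quote(title, safe="(),'")


def resolve_structured_data_uri(artist_id, wikipedia_links):
    """Follow artist ID -> Wikipedia URL -> DBpedia URI.

    `wikipedia_links` is a hypothetical dict standing in for whatever
    service maps an artist ID to a Wikipedia page. Returns None when
    there is no Wikipedia page to point at.
    """
    wikipedia_url = wikipedia_links.get(artist_id)
    if wikipedia_url is None:
        return None  # no article, so no structured data to link to
    return wikipedia_to_dbpedia(wikipedia_url)


# Hypothetical data: the ID is a placeholder and the title is illustrative.
links = {
    "placeholder-artist-id-0001": "http://en.wikipedia.org/wiki/First_Aid_Kit_(band)",
}

print(resolve_structured_data_uri("placeholder-artist-id-0001", links))
print(resolve_structured_data_uri("placeholder-artist-id-0002", links))
```

The second lookup returns None, which is the failure a site hits when the Wikipedia article it depends on has been deleted: every link downstream of the article disappears with it.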

The risk is that Wikipedians who believe it should have a much more limited scope gain the upper hand. Inclusion rules for musical artists might begin to involve chart appearances, or sales thresholds, or any number of other limiting factors, which will be fine if you only want to build #linkeddata sites that revolve around The Beatles and Michael Jackson, but will mean you are going to struggle to use the data to go beyond the mainstream.

I prefer my Wikipedia as a wide-ranging collaborative knowledge repository, not as an arbiter of taste or cultural impact.

They aren't always going to get it right.

As someone pointed out in the comments thread on Y Combinator, at one point the Lady GaGa entry was deleted.


I had no idea that Wikipedia was highly selective. Given the many inaccuracies I've seen there, this is a revelation to me.

Lady GaGa was deleted, yet no one reckons an entry about Ampere Way tram stop isn't worthwhile...

That says it all to me about Wikipedia.

To be fair to Wikipedia, the Linked Data implications of its encyclopaedia are really only a minor secondary consequence of its main aim of being a content resource for humans.

You can see this tension in 'list' pages (such as 'List_of_Harry_Potter_supporting_characters') - where it'd be better from a Linked Data perspective to have a page-per-character, but from a human perspective, it's arguably better to have them on one page, given that there's so little to say about each character.

When it comes to bands, albums, books, films, etc, there's bound to be a huge gap between 'everything that's ever been published' and what's on Wikipedia. There are 'only' 3.3 million articles on Wikipedia at the moment, after all (whereas the number of books published in the last ten years must exceed this).

For Linked Data, I suspect that it's useful to have a central resource limited to 'notable' generic topics, with more specialised resources (MusicBrainz, BBC Programmes, etc) which cover a particular domain comprehensively...

DBpedia relies on Wikipedia because Wikipedia does its job well. As soon as you make it harder to maintain, it doesn't do its job as well... people simply don't have the time to ensure that the information is correct. And then it doesn't make sense to use it as the source for DBpedia anymore.

I personally know of hundreds of bands in the Pittsburgh area that meet the criteria of having an alt-weekly write an article. By setting the bar this low, I don't just think you make Wikipedia harder to maintain, you make it unmaintainable. You basically bring all of MySpace into Wikipedia and expect volunteers to fact check it.

Alt-weeklys and other periodicals are really the place for bands that haven't established themselves. Encyclopedias are the place to record things that were historically momentous enough to actually affect a group of people's lives.

I'm an avid supporter of local indie music and don't think that data from these less mainstream bands is unimportant, but I don't think it belongs in the Wikipedia or DBpedia dataset.

My main point isn't about the merits of First Aid Kit, but I do find it astonishing that nobody wanting to defend Wikipedia appears to be doing anything more than citing internal Wikipedia logic about them not being notable, and that nobody is fact-checking the assertion that they've only appeared in alt-weeklys. It takes only seconds on the web to find coverage of the band in The Guardian, on the BBC and in Dazed and Confused, and to learn that they are signed to the UK's Wichita label, which has a track record of producing successful bands...
