The software used to access the BBC homepage
Studying the software that visits the BBC homepage
It started with a casual enquiry from a colleague - "I wonder how many Firefox users visit the BBC homepage?" - and before I knew it I was involved in a lengthy statistical analysis of the browsers and operating systems that request the BBC homepage at http://www.bbc.co.uk.
Our old stats reporting tool at the BBC gives a breakdown of requests from different user agent strings, which is where the browsers and operating systems people use to navigate around the web leave their digital fingerprints. It is about to be phased out in favour of a new solution, but I'm not sure that the new system gives the same granularity of data, so once I'd started, I thought I'd look at the figures in some detail before the old system gives up the ghost.
Now if you've never looked at user agent strings, they are rather dull and geeky, and full of lots of technical gubbins like these examples:
- Mozilla/5.0 (Windows; U; Windows NT 5.2; en-GB; rv:1.7.10) Gecko/20050717 Firefox/1.0.6
- Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/85.7 (KHTML, like Gecko) Safari/85.5
- Mozilla/4.0 (compatible; MSIE 6.0; America Online Browser 1.1; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
- Mozilla/4.0 NETIKUS.NET GetHttp v1.0
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Hotbar 220.127.116.11)
- Mozilla/4.0 (compatible; MSIE 6.0; Windows CE)
- Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5) Gecko/20031007 Firebird/0.7
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; BT [build 60A])
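To give a flavour of how strings like these might be picked apart, here is a minimal Python sketch. It is an illustration only, not the BBC's stats tool: the function name `classify`, the rule tables and the labels are all assumptions made for this example, and they cover just the patterns visible in the samples above.

```python
import re
from typing import Optional, Tuple

# Hypothetical rule tables, written for this sketch only.
# Specific tokens are checked in order; nearly every browser also claims to be
# "Mozilla", so that prefix alone tells us nothing.
BROWSER_RULES = [
    ("Firefox", re.compile(r"Firefox/([\d.]+)")),
    ("Firebird", re.compile(r"Firebird/([\d.]+)")),
    ("Safari", re.compile(r"Safari/([\d.]+)")),
    ("Internet Explorer", re.compile(r"MSIE ([\d.]+)")),
]

OS_RULES = [
    ("Windows XP", re.compile(r"Windows NT 5\.1")),
    ("Windows Server 2003 / XP x64", re.compile(r"Windows NT 5\.2")),
    ("Windows 2000", re.compile(r"Windows NT 5\.0")),
    ("Windows CE", re.compile(r"Windows CE")),
    ("Mac OS X", re.compile(r"Mac OS X")),
]


def classify(ua: str) -> Tuple[str, Optional[str], str]:
    """Return (browser, version token, operating system) for one UA string."""
    browser, version = "Unknown", None
    for name, pattern in BROWSER_RULES:
        match = pattern.search(ua)
        if match:
            browser, version = name, match.group(1)
            break

    os_name = "Unknown"
    for name, pattern in OS_RULES:
        if pattern.search(ua):
            os_name = name
            break

    return browser, version, os_name


if __name__ == "__main__":
    sample = ("Mozilla/5.0 (Windows; U; Windows NT 5.2; en-GB; rv:1.7.10) "
              "Gecko/20050717 Firefox/1.0.6")
    print(classify(sample))
```

Anything that matches none of the rules, such as the NETIKUS.NET string, simply falls through as "Unknown", which is roughly where the hand-untangling has to begin.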
There are of course some caveats around the figures I'm about to talk about.
User agent strings aren't an exact science. Or rather, they ought to be, but in the real world they come out a right mess. I've done my best to untangle them, but I still ended up with a significant number of user agents that I could not identify properly. And that is before we get started on the corporate networks that use the UA string to broadcast their corporate branding to the world whilst masking their operating system. Or requests claiming to come from both Internet Explorer 6 and Internet Explorer 5.5. Or that claim to be from a particular Linux distribution and Windows 98 at the same time. Or the plain weird, like the inadvisably named KummClient from Hungary that proudly proclaims 'Linux rulez' to anyone like me dull enough to be delving through their logfiles.
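Contradictions like these can at least be flagged automatically before anyone tries to classify the string. The sketch below is one possible approach, not a description of how the figures here were produced; `conflicting_claims` is a hypothetical helper, and the example strings in the usage section are made up to mimic the kind of thing described above.

```python
import re
from typing import List


def conflicting_claims(ua: str) -> List[str]:
    """List the self-contradictions found in a single user agent string."""
    problems = []

    # More than one distinct "MSIE x.y" token means the string claims to be
    # several versions of Internet Explorer at once.
    ie_versions = set(re.findall(r"MSIE ([\d.]+)", ua))
    if len(ie_versions) > 1:
        problems.append("multiple IE versions: " + ", ".join(sorted(ie_versions)))

    # A crude substring check: the string names both Linux and Windows.
    if "Linux" in ua and "Windows" in ua:
        problems.append("claims both Linux and Windows")

    return problems


if __name__ == "__main__":
    # Invented strings, for illustration only.
    print(conflicting_claims("Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows 98)"))
    print(conflicting_claims("Mozilla/5.0 (X11; U; Linux i686; Windows 98)"))
```

Flagged strings still need a human decision about which claim, if either, to believe.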
User agent statistics on something as big as the BBC homepage could almost be the very definition of the long tail. The most popular user agent string - IE6 on Windows XP - clocked up nearly 6 million requests. I only counted user agents that had made more than 50 requests, but between 6 million and 50 requests there were nearly 11,000 different user agents to look at. Those user agents accounted for 95% of the reported traffic, but only around 1/3 of the stats report. I initially suspected that counting the whole of the tail was likely to increase the market share I derived for the quirkier set-ups. However, a random sample showed that a large proportion of the tail consisted of the most popular browsers and operating systems, with different installed toolbars or corporate network messages making each one a unique string.
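The effect of a cut-off like this is easy to check with a few lines of Python. This is a sketch under assumptions: `tail_coverage` is a hypothetical helper, the counts in the usage section are invented toy numbers rather than BBC data, and "1/3 of the stats report" is read here as the share of distinct user agent strings.

```python
from typing import Dict, Tuple


def tail_coverage(counts: Dict[str, int], min_requests: int) -> Tuple[float, float]:
    """Return (share of requests kept, share of distinct strings kept)
    when only user agents with more than min_requests requests are counted."""
    kept = {ua: n for ua, n in counts.items() if n > min_requests}
    traffic_share = sum(kept.values()) / sum(counts.values())
    string_share = len(kept) / len(counts)
    return traffic_share, string_share


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    toy_counts = {"ua-a": 600, "ua-b": 300, "ua-c": 60, "ua-d": 30, "ua-e": 10}
    traffic, strings = tail_coverage(toy_counts, min_requests=50)
    print(f"{traffic:.0%} of requests from {strings:.0%} of distinct strings")
```

In a long-tailed distribution the first figure stays high while the second drops sharply, which is the pattern described above.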
And I must stress again, these figures don't represent the breakdown of visitors to the BBC site as a whole. They are based on requests to the homepage alone, over the course of one week in September. Nevertheless I think they provide an interesting snapshot of web activity.
In total I've examined around 32 million requests to the BBC servers - although some of these have been discounted as 'unknowns' and some originate from crawlers and spiders.