Grub and grubby Looksmart

by Martin Belam, 19 April 2003

There has been much talk about Looksmart's acquisition of Grub, a distributed web crawling project, and their plans to incorporate its data into their very stale Wisenut search engine.

New Scientist picked up on the story, with a Danny Sullivan pullquote:

"I have more faith in companies that control their own crawl and index than I do in approaches that ask people to submit their own data"

And there is clearly an exploitable flaw in the Grub system, which is just inviting spam. Grub say in their own FAQ:

"If you run a website or host multiple websites, you would want to run the client because it will index your own content before it crawls other sites. Having your content auto-update into the search engines is a powerful motive to run the client"

Wired News also covered the story, commenting that usage had increased from 100 to 1,000 users very quickly, but that is clearly very small beer compared to something like SETI@home. They also got a great quote from Peter Norvig, director of search quality at Google. Grub are claiming that their approach will allow them to crawl all of the web all of the time. Affirming that size isn't everything in the search engine world, Norvig observes that

"It isn't a problem of computing resources but deciding what parts of the Web should be updated more frequently than others. I don't want more computers or bandwidth. I want more clues about which page to look at rather than another page. The problem for us is how do we direct the crawl, not do we have enough resources to get the crawl."

Grub also has a reputation for being a particularly badly behaved bot. Historically it has not dealt well with 403s and 404s, and it has ignored the robots meta-tag. The guys from Grub did show up on Webmasterworld to enter some sort of dialogue about bad-bot-fixes, and back in January this was the driver behind their robots refresh page, which apparently causes Grub to instantly update its copy of your robots.txt file. Realistically, if I change my robots.txt file, I don't expect to also have to go out and tell a robot I have changed it. Noticing the change is the robot's job; that is what the file is there for.
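To make the complaint concrete, here is a minimal sketch of what a well-behaved crawler might do, using Python's standard urllib.robotparser. It is not Grub's code. The user-agent string "grub-client", the example.com URLs, the one-day refresh window and the helper names are all assumptions made up for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Assumption: re-read robots.txt once our copy is more than a day old.
MAX_ROBOTS_AGE_SECONDS = 24 * 60 * 60

# Illustrative robots.txt; the user-agent name is a guess, not Grub's real one.
SAMPLE_ROBOTS_TXT = """\
User-agent: grub-client
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text; parse() also records when the rules were read."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def is_stale(rules, now=None):
    """True when the crawler should fetch robots.txt again by itself."""
    now = time.time() if now is None else now
    return now - rules.mtime() > MAX_ROBOTS_AGE_SECONDS


def should_drop(status):
    """403 and 404 mean 'go away' or 'not here': stop requesting the URL."""
    return status in (403, 404)


rules = load_rules(SAMPLE_ROBOTS_TXT)
blocked = rules.can_fetch("grub-client", "http://example.com/private/page.html")
allowed = rules.can_fetch("grub-client", "http://example.com/index.html")
```

The point of is_stale is that the refresh is the crawler's responsibility: the webmaster edits the file, and the bot picks up the change on its own schedule without being told.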

And I can't help observing that despite their work with the Zeal community, Looksmart doesn't exactly enjoy the best of reputations within the webmaster community - and so seems a natural home for a distributed crawling project that also has a poor reputation.

However, in discussion about this, a friend of mine on collective suggested the idea of a "socialist distributed network", where the computers connected to the internet actively help maintain its integrity. To an extent at ISP and distribution level this already happens. But it would be neat to download a screensaver that could actively exploit unused bandwidth and processing to improve the network for the benefit of all. Perhaps a one-size-fits-all protocol screensaver application where people could vote for which distributed computing project they would most like to help, and central servers would allocate the resources according to either subscription or demand.
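As a rough illustration of the "allocate according to demand" idea, the sketch below shares a fixed pool of volunteer machines between projects in proportion to the votes each one received. The project names, the vote counts and the allocate function are all hypothetical, and the largest-remainder rounding is just one reasonable choice, not something any real project is known to use.

```python
def allocate(votes, total_units):
    """Split total_units between projects in proportion to their votes.

    Uses the largest-remainder method so the shares are whole numbers
    that always add up to total_units.
    """
    total_votes = sum(votes.values())
    if total_votes == 0:
        return {name: 0 for name in votes}

    # Whole-number share for each project, plus what was rounded away.
    shares = {}
    remainders = {}
    for name, count in votes.items():
        shares[name], remainders[name] = divmod(count * total_units, total_votes)

    # Hand leftover units to the largest remainders (ties broken by name).
    leftover = total_units - sum(shares.values())
    ranked = sorted(votes, key=lambda name: (-remainders[name], name))
    for name in ranked[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical ballot from screensaver users.
votes = {"seti": 5, "crawl": 3, "folding": 2}
plan = allocate(votes, 7)
```

A subscription model would work the same way, with paid weights taking the place of the vote counts.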
