Grub and grubby Looksmart

Written by Martin Belam
Published 19 April, 2003


There has been much talk about Looksmart's acquisition of Grub, a distributed web crawling project, and their plans to incorporate its data into their very stale Wisenut search engine.

New Scientist picked up on the story, with a Danny Sullivan pullquote:

"I have more faith in companies that control their own crawl and index than I do in approaches that ask people to submit their own data"

And there is clearly an exploitable flaw in the Grub system, which is just inviting spam. Grub say in their own FAQ:

"If you run a website or host multiple websites, you would want to run the client because it will index your own content before it crawls other sites. Having your content auto-update into the search engines is a powerful motive to run the client"

Wired News also covered the story, commenting that usage had increased from 100 to 1,000 users very quickly, although that is clearly very small beer compared to something like SETI@home. They also got a great quote from Peter Norvig, director of search quality at Google. Grub are claiming that their approach will allow them to crawl all of the web all of the time. Affirming that size isn't everything in the search engine world, Norvig observes that

"It isn't a problem of computing resources but deciding what parts of the Web should be updated more frequently than others. I don't want more computers or bandwidth. I want more clues about which page to look at rather than another page. The problem for us is how do we direct the crawl, not do we have enough resources to get the crawl."
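To make Norvig's point concrete, here is a minimal sketch of "directing the crawl" on a fixed budget. It is not how Google or Grub schedule anything; the scoring rule, the function name crawl_order, and the sample figures are all assumptions made up for illustration.

```python
import heapq

def crawl_order(pages, budget):
    """Pick which pages to revisit first when only `budget` fetches are available.

    `pages` maps a URL to (hours_since_last_fetch, estimated_hours_between_changes).
    Both numbers are hypothetical inputs; a real crawler would have to estimate them.
    """
    # Score = how overdue a page is relative to how often it is thought to change.
    # heapq is a min-heap, so the score is negated to pop the most overdue page first.
    heap = [(-(age / interval), url) for url, (age, interval) in pages.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]

# Hypothetical sample data: a news front page, a weblog, and a static archive page.
sample = {
    "news": (6, 1),          # fetched 6 hours ago, changes roughly hourly
    "blog": (48, 24),        # fetched 2 days ago, changes roughly daily
    "archive": (720, 8760),  # fetched a month ago, changes roughly yearly
}
print(crawl_order(sample, budget=2))
```

With a budget of two fetches the archive page is skipped even though it has gone longest without a visit, which is the kind of clue-driven choice that extra computers and bandwidth cannot make for you.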

Grub also has a reputation for being a particularly badly behaved bot. Historically it has not dealt well with 403 and 404 responses, and it has ignored the robots meta tag. The guys from Grub did show up on Webmasterworld to enter into some sort of dialogue about bad-bot fixes, and back in January this was the driver behind their robots refresh page, which apparently causes Grub to instantly update its copy of your robots.txt file. Realistically, if I change my robots.txt file, I don't expect to have to go out and tell a robot that I have changed it. That is what the file is there for.
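For comparison, this is roughly what a polite bot is expected to do, sketched with Python's standard urllib.robotparser module. It is not Grub's code. The robots.txt content, the "grub-client" user-agent token, the one-day refresh age, and the helper names are assumptions for illustration, and the robots meta tag is left out to keep it short.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the "grub-client" token is an assumption.
ROBOTS_TXT = """\
User-agent: grub-client
Disallow: /

User-agent: *
Disallow: /private/
"""

def load_rules(robots_txt):
    """Parse robots.txt text; parse() also records when the rules were read."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def needs_refresh(parser, now, max_age_seconds=24 * 60 * 60):
    """The bot re-reads robots.txt on its own once its copy is older than max_age_seconds."""
    return now - parser.mtime() > max_age_seconds

def may_fetch(parser, user_agent, url, last_status=None):
    """Skip URLs that last answered 403 or 404, then apply the robots.txt rules."""
    if last_status in (403, 404):
        return False
    return parser.can_fetch(user_agent, url)

rules = load_rules(ROBOTS_TXT)
print(may_fetch(rules, "grub-client", "http://example.com/index.html"))
print(may_fetch(rules, "otherbot", "http://example.com/index.html"))
```

The point of needs_refresh is that the crawler checks the age of its own cached copy, so the site owner never has to visit a separate page to announce a change.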

And I can't help observing that despite their work with the Zeal community, Looksmart doesn't exactly enjoy the best of reputations within the webmaster community - and so seems a natural home for a distributed crawling project that also has a poor reputation.

However, in discussion about this, a friend of mine on collective suggested the idea of a "socialist distributed network", where the computers connected to the internet actively help maintain its integrity. To an extent at ISP and distribution level this already happens. But it would be neat to download a screensaver that could actively exploit unused bandwidth and processing to improve the network for the benefit of all. Perhaps a one-size-fits-all protocol screensaver application where people could vote for which distributed computing project they would most like to help, and central servers would allocate the resources according to either subscription or demand.
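As a rough sketch of how the central servers might split machines between projects according to votes, here is a simple proportional allocation. The function name allocate, the project names, and the vote counts are hypothetical, and the largest-remainder rounding is just one reasonable choice among several.

```python
def allocate(votes, machines):
    """Share `machines` between projects in proportion to their votes.

    Uses largest-remainder rounding so the shares always add up to `machines`.
    """
    total = sum(votes.values())
    if total == 0:
        return {project: 0 for project in votes}
    exact = {project: machines * count / total for project, count in votes.items()}
    shares = {project: int(value) for project, value in exact.items()}
    leftover = machines - sum(shares.values())
    # Hand the remaining machines to the projects with the largest fractional parts.
    by_remainder = sorted(votes, key=lambda p: exact[p] - shares[p], reverse=True)
    for project in by_remainder[:leftover]:
        shares[project] += 1
    return shares

# Hypothetical vote counts for three projects.
print(allocate({"seti": 5, "grub": 3, "other": 2}, machines=7))
```

Allocating by demand rather than by subscription would only change where the numbers in votes come from; the splitting step could stay the same.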
