Designing web-crawlers

Introduction

This document tries to give some guidelines for people who are designing web-crawlers.

Common sins

There are some things a spider developer should know to keep their creation from being blacklisted in webmasters' spider databases.

No brains about HTTP

Nothing is more aggravating for a webmaster than a client who doesn't understand "access denied".

Your spider should take the best possible care that all error situations are handled properly.

If the server says 403 FORBIDDEN, the spider shouldn't treat it as a temporary situation and retry in one minute.
The access block will probably stay in place until hell freezes over.

If the server says 406 NOT ACCEPTABLE, it's probably the result of content negotiation - for example a language choice the client's Accept headers can't satisfy. It doesn't go away by reloading the page.

And so on.
A spider that doesn't obey the HTTP specification is bound to get banned.
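
For illustration, here is a minimal Python sketch of the kind of status-code triage meant above; the status sets and verdict names are assumptions made for this example, not a complete policy.

    import urllib.request, urllib.error

    # Assumed classification for this sketch; a real crawler needs a fuller table.
    PERMANENT = {401, 403, 404, 406, 410}       # give up on the URL
    TRANSIENT = {408, 429, 500, 502, 503, 504}  # may retry much later, politely

    def fetch(url):
        """Fetch a URL and tell the scheduler what to do with it afterwards."""
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read(), "ok"
        except urllib.error.HTTPError as e:
            if e.code in PERMANENT:
                return None, "drop"            # a 403 today is a 403 tomorrow
            if e.code in TRANSIENT:
                return None, "retry-later"     # hours later, not one minute later
            return None, "drop"                # unknown code: err on the polite side
        except urllib.error.URLError:
            return None, "retry-later"         # DNS or connection trouble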

If you intend to crawl the World Wide Web, you must become an expert at the World Wide Web.

There are many things that must be considered; to mention just a few (a caching sketch follows this list):

  • Transient errors
  • Permanent errors
  • Redirections
  • Caching
  • Partial fetching
  • Virtual hosts
  • URL structure
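
As an illustration of the caching item above, here is a sketch of a conditional GET in Python; it assumes the Last-Modified and ETag values from an earlier fetch were stored somewhere.

    import urllib.request, urllib.error

    def refetch_if_changed(url, last_modified=None, etag=None):
        """Ask the server to send the body only if the page has changed
        since the previous crawl; a 304 answer means it has not."""
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        if etag:
            req.add_header("If-None-Match", etag)
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return (resp.read(),
                        resp.headers.get("Last-Modified"),
                        resp.headers.get("ETag"))
        except urllib.error.HTTPError as e:
            if e.code == 304:                  # Not Modified: keep the cached copy
                return None, last_modified, etag
            raise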


PS: See USPTO patent no. 6611835, which basically patents the idea that if a search engine encounters a 404 HTTP error code in the course of its spidering, it deletes the page from its index, and if it encounters a redirect (302), it updates its metadata accordingly.
What a wonderful, unique invention! Don't software patents truly promote creativity?

No brains about content

If the spider is meant to index only text, it shouldn't go loading hundreds of big .gz, .avi and similar files. Even if it loads only the first 10 kilobytes of each file, it's still disturbing to see somebody fetch the whole patch archive of a software development page only to ignore it completely.
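
One way to avoid that, sketched below, is to ask for the headers first and inspect the Content-Type before committing to a download. The set of acceptable types is an assumption for this example; note also that not every server handles HEAD requests well, so a fallback may be needed.

    import urllib.request

    TEXTUAL = {"text/html", "text/plain", "application/xhtml+xml"}

    def looks_textual(url):
        """Issue a HEAD request and check the Content-Type, so that .gz,
        .avi and similar files never get downloaded at all."""
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=30) as resp:
            ctype = resp.headers.get("Content-Type", "")
            return ctype.split(";")[0].strip().lower() in TEXTUAL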

No proper contact info

A good user-agent string should never change (except for the version number), and it should contain at least the following, most important first (a sample header follows the list):

  • A URI pointing to a web page that explains the spider's purpose, name, behavior, etc., in English.
Note: An email address is not enough. People are very wary of sending email to strangers. After all, there's no knowing whether the address actually belongs to a spammer who harvests webmasters' email addresses - or whether the address works at all.
Webmasters also aren't eager to wait for an answer. They want to decide immediately whether the spider should be banned or not.
  • The agent name - the clearer, the better.
No nicknames. Don't force webmasters to guess which of the words the robot looks for in robots.txt.
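
A sketch of what this can look like in practice; the crawler name and info-page URL below are placeholders, not a recommendation of an actual identity.

    import urllib.request

    # Placeholder name and info page; substitute your spider's real ones.
    USER_AGENT = "ExampleCrawler/1.2 (+http://example.org/crawler-info.html)"

    def fetch_identified(url):
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()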

In particular, the spider should never masquerade as something it is not. Yes, this includes the "Mozilla-compatibles".

It's not a good idea to leave a nondescript "ruby::urllib" or "perl::crawler-1.5" there either. Such signatures don't answer the question "why are you doing this?".

No advertisements, no slogans!
Advertisements may be placed on the spider's explanation page, though even that is not recommended. If you slap webmasters in the face (they're always monitoring their access logs) with your egotistic words, everything you represent will simply be boycotted.

Too frequent attempts

If the spider attempts to load pages too often, it will be banned. Requests from all of your spiders in total shouldn't arrive more often than once every 10 seconds; a 20-second interval is the smallest worth considering.
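
A minimal throttling sketch, assuming one shared Throttle object per crawler; the 20-second default matches the figure above.

    import time

    class Throttle:
        """Space successive requests to the same host at least `delay`
        seconds apart."""
        def __init__(self, delay=20.0):
            self.delay = delay
            self.last = {}                 # host -> time of the previous request

        def wait(self, host):
            now = time.monotonic()
            prev = self.last.get(host)
            if prev is not None and now < prev + self.delay:
                time.sleep(prev + self.delay - now)
            self.last[host] = time.monotonic()

    # Usage: throttle.wait(urlparse(url).hostname) before every request.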

Simultaneous connections

Spiders rarely do this, but download programs do.
Simultaneous connections are bad.
TCP/IP congestion control works like a voting system: when you open multiple connections to the same server, it's as if you had cast multiple votes for bandwidth. It is hostile towards other users.
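
If the crawler is multi-threaded anyway, one way to guarantee a single connection per server is a per-host lock. This is only a sketch under that assumption; fetch_serialized and do_fetch are hypothetical names.

    import threading
    from collections import defaultdict

    # One lock per host: a worker must hold the host's lock while talking to it,
    # so the crawler never has two connections open to the same server.
    _host_locks = defaultdict(threading.Lock)

    def fetch_serialized(host, do_fetch):
        """do_fetch is whatever function actually performs the request."""
        with _host_locks[host]:
            return do_fetch()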

Robot standards

There are some standards about how crawlers should work.

All robots should read the /robots.txt before doing anything else, and obey it.

The robots meta tag is an instruction, given by well-meaning webmasters, that tells robots how to handle a page.
A well-behaved spider obeys it. Obeying it may significantly improve the quality of the page collection the spider indexes.
Robots that fail in these respects are often suspected of being greedy spammer mail
harvesters or blog-spammers/wiki-spammers, and are thus extremely unwelcome.
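
Python's standard library already covers the /robots.txt half of this; a sketch follows, with the crawler name and URLs as placeholders. The robots meta tag, in contrast, has to be honoured while parsing the fetched HTML itself.

    import urllib.robotparser

    AGENT_NAME = "ExampleCrawler"          # placeholder; use your spider's real name

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.org/robots.txt")
    rp.read()

    if rp.can_fetch(AGENT_NAME, "http://example.org/private/page.html"):
        print("allowed - go ahead and fetch it")
    else:
        print("disallowed - skip it, and don't peek anyway")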

Too greedy / obscure HTML parsing

Greedy spiders are easily recognized by one fact:

  • They want everything.

Greedy spiders don't follow robot exclusions.
Greedy spiders reload and reload when banned.
Greedy spiders follow everything that looks like a URL, even if it's in a <!-- comment --> which is never rendered.

Yes, if the spider is designed to parse HTML content, it should do so properly!
In HTML, links are found in the HREF attribute of A tags.
URLs found in a FORM's ACTION, an INPUT's VALUE, plain text content or HTML comments are not to be followed!

If the webmaster wanted links in comments or plain text to be followed, he would have made them A HREF links.
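
A sketch of such a parser using Python's standard html.parser module; because handle_comment is never overridden, URLs inside comments are simply never seen.

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect only the links a reader could actually click: the HREF
        attributes of A tags. FORM ACTIONs, INPUT VALUEs, plain text and
        comments are never looked at."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    # Usage:
    #   parser = LinkExtractor("http://example.org/page.html")
    #   parser.feed(page_html)
    #   # parser.links now holds the followable URLs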

Spammers are greedy. They want email addresses. They want to insert their dubious advertisements into every web page in the world, even pages that no link points to. Don't look like a spammer.

Me

I'm just a webmaster at http://bisqwit.iki.fi/ and its subsites. My email address is 6biNsqXzwiAt@uQikhg0i.Nfi.

Last edited at: 2006-06-30T19:16:21+00:00