This document gives some guidelines for people
who are designing web crawlers.
There are some things a spider developer should know
to avoid their toy ending up blacklisted in webmasters'
spider databases.
No brains about HTTP
Nothing is more aggravating for a webmaster than a client who
doesn't understand "access denied".
Your spider should take the best possible care that all error
situations are handled properly.
If the server says
403 FORBIDDEN (access denied), the spider shouldn't
treat it as a temporary situation and retry in one minute.
The access block will probably stay there until hell freezes over.
If the server says
406 NOT ACCEPTABLE, it is probably the result of content negotiation,
for example a language choice that the client's Accept headers cannot satisfy.
It doesn't go away by reloading the page.
And so on.
A spider that doesn't obey HTTP is bound to get banned.
If you intend to crawl the World Wide Web,
you must become an expert at the World Wide Web.
There are many things to be considered; to mention
just a few (see the sketch after this list):
- Transient errors
- Permanent errors
- Redirections
- Caching
- Partial fetching
- Virtual hosts
- URL structure
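For instance, a fetch routine might classify server responses roughly as
follows, so that permanent errors are never retried. This is a minimal
sketch using Python's standard urllib; the status-code sets, the bot name
and the URL are illustrative assumptions, not a prescription.

    import urllib.request
    import urllib.error

    # Codes after which the URL should be dropped for good (illustrative set).
    PERMANENT = {400, 401, 403, 404, 406, 410}
    # Codes worth retrying much later, with a generous back-off.
    TRANSIENT = {408, 429, 500, 502, 503, 504}

    def fetch(url):
        """Return ('ok', body), ('moved', new_url), ('drop', code) or ('retry', code)."""
        req = urllib.request.Request(url, headers={
            "User-Agent": "ExampleBot/1.0 (+https://example.com/bot.html)"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                # urlopen follows redirects itself; geturl() is the final URL.
                if resp.geturl() != url:
                    return ("moved", resp.geturl())
                return ("ok", resp.read())
        except urllib.error.HTTPError as e:
            if e.code in PERMANENT:
                return ("drop", e.code)    # e.g. 403 will not go away in a minute
            if e.code in TRANSIENT:
                return ("retry", e.code)   # try again much later
            return ("drop", e.code)        # when unsure, err on the polite side
        except urllib.error.URLError:
            return ("retry", None)         # network trouble is transient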
PS: See USPTO patent no.
6611835,
which basically patents that if a search engine
encounters a 404 HTTP error code in the course of its crawling, it deletes the
page from its index, and if it encounters a redirect (302), it updates its
metadata accordingly.
What a wonderful, unique invention!
Don't software patents truly promote creativity?
No brains about content
If the spider is to index only text, it shouldn't start
downloading hundreds of big .gz, .avi and similar files.
Even if it loads only the first 10 kilobytes of each file,
it is still disturbing to watch somebody fetch the whole patch
archive from a software development page, only to ignore it completely.
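One way to avoid that, sketched below with Python's standard urllib (the
extension list and the bot name are illustrative assumptions), is to issue
a HEAD request first and look at the Content-Type, with a crude extension
check as a shortcut:

    import urllib.request
    import urllib.error
    from urllib.parse import urlparse

    SKIP_EXTENSIONS = (".gz", ".tgz", ".zip", ".avi", ".mp4", ".iso", ".exe")

    def looks_like_text(url):
        """Decide whether a URL is worth downloading for a text-only index."""
        if urlparse(url).path.lower().endswith(SKIP_EXTENSIONS):
            return False
        req = urllib.request.Request(url, method="HEAD", headers={
            "User-Agent": "ExampleBot/1.0 (+https://example.com/bot.html)"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                ctype = resp.headers.get("Content-Type", "")
        except urllib.error.URLError:
            return False
        return ctype.startswith("text/") or "html" in ctype or "xml" in ctype

If the server rejects HEAD requests, the URLError branch simply skips the
page, erring on the polite side.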
No proper contact info
A good user-agent string should never change (except for the version id),
and it should contain at least the following, most important first
(an example is given at the end of this section):
- A URI pointing to a webpage explaining the spider's purpose, name, behavior etc. in English.
Note: An email address is not enough. People are very wary of
sending email to strangers. After all, there's no knowing whether
the address actually belongs to a spammer that harvests
webmasters' email addresses - or whether the address works at all.
Webmasters also aren't eager to wait for an answer. They want to
decide immediately whether the spider should be banned or not.
- The agent name - the clearer, the better.
No nicknames. Don't force webmasters to guess
which of the words the robot looks for in robots.txt.
In particular, the spider should
never masquerade as something it is not.
Yes, this includes the "Mozilla-compatibles".
It's not a good idea to leave a nondescript "ruby::urllib"
or a "perl::crawler-1.5" there either.
Such signatures don't answer the question "why are you doing this".
No advertisements, no slogans!
Such advertisements may be given on the spider's explanation page, although
even that is not recommended. If you slap webmasters in the face (they are
always monitoring their access logs) with your egotistic words,
everything you represent will simply end up boycotted.
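As an illustration, a user-agent string along these lines (the name and URL
are made up) gives the webmaster everything needed to decide at a glance,
and a stable token to match in robots.txt:

    ExampleBot/2.1 (+https://example.com/bot.html)

The leading token is the name to use in robots.txt; the parenthesized URI
should lead to the explanation page described above.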
Too frequent attempts
If the spider attempts to load pages too often, it will be banned.
Requests from all of your spider's instances combined shouldn't hit
the same server more often than once every 10 seconds. A delay of
20 seconds between requests is the smallest one worth considering.
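A per-host delay needs nothing more than a table of timestamps. The sketch
below (Python; the function name is made up) enforces the 20-second spacing
suggested above:

    import time

    CRAWL_DELAY = 20.0      # seconds between requests to the same host
    _last_request = {}      # host -> time of the previous request

    def wait_for_turn(host):
        """Sleep until at least CRAWL_DELAY seconds have passed for this host."""
        last = _last_request.get(host)
        if last is not None:
            wait = last + CRAWL_DELAY - time.monotonic()
            if wait > 0:
                time.sleep(wait)
        _last_request[host] = time.monotonic()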
Simultaneous connections
Spiders rarely do this, but download programs do.
Simultaneous connections are bad.
TCP congestion control shares bandwidth roughly per connection,
so it works like a voting system: when you have multiple
connections open to the same server, it's as if you had cast
multiple votes. It is hostile toward other users.
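If the crawler runs several threads, one blunt safeguard (a sketch assuming
Python threads; the function names are made up) is a per-host lock, so that
two requests to the same server can never be in flight at once:

    import threading

    _host_locks = {}
    _registry_lock = threading.Lock()

    def _lock_for(host):
        """Return the single lock that guards all requests to this host."""
        with _registry_lock:
            return _host_locks.setdefault(host, threading.Lock())

    def fetch_exclusively(host, do_request):
        """Run do_request() while holding the host's lock, so the crawler
        never has two simultaneous connections open to one server."""
        with _lock_for(host):
            return do_request()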
Robot standards
There are some standards for how crawlers should behave.
All robots should read /robots.txt before
doing anything else, and obey it.
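Python's standard library can already read and interpret the file; a minimal
sketch (the host and bot name are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # stand-in for the target site
    rp.read()                                      # fetches and parses the file

    if rp.can_fetch("ExampleBot", "https://example.com/some/page.html"):
        ...  # allowed: go ahead and fetch the page
    else:
        ...  # disallowed: leave the page alone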
The robots meta-tag is an instruction, given by well-meaning webmasters,
telling robots how to handle their pages.
A well-behaved spider obeys it. Obeying it may significantly
improve the quality of the page collection the spider indexes.
Robots that fail in these respects are often suspected of being greedy
spammer mail harvesters or blog-spammers/wiki-spammers, and are thus
extremely unwelcome.
Too greedy / obscure HTML parsing
Greedy spiders are easy to recognize:
Greedy spiders don't follow robot exclusions.
Greedy spiders reload and reload when banned.
Greedy spiders follow
everything that looks like a URL,
even if it's in a <!-- comment --> that is never rendered.
Yes, if the spider is designed to parse HTML content, it should do it properly!
In HTML, links are found in A tags, in the HREF attribute.
URLs found in a FORM's ACTION, an INPUT's VALUE, plain text content,
or HTML comments are not to be followed!
If the webmaster had wanted links in comments or text to be followed,
he would have put an A HREF there.
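A sketch of link extraction along these lines, using Python's standard
html.parser (the class and function names are made up). The parser reports
comments separately from tags, so URLs inside <!-- comments --> are never
mistaken for links, and FORM ACTIONs and INPUT VALUEs are simply ignored;
the sketch also honors a "nofollow" robots meta-tag:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect only the links a browser would render as links: <a href>."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []
            self.nofollow = False   # set when a robots meta-tag says "nofollow"

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and attrs.get("href"):
                self.links.append(urljoin(self.base_url, attrs["href"]))
            elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                if "nofollow" in (attrs.get("content") or "").lower():
                    self.nofollow = True
        # Comments, FORM ACTIONs, INPUT VALUEs and plain text never show up
        # here as <a> tags, so they are never followed.

    def extract_links(base_url, html_text):
        parser = LinkExtractor(base_url)
        parser.feed(html_text)
        return [] if parser.nofollow else parser.links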
Spammers are greedy. They want email addresses. They want to insert their dubious
advertisements into every web page in the world, even if there is no link to them.
Don't look like a spammer.