Paged Queue

The Paged Queue now tunes the diversity automatically. It works out roughly how many pages per second were crawled since the temporary page was last processed, and sets the diversity to the reciprocal of that rate, so that no one domain should (in theory) be crawled more than once a second. I know this is flawed, as it is based on past data, but it does the job for now.

For example:

If the average is 10 pages/sec, the diversity will be 1/10 (0.1, or 10%), meaning that at most 1 in 10 entries in the page will be from the same domain.
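
The calculation can be sketched roughly as below. This is only an illustration of the arithmetic, not the actual Paged Queue code: the function names, the fallback when there is no crawl history, and the cap at 1.0 for rates below one page a second are all assumptions of mine.

```python
import math


def tune_diversity(pages_crawled: int, elapsed_seconds: float) -> float:
    """Return the diversity as the reciprocal of the observed crawl rate.

    pages_crawled and elapsed_seconds describe the period since the
    temporary page was last processed (hypothetical inputs).
    """
    if pages_crawled <= 0 or elapsed_seconds <= 0:
        # Assumption: with no usable history, apply no restriction.
        return 1.0
    rate = pages_crawled / elapsed_seconds  # pages per second
    # Assumption: cap at 1.0, since a share above 100% has no meaning
    # when the crawler is running slower than one page per second.
    return min(1.0, 1.0 / rate)


def max_entries_per_domain(page_size: int, diversity: float) -> int:
    """Most entries a single domain may occupy in a page of page_size."""
    # The small epsilon guards against floating-point results such as
    # 2.9999999 being floored to 2; at least one entry is always allowed.
    return max(1, math.floor(page_size * diversity + 1e-9))


if __name__ == "__main__":
    diversity = tune_diversity(pages_crawled=600, elapsed_seconds=60)
    print(diversity)
    print(max_entries_per_domain(page_size=1000, diversity=diversity))
```

With 600 pages crawled in 60 seconds the rate is 10 pages/sec, which gives the 0.1 diversity from the example above. Because the rate comes from the previous period, a sudden change in crawl speed will leave the diversity out of date until the next recalculation.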
