TCP, Paged Q and a Crawl

  • The main code of the paged queue has now been finished off
  • The crawler is now multi-threaded, by simply running the existing code many times on many threads
  • The TCP server has had to be modified: each client was being assigned its own thread
    • Now each request is assigned a thread only for the duration of that request
    • The change was needed because 21 machines, each with 20 crawling threads, opened up to 420 connections to the services
    • The GCC imposed thread limit on my cluster is ~300, so one thread per client could not cover every connection
  • Another depth 5 crawl was done, which took 2 hrs and covered 413,387 pages.
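
The thread-per-request change can be illustrated with a small simulation. This is only a sketch in Python, not the server's actual code: the names `handle_request`, `serve` and `peak_workers` and the pool size of 64 are assumptions made for illustration, and no real sockets are involved. It shows that 420 client connections can be served while the number of threads busy at once stays under a fixed cap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MACHINES = 21              # from the post
THREADS_PER_MACHINE = 20   # from the post
THREAD_LIMIT = 300         # approximate limit quoted in the post
CONNECTIONS = MACHINES * THREADS_PER_MACHINE  # 420 open connections

POOL_SIZE = 64  # hypothetical cap, chosen to sit well under THREAD_LIMIT

_lock = threading.Lock()
_active = 0  # handlers running right now
_peak = 0    # highest value _active has reached


def handle_request(conn_id: int, payload: str) -> str:
    """Serve one request; the worker thread is released as soon as this returns."""
    global _active, _peak
    with _lock:
        _active += 1
        _peak = max(_peak, _active)
    try:
        return f"conn {conn_id}: {payload.upper()}"
    finally:
        with _lock:
            _active -= 1


def serve(requests, pool_size: int = POOL_SIZE):
    """Run each (conn_id, payload) request on a bounded pool of worker threads."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(handle_request, c, p) for c, p in requests]
        return [f.result() for f in futures]


def peak_workers() -> int:
    """Highest number of handlers that were running at the same time."""
    with _lock:
        return _peak


if __name__ == "__main__":
    replies = serve([(conn, "ping") for conn in range(CONNECTIONS)])
    print(len(replies), "requests served, peak busy threads:", peak_workers())
```

With one thread per client the server would need 420 threads, which is more than the ~300 available. With one thread per request the thread count is bounded by the pool size, however many connections are open.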
