Crawling and parsing!

After a few small teething issues the crawler now works on the new cluster pretty much as it did before.

For the first time I have run into an issue I suspected would come up, having seen it while working on an assignment with a friend. Sending an Accept header in an HTTP request is often honoured by Microsoft’s IIS but ignored by Apache. So even though the crawler only asks for text/html pages, it gets back all sorts of things from Linux servers (images, zips, PDFs and more). Oddly, given that Apache ignores Accept, both servers do send a Content-Type response header, so I am now checking that as well and not parsing any pages that don’t appear to be text/html.
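The Content-Type check might be sketched something like this (the function name and the decision to also accept XHTML are my own illustrative choices, not necessarily what the crawler does):

```python
def is_parseable_html(content_type_header):
    """Return True only if the response claims to be HTML.

    Content-Type often carries parameters, e.g. "text/html; charset=utf-8",
    so compare only the media type itself, case-insensitively.
    """
    if not content_type_header:
        return False
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

The split on `;` matters: a naive equality test against `"text/html"` would reject perfectly good responses that include a charset parameter.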

I have also considered scanning the page content itself to determine whether it is parseable, by checking for traditionally binary-only characters (i.e. bytes outside the printable ASCII range), though I’m not sure how reliable this would be.
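A minimal sketch of that heuristic, assuming we sample the first kilobyte and treat a NUL byte or a high proportion of non-text bytes as a sign of binary content (the threshold of 30% is an arbitrary guess, not something from the crawler):

```python
def looks_binary(data, sample_size=1024):
    """Heuristic: flag content as binary if a leading sample contains a
    NUL byte or too many bytes outside the printable ASCII range."""
    sample = data[:sample_size]
    if not sample:
        return False
    if b"\x00" in sample:
        return True
    # Printable ASCII plus common whitespace counts as "text".
    text_chars = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    non_text = sum(1 for b in sample if b not in text_chars)
    return non_text / len(sample) > 0.30
```

One obvious weakness is that legitimate HTML in a non-ASCII encoding (UTF-8 pages full of multibyte characters, say) would trip this check, which may be why it feels unreliable.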

The crawler is now parsing using pyRDFa! However, it seems to have a memory leak somewhere: after about 10 minutes of crawling, one process was using 500+ MB of RAM. This causes Linux to start paging, which slows the whole system down greatly…
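To pin down a leak like this it helps to log the process’s own memory usage as it crawls. A stdlib-only sketch (the helper name is mine; on Linux `ru_maxrss` is reported in kilobytes, on macOS in bytes):

```python
import resource
import sys

def rss_megabytes():
    """Peak resident set size of the current process, in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024  # macOS reports bytes, not kilobytes
    return rss / 1024.0
```

Printing this every few hundred pages would show whether memory grows steadily (a leak) or in jumps tied to particular pages.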
