Crawling and parsing!

April 13th, 2010

After a few small teething issues, the crawler now works on the new cluster pretty much as it did before.

For the first time I have noticed an issue that I suspected would come up, having run into it before while working on an assignment with a friend: the Accept header sent in an HTTP request is often honoured by Microsoft’s IIS but ignored by Apache. So although the crawler only asks for text/html pages, it gets anything from a Linux server (images, zips, PDFs and all sorts). Strangely (for Apache), both servers do send a Content-Type response header, which I am now also using to check the content type, and I no longer parse pages that don’t appear to be text/html.
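
The check itself is simple; a sketch of what I mean (the function name is mine, and accepting application/xhtml+xml as well is my own choice — the important part is stripping the charset parameter before comparing):

```python
def is_html(content_type):
    """Return True if a Content-Type response header indicates an HTML page."""
    if not content_type:
        return False
    # "text/html; charset=UTF-8" -> "text/html"
    mime = content_type.split(';', 1)[0].strip().lower()
    return mime in ('text/html', 'application/xhtml+xml')
```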

I have also considered scanning page content to determine whether it is parseable, by checking for traditionally binary-only characters (i.e. bytes outside the ASCII range), though I’m not sure how reliable this would be.
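
One plausible heuristic along those lines (the threshold and sample size are guesses, and note it would flag legitimate non-ASCII text such as UTF-8 pages, which is exactly the reliability worry):

```python
def looks_binary(data, threshold=0.05, sample=4096):
    """Guess whether a response body is binary rather than text.

    Checks the first `sample` bytes: a NUL byte, or too many bytes
    outside the printable-ASCII-plus-whitespace range, means binary.
    """
    chunk = data[:sample]
    if not chunk:
        return False
    if b'\x00' in chunk:
        return True
    text_bytes = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}  # printable + tab/LF/CR
    suspect = sum(1 for b in chunk if b not in text_bytes)
    return suspect / len(chunk) > threshold
```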

The crawler is now parsing using pyRDFa! However, it does seem to have a memory leak somewhere: after about 10 minutes of crawling, one process was using 500+ MB of RAM. This causes Linux to start paging, which slows the system down greatly…

Starting again…

April 12th, 2010

It took almost 6 hours, but the servers are now reinstalled and stable, and the network switch has been replaced.

The loading process worked fine if the machine had only 1 NIC, which most did, but the two with 2 NICs had to be done semi-manually.

22 of the 24 servers are built from the same components, and they started crashing every 10 minutes: the kernel tries to turn the screen off to save power, but with no screen attached the system freezes.

Eventually a solution was found:

An(other) option is to disable the framebuffer support by adding the nomodeset kernel option in /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"

The box has been up for more than an hour since disconnecting the display now.


Installer System

April 11th, 2010

With the need to reinstall, I made a new installer system which, as before, automates the install using PXE and Kickstart, but also tracks downloads from each client so that progress can be estimated.

The Kickstart files are now generated from a database meaning the servers can have predefined roles assigned to them which in turn selects specific software to install.
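
The role-to-packages step might look something like this (the role names, package lists and section layout are made up for illustration — the real mapping lives in the database):

```python
# Hypothetical role -> package mapping; in practice this comes from the database.
ROLE_PACKAGES = {
    'crawler': ['gcc', 'g++', 'python-rdflib'],
    'storage': ['mfs-chunkserver'],
}

def packages_section(role):
    """Render the %packages section of a Kickstart file for a given role."""
    pkgs = ['@core'] + sorted(ROLE_PACKAGES.get(role, []))
    return '%packages\n' + '\n'.join(pkgs) + '\n%end\n'
```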

So far it works well; however, it has only been tested on a small ‘cluster’ of 4 virtual machines.

Ubuntu FTW

April 6th, 2010

The Ubuntu system image is almost ready; by default it installs:

  • Minimal base system
  • Removes OpenOffice (for some reason it is included even in a minimal install without a desktop environment)
  • gcc/g++
  • rdflib
  • pyRdfa
  • MFS
  • Boost

Unfortunately the network switch linking the cluster seems to be on its last legs. Under load it resets frequently, taking the cluster down for several minutes at a time – very annoying!

Building a repository

April 5th, 2010

Unlike many distributions’, the Ubuntu repository is arranged so that old files are mixed in with current ones, making it hard to download just the packages for the latest release and build a repository from them. To combat this I have made a little script which acts as a proxy: it forwards requests to the Ubuntu archive server and caches the results for subsequent requests.


Slow going

April 4th, 2010

The Ubuntu image is making progress, but testing is slow: although reinstalling it on a virtual machine only takes a few minutes, it all adds up when I have to test new post scripts.

Consolidation, RRD and some testing

April 3rd, 2010

I previously had a universal share called “share”, exported from one machine and mounted by the rest. With MFS set up, I have now removed it and copied the contents to MFS, remounting it in the same place, so it is now decentralized and redundant.

Afterwards I had to rebuild RRDtool with --disable-mmap, as it was getting mmap-related errors on MFS-mounted file systems. This is most likely due to CentOS using old libraries.

A test crawl was run to make sure everything was still working… depth 5, 300k pages, 55 minutes.

I do need to replace CentOS with Ubuntu for all the latest pyRdfa code to work; a system image is currently being created.

The pyRdfa-to-Sesame process now correctly URL-encodes the baseURI and context; however, I cannot test this in the crawler at the moment, as pyRdfa runs very badly on CentOS when using lax mode (some HTML5 issues).

CentOS will be gone soon – it is too out of date for my needs.

Sesame and RDFa

April 2nd, 2010

After much playing around I can now insert RDFa generated by pyRdfa into a Sesame triplestore. I modified my download_file library to allow it to PUT data too, so now the verb can be changed and a request body can be specified.

Many pages, when fed through pyRdfa, need lax parsing turned on (-l), which in turn is best used with warnings hidden (-w), meaning my final set of options is -xwl, which is essentially what the W3C uses by default.

RDFa is currently extracted using pyRdfa to RDF/XML then sent to Sesame.

Sesame is set up on Tomcat, and storing data on the MFS file system.


March 31st, 2010

A custom implementation of it is in my /usr/bin/ as parseRDFa.

My lib_rdfa in C++ takes a page, saves it in /tmp, calls parseRDFa, and then reads the result from the standard output stream. Errors are caught in the Python and reported nicely.

Sesame install

  • Install Tomcat 5+
  • Edit /etc/default/tomcat
    • Use java security, NOT tomcat
    • Make or set the default path for sesame (+permissions)


March 30th, 2010

Every time I tweak one area to improve performance, 10 new issues arise.

At this point it is time to focus on the RDFa element – the crawler will be limited to a sensible number of threads to avoid crashing services and performance related work will be put on hold – for now…

I will be using pyRdfa (also known as the W3C Distiller) to parse RDFa and export RDF/XML, Turtle or N-Triples and I will be using the Sesame triplestore.

pyRdfa Install (Ubuntu, will hopefully be easy to port to CentOS)

  • Download pyRDFa from W3C
  • Install python-setuptools (apt-get)
  • Install python-html5lib (apt-get)
  • Install python-dev (apt-get)
  • easy_install -U "rdflib==2.4.2"
  • Run " install" after extracting pyRdfa
  • Remove next_rdfa_version from example code to get around weird bug in