Every time I tweak one area to improve performance, ten new issues arise.

At this point it is time to focus on the RDFa side of things. The crawler will be limited to a sensible number of threads to avoid crashing services, and performance-related work will be put on hold, for now…
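As a rough illustration of what capping the crawler's threads could look like, here is a minimal sketch using Python 3's standard library. It is not the actual crawler code: `fetch`, `crawl` and the `MAX_WORKERS` value of 4 are all made-up placeholders, and the real limit would depend on what the crawled services can tolerate.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed cap, chosen only for illustration; tune it to the target services.
MAX_WORKERS = 4


def fetch(url):
    # Hypothetical stand-in for the crawler's real fetch-and-parse step.
    # It does no network I/O; it just returns the URL and its length.
    return (url, len(url))


def crawl(urls, max_workers=MAX_WORKERS):
    # The pool runs at most max_workers fetches at a time; the remaining
    # URLs wait in the executor's internal queue until a worker is free.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs.
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    for url, size in crawl(["http://example.org/a", "http://example.org/b"]):
        print(url, size)
```

A fixed-size pool keeps the limit in one place, so it can be raised again later once performance work resumes.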

I will be using pyRdfa (the library behind the W3C RDFa Distiller) to parse RDFa and export RDF/XML, Turtle or N-Triples, and the Sesame triplestore to hold the resulting triples.
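To make the second half of that pipeline a little more concrete, here is a hedged sketch of how RDF/XML output might be pushed into Sesame over HTTP. It assumes a Sesame 2 server, whose HTTP protocol accepts a POST of RDF data to a repository's `statements` resource. The server URL, the repository ID `crawl` and the function name `build_upload_request` are all hypothetical, and the sketch is written for Python 3 rather than the Python 2 setup described in this post.

```python
from urllib.request import Request


def build_upload_request(server_url, repository_id, rdf_xml):
    # Build (but do not send) a POST that adds RDF/XML to a repository.
    # The path layout is an assumption based on the Sesame 2 HTTP protocol:
    #   <server>/repositories/<id>/statements
    url = "{}/repositories/{}/statements".format(
        server_url.rstrip("/"), repository_id
    )
    return Request(
        url,
        data=rdf_xml.encode("utf-8"),
        headers={"Content-Type": "application/rdf+xml"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical server location and repository name.
    req = build_upload_request(
        "http://localhost:8080/openrdf-sesame",
        "crawl",
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>',
    )
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without a server.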

pyRdfa Install (Ubuntu; hopefully easy to port to CentOS)

  • Download pyRdfa from the W3C
  • Install python-setuptools (apt-get)
  • Install python-html5lib (apt-get)
  • Install python-dev (apt-get)
  • easy_install -U "rdflib==2.4.2"
  • Extract pyRdfa, then run "python setup.py install" in the extracted directory
  • Remove next_rdfa_version from the example code to work around an odd bug in localRDFa.py

