Crawl before you can walk…

February 26th, 2010

Today a crawl was completed!

It was set with a depth of 5, meaning it crawled five levels deep from the seed page in a breadth-first manner.

It took 10 hours with one single-threaded crawler and discovered just over 5,000 pages.
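
The depth-limited, breadth-first traversal can be sketched roughly as follows. This is an illustration, not the project's actual code: crawl_bfs and the fetch_links callback are invented names, and the real crawler fetches pages over HTTP rather than calling a function.

```cpp
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Depth-limited BFS: visit the seed at depth 0, expand each page's links
// until max_depth is reached. fetch_links stands in for "download the
// page and extract its URLs".
std::vector<std::string> crawl_bfs(
    const std::string& seed, int max_depth,
    const std::function<std::vector<std::string>(const std::string&)>& fetch_links)
{
    std::vector<std::string> visited;
    std::set<std::string> seen;
    seen.insert(seed);
    std::queue<std::pair<std::string, int> > frontier; // URL and its depth
    frontier.push(std::make_pair(seed, 0));

    while (!frontier.empty()) {
        std::pair<std::string, int> cur = frontier.front();
        frontier.pop();
        visited.push_back(cur.first);
        if (cur.second == max_depth) continue;        // do not expand further
        std::vector<std::string> links = fetch_links(cur.first);
        for (std::size_t i = 0; i < links.size(); ++i)
            if (seen.insert(links[i]).second)         // only enqueue unseen URLs
                frontier.push(std::make_pair(links[i], cur.second + 1));
    }
    return visited;
}
```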

Issues downloading files

February 25th, 2010

More intensive testing of the robots service and the crawler has surfaced a few issues.

They are mainly caused by the time-out mechanism. When the DownloadFile class is destroyed its mutexes are destroyed too, but some of them are still locked at that point, which under boost raises a SIGABRT.

The class now ensures they are all unlocked before being disposed.
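
A minimal sketch of the fix, using std::thread and std::mutex as stand-ins for their boost equivalents; the real DownloadFile internals are not shown in the post, so the members here are assumed. The destructor stops and joins the time-out thread before the mutex goes out of scope, and the thread only ever holds the mutex through a scoped RAII guard, so nothing can still be locked when the mutex is destroyed.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

class DownloadFile { // name from the post; the members are assumptions
public:
    ~DownloadFile() {
        stop_ = true;                    // ask the time-out thread to finish
        if (timeout_thread_.joinable())
            timeout_thread_.join();      // it releases data_mutex_ on exit
        // only now is it safe for data_mutex_ to be destroyed
    }

    void start() {
        timeout_thread_ = std::thread([this] {
            while (!stop_) {
                {
                    // RAII guard: the lock is never left held across
                    // an iteration, let alone across destruction
                    std::lock_guard<std::mutex> lock(data_mutex_);
                    // ... check elapsed time, cancel the download if expired ...
                }
                ++ticks_;
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });
    }

    int ticks() const { return ticks_; } // how many checks have run

private:
    std::atomic<bool> stop_{false};
    std::atomic<int> ticks_{0};
    std::thread timeout_thread_;
    std::mutex data_mutex_;
};
```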

More services…

February 24th, 2010

Over the course of development, several libraries have been made for processing, representing and storing various forms of data. One of them holds classes that either don’t fit into an existing library and can’t justify their own, or are used so widely within the project that they can’t be tied down to just one library: this is the common library. In common is a URL class, and as of today it can calculate relative URLs. This means that when URLs discovered on a page were coded as relative URLs, they can be converted to absolute URLs.
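
A simplified sketch of relative-URL resolution, assuming only the three common cases: already-absolute URLs, root-relative paths, and path-relative ones. The real URL class is not shown in the post and certainly handles more (query strings, "..", and so on).

```cpp
#include <string>

// Resolve href against an absolute base URL. Sketch only: assumes base
// is well-formed and ignores "..", fragments and query strings.
std::string resolve_url(const std::string& base, const std::string& href) {
    if (href.rfind("http://", 0) == 0 || href.rfind("https://", 0) == 0)
        return href;                                   // already absolute

    std::size_t scheme_end = base.find("://") + 3;     // past "http://"
    std::size_t host_end = base.find('/', scheme_end); // start of the path
    std::string origin = (host_end == std::string::npos)
        ? base : base.substr(0, host_end);

    if (!href.empty() && href[0] == '/')
        return origin + href;                          // root-relative

    std::string dir = (host_end == std::string::npos)
        ? base + "/"
        : base.substr(0, base.rfind('/') + 1);         // strip document name
    return dir + href;                                 // path-relative
}
```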

The ‘Insert’ service has also been created, crudely. The insert service takes new, unseen URLs which need to be processed and currently puts them in a MySQL table.

The ‘To-Do’ service pulls from the list created by the insert service, sending the next URL to crawl to a requesting crawler on request.

The ‘Cache’ service has been put together using a trie structure to store the URLs encountered. If a URL exists in the trie it has been seen before; otherwise it has not, and it is added during the check itself, as the check implies acknowledgement.
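
The check-and-insert behaviour can be sketched like this; the node layout is an assumption, as the post does not show the Cache service's internals.

```cpp
#include <map>
#include <memory>
#include <string>

// Character trie over URLs with check-and-insert semantics.
class UrlTrie {
    struct Node {
        std::map<char, std::unique_ptr<Node> > children;
        bool terminal = false;  // a URL ends at this node
    };
    Node root_;

public:
    // Returns true if the URL was already present; otherwise marks it as
    // present and returns false -- the check itself acknowledges the URL.
    bool seen(const std::string& url) {
        Node* node = &root_;
        for (char c : url) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        bool was_seen = node->terminal;
        node->terminal = true;
        return was_seen;
    }
};
```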

A simple version of the crawler has been created. It interacts with all of the services, so every one of them had to exist before it could be written.

Robots Service

February 23rd, 2010

With downloading and parsing robots now complete, I was able to finish the robots service today. It caches robots data indefinitely in a MySQL database using the recently written mysql library, which tidily wraps the mysql++ library for my specific use.

Later on I will rewrite the robots service so that it does not use MySQL, supports ageing of data, and copes with temporary HTTP errors (e.g. a 404 response) by making several retry attempts before concluding there is no robots file for that domain.

Downloading files using C++

February 22nd, 2010

Today’s main task has been to find a reliable method for downloading files. The criteria are simple: it needs to support HTTP (HTTPS does not matter), it needs to be able to time out, and it must be callable from several threads simultaneously. The latter caused an unexpected problem, which I will come to.

Curl is a well-known library for downloading files and has several language bindings, including a C++ one called cUrl++ (cUrlpp). This was my starting point; I assumed that as it wrapped a well-known and reputable library it would be the best to use. Unfortunately, due to the underlying C implementation, it relies on a static method as the callback for copying the resultant data to an accessible variable, and being static that method has to access static variables. This meant multi-threaded use failed when the static callback was invoked simultaneously by cUrl++ on two (or more) threads.

The only other suitable option I found was boost, and with that as my starting point I decided to write my own code. Using a combination of asynchronous and synchronous I/O, a time-out thread and a few mutexes, I ended up with exactly what I needed (without HTTPS). Multi-threaded access works fine as no variables or methods are static, and the time-out thread is bound using boost::bind so that it can access instance variables.
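
A minimal sketch of the time-out idea, using std::thread and std::condition_variable as stand-ins for the boost primitives. The socket I/O is abstracted behind a caller-supplied download function, since the real boost::asio code is not shown in the post.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Run a blocking download on a worker thread and wait at most `timeout`
// for it to finish. Returns true and fills `out` on success.
bool download_with_timeout(const std::function<std::string()>& download,
                           std::chrono::milliseconds timeout,
                           std::string& out)
{
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::string result;

    std::thread worker([&] {
        std::string data = download();          // the blocking fetch
        std::lock_guard<std::mutex> lock(m);
        result = data;
        done = true;
        cv.notify_one();
    });

    std::unique_lock<std::mutex> lock(m);
    bool finished = cv.wait_for(lock, timeout, [&] { return done; });
    lock.unlock();
    worker.join();  // NB: a real implementation would cancel the socket
                    // rather than join, so a hung download cannot block here
    if (finished) out = result;
    return finished;
}
```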

With this done I could move on to the current need for downloading files – the robots parser. The current implementation carefully processes the robots file, extracting the rules that apply to the specified (our) user agent if it is explicitly listed, and otherwise extracting the wild-card rules. It processes the robots data it considers relevant as follows:

  • When matching rules to URLs, the longest, most specific match takes precedence.
    • Bing is known to take this approach, whereas Google just takes the first match; there are pros and cons to both approaches.
  • URL matching is not case sensitive.
  • Comments are stripped out whether they’re in-line or on their own line.
  • Rule values (aka the path) are trimmed.
  • User-Agent matching is not case sensitive and is done on the first ‘word’ (the string before the first white-space character)
    • This could be considered controversial, but some user-agent strings vary (Googlebot contains the version)
    • In the future I might use regex or a ‘contains’ check
  • The data relevant to ‘our’ User-Agent can be extracted so that only what is needed can be cached
  • The Robots library can extract and import the ‘relevant’ information OR take a raw robots file
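
The longest-match selection described above (the Bing-style behaviour) can be sketched as follows. The rule representation is an assumption, and real robots semantics (empty Disallow lines, wildcards, $-anchors) are deliberately simplified.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

struct RobotsRule {
    bool allow;        // Allow: vs Disallow:
    std::string path;  // already trimmed
};

static std::string lower_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Pick the longest rule whose path is a prefix of the URL path; matching
// is case-insensitive and a URL with no matching rule is allowed.
bool is_allowed(const std::vector<RobotsRule>& rules, const std::string& url_path) {
    std::string target = lower_copy(url_path);
    const RobotsRule* best = nullptr;
    std::size_t best_len = 0;
    for (const auto& rule : rules) {
        std::string p = lower_copy(rule.path);
        if (p.empty()) continue;                       // empty rule: skip (sketch)
        if (target.compare(0, p.size(), p) == 0 &&     // prefix match
            (!best || p.size() > best_len)) {          // longest wins
            best = &rule;
            best_len = p.size();
        }
    }
    return best ? best->allow : true;
}
```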


February 19th, 2010

A C++ nested type must be defined carefully.

A vector of maps must be defined as

vector<map<type,type> > and NOT vector<map<type,type>>

as the >> is regarded by the compiler as operator >> (as in stream >> string) and not as the two closing brackets of a nested type.

Be careful to add the space so that >> is actually > >

I have been informed that this may be addressed in the next C++ standard…
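
For illustration, here is a declaration that compiles under pre-C++11 rules thanks to the extra space (since C++11 both spellings are accepted):

```cpp
#include <map>
#include <string>
#include <vector>

// Builds one row to show the nested declaration in use; the space in
// "> >" is what pre-C++11 compilers require.
std::vector<std::map<std::string, int> > make_rows() {
    std::vector<std::map<std::string, int> > rows;  // note the "> >"
    std::map<std::string, int> row;
    row["id"] = 1;
    rows.push_back(row);
    return rows;
}
```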


February 18th, 2010

I have taken the time to tackle MySQL++, the C++ wrapper for the standard C-based mysql library. It took a while to resolve the dependencies, but essentially it needs mysqlclient15 (+dev).

I then created a wrapper for MySQL++ with two main functions, execute and query. Execute is used for commands like INSERT, DELETE and such, which do not return results. Query returns a vector<map<std::string,std::string> >, aka a list of key-value pairs, aka a list of columns and values. (See the next post about a vector of maps.)
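
As a sketch of the calling convention: a stub stands in for the real MySQL++-backed query, since the wrapper's internals are not shown in the post and the row contents here are fabricated purely for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Each row is a map of column name -> value; a result set is a vector of rows.
typedef std::vector<std::map<std::string, std::string> > Rows;

// Stub in place of the real MySQL++-backed query(): fabricates one row
// to illustrate the return shape.
Rows query(const std::string& sql) {
    (void)sql;  // the stub ignores the SQL
    Rows rows;
    std::map<std::string, std::string> row;
    row["id"] = "1";
    row["url"] = "http://example.com/";
    rows.push_back(row);
    return rows;
}

// A caller walks the rows as column-name -> value pairs.
std::string first_url(const Rows& rows) {
    if (rows.empty()) return "";
    std::map<std::string, std::string>::const_iterator it = rows[0].find("url");
    return it == rows[0].end() ? "" : it->second;
}
```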

Log viewer

February 17th, 2010

So that logs can be viewed easily and in real-time I have needed to create a log client which connects to the log server and retrieves only updates since the last log file update. The viewer is written in C# and supports Debug, Information, Warning and Error messages.

A log client library has also been created for two-way access to the log server, both adding to and removing from the log server. A mutex is used to ensure atomic updates.

Log File

February 17th, 2010

A simple class has been created to support a log file format. It safely appends data to a logfile while supporting searching and pointer (offset) reference operations, for speedy lookup and update of log clients.
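
The offset-based update scheme can be sketched like this; the one-entry-per-line format and the function names are assumptions, not the real log class. Because the file is append-only, each client just remembers the byte offset it last read, and an update is "seek there and read what follows".

```cpp
#include <fstream>
#include <string>
#include <vector>

// Append one log entry (sketch: one entry per line).
void append_entry(const std::string& path, const std::string& line) {
    std::ofstream out(path.c_str(), std::ios::app);
    out << line << '\n';
}

// Return the entries written since `offset`, advancing it for next time.
std::vector<std::string> read_since(const std::string& path, std::streampos& offset) {
    std::ifstream in(path.c_str());
    in.seekg(offset);
    std::vector<std::string> entries;
    std::string line;
    while (std::getline(in, line))
        entries.push_back(line);
    in.clear();          // clear EOF so tellg() is valid
    offset = in.tellg(); // the client's new resume point
    return entries;
}
```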

Centos 5 Services

February 15th, 2010

After the January exams, I have rewritten much of the system code, having gained a greater understanding of C++ and how to program in it. One of many improvements was to run the applications as services.

A sample service script can be found at:

It requires forking the main process so that starting the service does not hang; a very good tutorial can be found here

Another new feature is a log server, for collecting debugging messages so that they can be viewed from a single location using a basic TCP command set.

I have also created a client library for accessing the setting server through a simple interface.