Forking

March 26th, 2010

After some SIGABRTs/segfaults, the download library’s time-out code has been cleaned up; it sometimes tried to destruct locked mutexes.

Minor tweaks and such have gone on, but more recently I did a bit of research on forking – though my services use it, I was still unclear on what exactly it allows:

  • Forking creates new processes
  • Each process gets its own thread limit
  • Fork into two processes, at 300 threads per process, for a total of 600 threads (see the sketch below)
  • In this case there is not much point: the mutex locks cause everything to wait, so the work would just be waiting in another process instead – no performance gain
  • The powercache and paged queue need work – can’t escape that.
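
To make the first three bullets concrete, here is a minimal sketch of the mechanics only (run_worker and the constant are illustrative stand-ins, not my service code): fork() gives a second process, and each process can then create its own pool of threads up to the per-process limit. Note that it does nothing about the shared mutexes, which is exactly why it would not help here.

    #include <sys/wait.h>
    #include <unistd.h>
    #include <thread>
    #include <vector>

    // Illustrative worker; the real crawling/service code is not shown here.
    void run_worker(int /*id*/) {
        // ... do work ...
    }

    int main() {
        const int kThreadsPerProcess = 300;    // roughly the cluster's per-process limit
        pid_t pid = fork();                    // parent + child = two processes

        std::vector<std::thread> pool;
        for (int i = 0; i < kThreadsPerProcess; ++i)
            pool.emplace_back(run_worker, i);  // ~300 threads in each process
        for (std::thread& t : pool)
            t.join();

        if (pid > 0)
            wait(nullptr);                     // parent reaps the child
        return 0;
    }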

Issues…Grrr.

March 25th, 2010

3 issues came up today.

  1. TCP Servers (used in many services) were segfaulting all over the place
  2. The ToDo service was returning a nonsense entry after the last entry in a page
  3. Despite earlier fixes, the TCP services are still using too many threads, because some libraries use mutexes to ensure synchronised (one-at-a-time) access, causing processing threads to wait for the previous one to finish.

My notes and solutions

  • TCP server segfault
    • Fixed with try/catch I think; it throws SIGABRT sometimes (NO! Didn’t sort the segfault!)
    • If response == “” don’t reply – seems to fix the segfault; with no request the read fails… NO! WRONG!
    • The server was destroying itself before the processing thread was done if the connection was closed; on close, wait for the thread and then clean up – works! 🙂
    • Used my own threadsafe_int class to track open threads (a sketch of such a class follows this list)
  • To Do nonsense
    • Was returning an h followed by some random square (garbage) character
    • Crude fix: if the response is shorter than 7 characters, consider it invalid, ignore it and move to the next page
  • tcpservice::boost::thread_resource_error
    • Track running/waiting threads
    • If > n (about 250), wait before creating a new one
    • Will deal with this later
    • Forking?
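
The threadsafe_int class itself is not shown in this post, so the following is only a rough guess at what such a counter might look like (using Boost, since the services already use boost::thread); the wait-below-a-limit helper for the ~250-thread cap is my own naming.

    #include <boost/thread/condition_variable.hpp>
    #include <boost/thread/mutex.hpp>

    class threadsafe_int {
    public:
        explicit threadsafe_int(int value = 0) : value_(value) {}

        void increment() {
            boost::mutex::scoped_lock lock(mutex_);
            ++value_;
        }

        void decrement() {
            boost::mutex::scoped_lock lock(mutex_);
            --value_;
            cond_.notify_one();            // wake anyone waiting for a free slot
        }

        // Block until the count drops below 'limit' (e.g. ~250), then reserve
        // a slot by incrementing.
        void wait_below_then_increment(int limit) {
            boost::mutex::scoped_lock lock(mutex_);
            while (value_ >= limit)
                cond_.wait(lock);
            ++value_;
        }

    private:
        int value_;
        boost::mutex mutex_;
        boost::condition_variable cond_;
    };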

The last issue is mainly caused by the powercache not being multi-threaded and the paged queue hanging during processing. I need to improve the powercache and the paged queue needs to open a new temp-page and process the current temp-page on a separate thread.

Paged Queue

March 24th, 2010

The Paged Queue now automatically tunes the diversity. It works out roughly how many pages have been crawled per second since the temporary page was last processed and sets the diversity to the reciprocal of this, meaning that (in theory) no single domain should be crawled more than once a second. I know this is flawed, as it is based on past data, but it does the job for now.

E.g.:

If the average is 10 pages/sec, the diversity will be 1/10 = 0.1, or 10%, meaning at most 1 in 10 entries in the page will be from the same domain.
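
In code the tuning rule amounts to something like the following (names are illustrative; this is not the actual implementation):

    #include <cstddef>

    // Diversity = 1 / (pages crawled per second since the temporary page was
    // last processed), capped at 100%.
    double tune_diversity(std::size_t pages_crawled, double seconds_elapsed) {
        if (seconds_elapsed <= 0.0)
            return 1.0;
        double pages_per_sec = static_cast<double>(pages_crawled) / seconds_elapsed;
        if (pages_per_sec <= 1.0)
            return 1.0;                  // slower than 1 page/sec: no cap needed
        return 1.0 / pages_per_sec;      // e.g. 10 pages/sec -> 0.1 (10%)
    }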

MooseFS

March 20th, 2010

This GFS-like (Google File System, not the Red Hat thingy) storage implementation is my current way of tackling the problem of distributing all the data that I will need to store, without having to devise my own solution – but why would I want to?! It is awesome!

It is well worth a look if you need to distribute data across many machines for performance, scalability or both!

A snippet from the MooseFS site:

MooseFS is a fault tolerant, network distributed file system. It spreads data over several physical servers which are visible to the user as one resource. For standard file operations MooseFS acts as other Unix-alike file systems:

  • A hierarchical structure (directory tree)
  • Stores POSIX file attributes (permissions, last access and modification times)
  • Supports special files (block and character devices, pipes and sockets)
  • Symbolic links (file names pointing to target files, not necessarily on MooseFS) and hard links (different names of files which refer to the same data on MooseFS)
  • Access to the file system can be limited based on IP address and/or password

Distinctive features of MooseFS are:

  • High reliability (several copies of the data can be stored across separate computers)
  • Capacity is dynamically expandable by attaching new computers/disks
  • Deleted files are retained for a configurable period of time (a file system level “trash bin”)
  • Coherent snapshots of files, even while the file is being written/accessed

TCP, Paged Q and a Crawl

March 15th, 2010

  • The main code of the paged queue has now been finished off
  • The crawler is now multi-threaded, simply by running the existing code many times on many threads
  • The TCP server has had to be modified – each client was being assigned its own thread
    • Now each request is assigned a thread for the duration of the request
    • This became a problem as 21 machines, each with 20 crawling threads, made up to 420 connections to the services
    • The GCC-imposed thread limit on my cluster is ~300
  • Another depth-5 crawl was done, which took 2 hrs and covered 413,387 pages.

A waste of time

March 10th, 2010

For the last week or so I have been forced to conclude that the xtrie is no good for this high-performance need. In single-threaded use it is too slow, and with locking and multi-threaded use it is, if anything, slower, but more importantly it suffers from deadlocks (which I accept is going to be due to my locking code).

Unfortunately I have a nice big table of results that proves my xtrie is not fit for this purpose…

I devised two locking methods. With no locking it takes time t to create and then read the contents of the trie; with method 1 it took 4t, and with method 2 it took 10t.

Some values also seemed to get lost in the trie; in other words, they would go in but not come out – and upon examining the raw data it seemed like the trie marked them as done even if they were not.

Instead of continuing with this, I am taking a different approach, one which will provide much greater performance gains for less effort.

The paged queue will replace the xtrie. The main emphasis for performance is to uphold the (newly created) rule that data should be read and written as sequentially as possible, with minimal jumping (seeking) around the data file, as sequential access gives the best performance.

  • The Insert service receives URLs to be crawled from the crawler; these are inserted into a temporary queue.
  • One of two events will cause the temporary queue to be dumped and processed:
    • The queue hits a size threshold
    • The current page is empty
  • When the temporary queue is processed, the domains of all URLs are counted so that the most common can be identified
  • If the most common domain makes up more than n% (at most 50%) of the URLs, entries are removed until this is no longer the case
    • Any removed entries are put to one side and inserted back into the temporary queue at the end
  • The URLs left are then sorted so that no two URLs in sequence have the same domain (a sketch of this follows the list)
    • For example, the domains 0 1 1 2 2 2 3 3 3 4 4 4 4
    • 4 is the most common and makes up 4/13 of the entries
    • They would be sorted: 4 3 2 4 1 3 0 2 4 1 3 2 4
    • This is not exactly how my algorithm sorts them; it tries to maintain maximum diversity
    • If any URLs have been removed to preserve diversity, they are added to the temporary page after it is wiped clean
  • Once sorted, the URLs are saved in a page in that order, where they can be read sequentially by the ToDo service
  • A Page Manager keeps track of the current page in use and of free pages that can be filled after processing
  • Once a page is emptied by the ToDo service, it is wiped clean and marked as free so it can be used at a later date.
  • Most data is kept on the file system to protect against software failure
    • The temporary queue is stored on the file system
    • The current page has a marker written at the last entry requested, so that if the service crashes it can recover from where it left off
    • The Page Manager stores all its working data in a small data file in case of crashes
    • Temp-page processing is done in memory, but the temporary page is only wiped after the processing is done
    • In-memory data structures are kept as small as possible, and queues/lists are cleaned as soon as they’re not needed to minimise the memory footprint
      • In theory, if the temporary page threshold is 1M then processing should not use much more memory than that, as only one copy of the data is maintained at any one time
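
My sorting code is not shown here and, as noted above, it tries to maintain maximum diversity rather than doing exactly this, but the sketch below captures the idea: group the URLs by domain, then greedily emit from the domain with the most URLs remaining, never picking the same domain twice in a row where it can be avoided.

    #include <cstddef>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Input: (domain, url) pairs from the temporary page, already reduced so
    // that no domain exceeds the threshold. Output: the same URLs, ordered so
    // that the same domain is never picked twice in a row where avoidable.
    std::vector<std::string> diversify(
            const std::vector<std::pair<std::string, std::string>>& urls) {
        std::unordered_map<std::string, std::vector<std::string>> buckets;
        for (const auto& entry : urls)
            buckets[entry.first].push_back(entry.second);

        // Max-heap of (URLs remaining, domain).
        std::priority_queue<std::pair<std::size_t, std::string>> heap;
        for (const auto& bucket : buckets)
            heap.push({bucket.second.size(), bucket.first});

        std::vector<std::string> ordered;
        std::string last_domain;
        while (!heap.empty()) {
            auto current = heap.top();
            heap.pop();
            if (current.second == last_domain && !heap.empty()) {
                // Same domain as the previous entry: use the runner-up instead.
                auto runner_up = heap.top();
                heap.pop();
                heap.push(current);
                current = runner_up;
            }
            ordered.push_back(buckets[current.second].back());
            buckets[current.second].pop_back();
            last_domain = current.second;
            if (--current.first > 0)
                heap.push(current);        // this domain still has URLs left
        }
        return ordered;
    }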

Potential Improvements

Processing causes the main thread to wait and is currently a bottleneck that will need to be addressed later on.

The diversity threshold could vary based on historical data.

The Cache

The cache is now based on the trie again, and is still called the PowerCache Service. It uses block-level caching, and it supports locking but does not currently use it, as locking only adds overhead.

The trie in use is referred to as an fclTrie (fast, cached, locked).

Xtries and Crawls

March 2nd, 2010

Progress:

  • Many xtrie bugs ironed out; however, there is still a memory leak
  • It will need some form of block-level locking for multi-threaded access
  • 2nd big crawl completed, with a different start point to last time. Took 2 hrs 45 mins and covered 228,201 pages

XTrie & PowerCache

March 1st, 2010

The XTrie now acts as a to-do mechanism. Details to follow…

The ‘Insert’ and ‘To Do’ services have been merged into one called the ‘PowerCache’ which runs the trie in place of its two predecessor services.

The Hard-Core details

When URLs are put into the trie, this is done on a per-character level. Each character is stored as a block if it is new. Each block consists of:

  • The location of the ‘first child’
  • The location of the ‘next peer’
  • The Round-Robin to-do entry (also a location)
  • The ‘flags’

The locations are pointers (64-bit unsigned integers) to other blocks. The flags indicate: whether the block is terminal, i.e. the end of a stored string (clarified later); if terminal, whether the URL still has to be crawled (ToDo); and whether there are child blocks that represent URLs that need to be crawled (otherwise known as inherited ToDo).
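
A plausible in-memory layout for one block, based on the description above (the field and flag names are mine, not taken from the actual source, and whether a block also stores its character is my assumption):

    #include <cstdint>

    struct XtrieBlock {
        std::uint64_t first_child;   // location of the first child block (0 = none)
        std::uint64_t next_peer;     // location of the next sibling block (0 = none)
        std::uint64_t rr_todo;       // round-robin To-Do pointer, usually starts at first_child
        char          entry;         // the character this block represents
        std::uint8_t  flags;         // bit flags, see below
    };

    // Flag bits: terminal (ends a stored string), To-Do (terminal URL still to
    // be crawled), inherited To-Do (some descendant still needs crawling).
    enum XtrieFlags : std::uint8_t {
        kTerminal      = 1 << 0,
        kToDo          = 1 << 1,
        kInheritedToDo = 1 << 2
    };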

Imagine storing two URLs – www.test.com and www.nest.com.

The domain of the URL is reversed so that the root is always on the left (domains are read from the right and paths from the left in URLs), so www.test.com becomes moc.tset.www (and www.test.com/a/path would become moc.tset.www/a/path), which preserves the left-rooted hierarchy.
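
The reversal itself is simple; a sketch (the function name is illustrative, and it assumes the scheme has already been stripped, as in the examples above):

    #include <algorithm>
    #include <string>

    // www.test.com        -> moc.tset.www
    // www.test.com/a/path -> moc.tset.www/a/path
    std::string reverse_host(const std::string& url) {
        std::string::size_type slash = url.find('/');
        std::string host = url.substr(0, slash);       // whole string if no path
        std::reverse(host.begin(), host.end());        // reverse the domain only
        if (slash == std::string::npos)
            return host;
        return host + url.substr(slash);               // keep the path as-is
    }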

Now we have moc.tset.www and moc.tsen.www; they will be stored in the trie in the following hierarchical structure (indentation implies a new trie level and pipes | link levels):

    m
     o
      c
       .
        t
         s
          e
           t
           |.
           | w
           |  w
           |   w
           n 
            .
             w
              w
               w

This will probably seem wasteful at first, but there are reasons for my design decisions. If, instead of single letters, the first distinctive prefix was stored in each block, data blocks of different sizes would occur, and when/if they were split, all the data around them would need to be shifted.

In the above example, if we look at m, its first child will point to o and its next peer will point to null. On the other hand, the second t’s first child is . and its next peer is n. The RR-Todo value is usually initialized to the same as the first child.

All the data for the example is shown below; hopefully this will clarify how the system works.

ID is the pointer location, Entry is the value itself, FC is the first child, NP is the next peer, RR is the Round-Robin To-Do value, T is the terminal flag, I is the inherited To-Do flag and D is the To-Do flag.

    ID Entry         FC NP RR T I D
     1 m              2  0  2 0 1 0
     2  o             3  0  3 0 1 0
     3   c            4  0  4 0 1 0
     4    .           5  0  5 0 1 0
     5     t          6  0  6 0 1 0
     6      s         7  0  7 0 1 0
     7       e        8  0  8 0 1 0
     8        t       9 13  9 0 1 0
     9        |.     10  0 10 0 1 0
    10        | w    11  0 11 0 1 0
    11        |  w   12  0 12 0 1 0
    12        |   w   0  0 13 1 0 1
    13        n      14  0 14 0 1 0
    14         .     15  0 15 0 1 0
    15          w    16  0 16 0 1 0
    16           w   17  0 17 0 1 0
    17            w   0  0  0 1 0 1

Rows 1–12 above are the ones followed on the first request for a URL To Do.

After the URL has been returned, the ToDo flag will be updated, and the inherited ToDo flags will be updated recursively if they need to be changed.

    ID Entry         FC NP RR T I D
     1 m              2  0  2 0 1 0
     2  o             3  0  3 0 1 0
     3   c            4  0  4 0 1 0
     4    .           5  0  5 0 1 0
     5     t          6  0  6 0 1 0
     6      s         7  0  7 0 1 0
     7       e        8  0 13 0 1 0
     8        t       9 13  9 0 0 0
     9        |.     10  0 10 0 0 0
    10        | w    11  0 11 0 0 0
    11        |  w   12  0 12 0 0 0
    12        |   w   0  0 13 1 0 0
    13        n      14  0 14 0 1 0
    14         .     15  0 15 0 1 0
    15          w    16  0 16 0 1 0
    16           w   17  0 17 0 1 0
    17            w   0  0  0 1 0 1

Note that the ToDo flag of the last w (ID 12) is now gone, and the inherited ToDo flags along that branch (IDs 8–11) have been cleared; however, ID 7 still inherits a ToDo entry, as n at ID 13 carries an inherited ToDo value. Note also that the RRTodo values of the traversed blocks (1–12) have advanced to the next child where possible – in this case only block 7 has another child to advance to – but if there were several root entries other than m, that would have moved along to the next one as well.

After the second request the table would look like this:

    ID Entry         FC NP RR T I D
     1 m              2  0  2 0 0 0
     2  o             3  0  3 0 0 0
     3   c            4  0  4 0 0 0
     4    .           5  0  5 0 0 0
     5     t          6  0  6 0 0 0
     6      s         7  0  7 0 0 0
     7       e        8  0  9 0 0 0
     8        t       9 13  9 0 0 0
     9        |.     10  0 10 0 0 0
    10        | w    11  0 11 0 0 0
    11        |  w   12  0 12 0 0 0
    12        |   w   0  0 13 1 0 0
    13        n      14  0 14 0 0 0
    14         .     15  0 15 0 0 0
    15          w    16  0 16 0 0 0
    16           w   17  0 17 0 0 0
    17            w   0  0  0 1 0 0

In block 7, the RRTodo has looped back to the beginning, as this block was traversed again and the pointer went past the last child.

This method ensures that URLs are not crawled sequentially – far from it, in fact.

Note: Later on I will find that this system is very slow due to the recursion and the vast sparse block lookups and pointer jumping…

Odds and Ends

February 28th, 2010

Several small but crucial improvements have been made:

  • The URL class now ignores URLs which start with ‘mailto’ or ‘javascript’ and considers them invalid. (Remember, they are invalid in the eyes of the crawler even if they are not in reality!)
  • Accidentally inserted double forward-slashes are removed (a sketch of these two URL rules follows the list)
  • A quick tool was made to change the Remote Development Host of all NetBeans projects in one go by editing the XML project files
    • I often switch between a local virtual machine for coding and the server cluster for debugging
  • More work (though nothing significant) was done on the xtrie
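
The URL class itself is not shown here, but an illustrative version of the two clean-up rules above might look like this (names are mine):

    #include <string>

    // Treat mailto: and javascript: links as invalid for crawling purposes.
    bool has_crawlable_scheme(const std::string& url) {
        return url.rfind("mailto", 0) != 0 && url.rfind("javascript", 0) != 0;
    }

    // Remove accidental "//" in the path, leaving the "://" after the scheme alone.
    std::string collapse_double_slashes(const std::string& url) {
        std::string::size_type start = url.find("://");
        start = (start == std::string::npos) ? 0 : start + 3;
        std::string out = url.substr(0, start);
        char prev = '\0';
        for (std::string::size_type i = start; i < url.size(); ++i) {
            if (url[i] == '/' && prev == '/')
                continue;                  // skip the duplicate slash
            out += url[i];
            prev = url[i];
        }
        return out;
    }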

X(treme) Trie

February 27th, 2010

Firstly, and trivially, #fragments are now stripped from URLs.

Now, MySQL is an obvious performance bottleneck when the ‘To Do’ service asks it for a random row; it is (understandably) not the quickest operation to SELECT and then DELETE a row from the middle of the table, as the data is re-shuffled after deletion to avoid fragmentation.

The plan is to use a trie which can tell whether a URL has been seen before (by the existence of an entry in the trie) and also whether it has already been crawled (by using flags in the data stream). Here goes…