A waste of time

Over the last week or so I have been forced to conclude that the xtrie is no good for this high-performance need. In single-threaded use it is too slow, and in multi-threaded use with locking it is, if anything, slower; more importantly it suffers from deadlocks (which I accept is probably down to my locking code).

Unfortunately I have a nice big table of results that proves my xtrie is not fit for this purpose…

I devised two locking methods. With no locking it took time t to create and then read back the contents of the trie; with method 1 it took 4t, and with method 2 it took 10t.

Some values also seemed to get lost in the trie: they would go in but not come out. On examining the raw data it appeared that the trie marked them as done even when they were not.

Instead of continuing with this, I am taking a different approach, one which will provide much greater performance gains for less effort.

The paged queue will replace the xtrie. The main emphasis is to uphold a newly created rule: data should be read and written as sequentially as possible, with minimal jumping (seeking) around the data file, since sequential access gives the best performance.

  • The Insert Service receives URLs to be crawled from the crawler; these are inserted into a temporary queue.
  • One of two events will cause the temporary queue to be dumped and processed
    • The queue hits a size threshold
    • The current page is empty
  • When the temporary queue is processed, the domains of the URLs are counted so that the most common one can be identified
  • If the most common domain makes up more than n% (at most 50%) of the URLs, entries from it are removed until this is no longer the case
    • Any removed entries are put to one side and inserted back into the temporary queue at the end
  • The remaining URLs are then sorted so that no two consecutive URLs have the same domain
    • For example the domains 0 1 1 2 2 2 3 3 3 4 4 4 4
    • 4 is most common and makes up 4/13 of the entries
    • They would be sorted:
    • 4 3 2 4 1 3 0 2 4 1 3 2 4
    • This is not exactly how my algorithm sorts them; it tries to maintain maximum diversity (a sketch of this step appears after this list)
    • If any URLs were removed to preserve diversity, they are added back to the temporary page after it is wiped clean
  • Once sorted, they are saved to a page in that order, where they can be read sequentially by the ToDo service
  • A Page Manager keeps track of the current page in use and of free pages that can be filled after processing (a sketch of this bookkeeping also appears after this list)
  • Once a page is emptied by the ToDo service it is wiped clean and marked as free so it can be used at a later date.
  • Most data is kept on the file system to protect against software failure
    • The temporary queue is stored on the file system
    • The current page has a marker written at the last entry requested so that if the service crashes it can recover from where it left off
    • The Page Manager stores all its working data in a small data file in case of crashes
    • Temp page processing is done in memory, but the temporary page is only wiped after the processing is done
    • In-memory data structures are kept as small as possible, and queues/lists are cleared as soon as they are no longer needed to minimise the memory footprint
      • In theory, if the temporary page threshold is 1M, processing should not need much more memory than that, as only one copy of the data is maintained at any one time
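
As a rough illustration of the temporary-queue processing step, here is a minimal Python sketch of the diversity cap and the no-repeated-domain ordering. The function name, the threshold parameter and the use of urlparse are my own illustrative choices, and as noted above the real algorithm aims for maximum diversity rather than this simple greedy interleave.

```python
from collections import defaultdict
from urllib.parse import urlparse


def process_temp_queue(urls, diversity_threshold=0.5):
    """Cap the most common domain at the threshold, then interleave the rest
    so that no two consecutive URLs share a domain.

    Returns (page_entries, carried_over); carried_over goes back into the
    temporary queue after it has been wiped.
    """
    if not urls:
        return [], []

    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)

    # Remove entries from the most common domain until it makes up no more
    # than diversity_threshold of what is left; removed entries are carried
    # over to the next temporary queue.
    carried_over = []
    total = len(urls)
    top = max(by_domain, key=lambda d: len(by_domain[d]))
    while total > 0 and len(by_domain[top]) / total > diversity_threshold:
        carried_over.append(by_domain[top].pop())
        total -= 1

    # Greedy interleave: always take a URL from the domain with the most
    # entries remaining, never repeating the domain used immediately before.
    page_entries = []
    last_domain = None
    while total > 0:
        candidates = [d for d in by_domain if by_domain[d] and d != last_domain]
        if not candidates:            # only the previous domain has URLs left
            candidates = [d for d in by_domain if by_domain[d]]
        pick = max(candidates, key=lambda d: len(by_domain[d]))
        page_entries.append(by_domain[pick].pop())
        last_domain = pick
        total -= 1

    return page_entries, carried_over
```

Run on something like the example above (domains 0–4), a greedy interleave of this kind spreads the most common domain through the page rather than clustering it at one end.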
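
The Page Manager bookkeeping and the crash-recovery marker could look something like the sketch below. It assumes a small JSON state file written with a write-and-rename; the post does not describe the actual file formats or method names, so everything here is illustrative.

```python
import json
import os


class PageManager:
    """Tracks the current page, the free pages and the last-read marker,
    persisting to a small state file so it can survive a crash."""

    def __init__(self, state_path):
        self.state_path = state_path
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.state = json.load(f)
        else:
            self.state = {"current_page": None, "free_pages": [], "read_offset": 0}

    def _persist(self):
        # Write-and-rename so a crash never leaves a half-written state file.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.state_path)

    def mark_read(self, offset):
        # Marker for the last entry handed to the ToDo service, so a restart
        # resumes from where it left off.
        self.state["read_offset"] = offset
        self._persist()

    def acquire_page(self, new_page_id):
        # Reuse a free page if one exists, otherwise take the supplied new id.
        free = self.state["free_pages"]
        self.state["current_page"] = free.pop() if free else new_page_id
        self.state["read_offset"] = 0
        self._persist()
        return self.state["current_page"]

    def release_page(self, page_id):
        # Called once the ToDo service has emptied a page and it has been
        # wiped clean; the page is then available to be filled again.
        self.state["free_pages"].append(page_id)
        if self.state["current_page"] == page_id:
            self.state["current_page"] = None
        self._persist()
```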

Potential Improvements

Processing causes the main thread to wait and is currently a bottleneck that will need to be addressed later on.

The diversity threshold could vary based on historical data.

The Cache

The cache is now the trie again, still called the PowerCache Service. It uses block-level caching and supports locking, but does not currently use it as it only adds overhead.

The trie in use is referred to as an fclTrie (fast, cached, locked).
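
Block-level caching here presumably means caching fixed-size chunks of the trie's data file rather than individual nodes. A minimal sketch of that idea, assuming an LRU policy and a 4 KB block size (both illustrative; the post does not say how the fclTrie's cache actually works):

```python
from collections import OrderedDict


class BlockCache:
    """Serve reads from fixed-size file blocks kept in an LRU cache."""

    def __init__(self, path, block_size=4096, max_blocks=1024):
        self.f = open(path, "rb")
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()   # block index -> bytes, in LRU order

    def read(self, offset, length):
        # Read `length` bytes at `offset`, served from cached blocks.
        out = bytearray()
        end = offset + length
        while offset < end:
            idx = offset // self.block_size
            block = self._block(idx)
            start = offset - idx * self.block_size
            take = min(end - offset, self.block_size - start)
            out += block[start:start + take]
            offset += take
        return bytes(out)

    def _block(self, idx):
        if idx in self.blocks:
            self.blocks.move_to_end(idx)         # mark as most recently used
        else:
            self.f.seek(idx * self.block_size)
            self.blocks[idx] = self.f.read(self.block_size)
            if len(self.blocks) > self.max_blocks:
                self.blocks.popitem(last=False)  # evict least recently used
        return self.blocks[idx]
```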
