XTrie & PowerCache

The XTrie now acts as a to-do mechanism. Details to follow…

The ‘Insert’ and ‘To Do’ services have been merged into one called the ‘PowerCache’ which runs the trie in place of its two predecessor services.

The Hard-Core details

When the URLs are put in to the trie, they are done on a per-character level. Each character is stored as a block if it is new. Each block consists of:

  • The location of the ‘first child’
  • The location of the ‘next peer’
  • The Round-Robin to-do entry (also a location)
  • The ‘flags’

The locations are pointers (64-bit unsigned integers) to other blocks. The flags indicate one of two things, whether the block is terminal, ie is the the end of a string (clarified later), if terminal whether the URL has to be crawled (ToDo) and whether there are child blocks that represent URLs that need to be crawled (otherwise known as inherited ToDo).

Imagine storing two URLs – www.test.com and www.nest.com.

The domain of the URL is reversed to that the root is always from the left (domains are from the right and paths are from the left in URLs), so www.test.com becomes moc.tset.www (and www.test.com/a/path would become moc.tset.www/a/path) which preserves the left-rooted hierarchy.

Now we have moc.tset.www and moc.tsen.www, they will be stored in the trie in the following hierarchical structure (indentations implies a new tree/trie level and pipes | link levels)

m
 o
  c
   .
    t
     s
      e
       t
       |.
       | w
       |  w
       |   w
       n 
        .
         w
          w
           w

This will probably seem wasteful at first but there are reasons for my design decisions. If, instead of each letter the first distinctive prefix was used it would mean that data blocks of different sizes would occur and when/if they were split all data around them would need to be shifted.

In the above example if we look at m, it’s first child will point to o and its next peer will point to null. On the other hand, t‘s first child is . and its next peer is n. The RR-Todo value is usually initialized to the same as the first child.

All data is shown below for the example, hopefully this will clarify how the system works.

ID is the pointer location, entry is the value itself, FC is the first child, NP is the next peer, RR is the Round Robin To Do value, T is the terminal flag, I is the inherited To-Do and D is To Do.

ID Entry         FC NP RR T I D
 1 m              2  0  2 0 1 0
 2  o             3  0  3 0 1 0
 3   c            4  0  4 0 1 0
 4    .           5  0  5 0 1 0
 5     t          6  0  6 0 1 0
 6      s         7  0  7 0 1 0
 7       e        8  0  8 0 1 0
 8        t       9 13  9 0 1 0
 9        |.     10  0 10 0 1 0
10        | w    11  0 11 0 1 0
11        |  w   12  0 12 0 1 0
12        |   w   0  0 13 1 0 1
13        n      14  0 14 0 1 0
14         .     15  0 15 0 1 0
15          w    16  0 16 0 1 0
16           w   17  0 17 0 1 0
17            w   0  0  0 1 0 1

The rows in red above will be followed on the first request for a URL To Do

After the URL has been returned the ToDo flag will be updated, and the inherited ToDo will be updated recursively if it needs to be changed.

ID Entry         FC NP RR T I D
 1 m              2  0  2 0 1 0
 2  o             3  0  3 0 1 0
 3   c            4  0  4 0 1 0
 4    .           5  0  5 0 1 0
 5     t          6  0  6 0 1 0
 6      s         7  0  7 0 1 0
 7       e        8  0 13 0 1 0
 8        t       9 13  9 0 0 0
 9        |.     10  0 10 0 0 0
10        | w    11  0 11 0 0 0
11        |  w   12  0 12 0 0 0
12        |   w   0  0 13 1 0 0
13        n      14  0 14 0 1 0
14         .     15  0 15 0 1 0
15          w    16  0 16 0 1 0
16           w   17  0 17 0 1 0
17            w   0  0  0 1 0 1

Note in green, to ToDo flag of the last w is gone and so the inherited ToDo flags have been cleared, however ID 7 in blue still inherits a ToDo entry as n at 13 carries an inherited ToDo value. Note also, that all RRTodo values  of blocks traversed (1-12) have advanced to the next child if possible (in bold), in this case only the block at 7 has, but if there were several root entries, other than m then this would’ve moved along to the next one.

After the second request the table would look like this:

ID Entry         FC NP RR T I D
 1 m              2  0  2 0 0 0
 2  o             3  0  3 0 0 0
 3   c            4  0  4 0 0 0
 4    .           5  0  5 0 0 0
 5     t          6  0  6 0 0 0
 6      s         7  0  7 0 0 0
 7       e        8  0  9 0 0 0
 8        t       9 13  9 0 0 0
 9        |.     10  0 10 0 0 0
10        | w    11  0 11 0 0 0
11        |  w   12  0 12 0 0 0
12        |   w   0  0 13 1 0 0
13        n      14  0 14 0 0 0
14         .     15  0 15 0 0 0
15          w    16  0 16 0 0 0
16           w   17  0 17 0 0 0
17            w   0  0  0 1 0 0

In 7, the RRTodo has looped back to the beginning as this block was traversed again, went past the last child.

This method ensures that urls are not crawled sequentially – far from it in fact.

Note: Later on I will find this system is very slow due to the recursion and vast sparse block lookups and pointer jumping…

Leave a Reply