X(treme) Trie

Firstly, and trivially, #fragments are now stripped from URLs.
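As a rough illustration of what stripping a fragment means, here is a minimal Python sketch. It is not the crawler's actual code; the helper name `strip_fragment` is made up for this example, and it relies only on the standard library's `urllib.parse.urldefrag`.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its #fragment (hypothetical helper name)."""
    # urldefrag splits the URL into (url_without_fragment, fragment).
    return urldefrag(url).url


if __name__ == "__main__":
    print(strip_fragment("http://example.com/page?q=1#top"))
```

Two URLs that differ only in their fragment point at the same document, so dropping the fragment keeps the crawler from queueing the same page twice.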

Now, MySQL is an obvious performance bottleneck when the ‘To Do’ service asks it for a random row. It is (understandably) not the quickest operation to SELECT and then DELETE a row from the middle of the table, as it will re-shuffle the data after deletion to avoid fragmentation.

The plan is to use a trie that does two jobs: the existence of an entry in the trie tells us whether a URL has been seen before, and flags in the data stream tell us whether it has already been crawled. Here goes…
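To make the idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration under assumptions: the names `TrieNode`, `UrlTrie`, `add`, `has_seen`, `mark_crawled` and `is_crawled` are invented for this example, the trie is keyed one character at a time, and the "crawled" flag lives on the node rather than in any serialized data stream.

```python
class TrieNode:
    """One character position in the trie (hypothetical structure)."""
    __slots__ = ("children", "is_entry", "crawled")

    def __init__(self):
        self.children = {}      # next character -> TrieNode
        self.is_entry = False   # True if a full URL ends at this node
        self.crawled = False    # True once that URL has been crawled


class UrlTrie:
    def __init__(self):
        self.root = TrieNode()

    def _find(self, url):
        """Walk the trie; return the final node, or None if the path is missing."""
        node = self.root
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def add(self, url):
        """Record a URL as seen. Returns True if it was not seen before."""
        node = self.root
        for ch in url:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = TrieNode()
                node.children[ch] = nxt
            node = nxt
        is_new = not node.is_entry
        node.is_entry = True
        return is_new

    def has_seen(self, url):
        node = self._find(url)
        return node is not None and node.is_entry

    def mark_crawled(self, url):
        """Flag a URL as crawled, adding it first if necessary."""
        self.add(url)
        self._find(url).crawled = True

    def is_crawled(self, url):
        node = self._find(url)
        return node is not None and node.is_entry and node.crawled


if __name__ == "__main__":
    trie = UrlTrie()
    trie.add("http://example.com/a")
    trie.mark_crawled("http://example.com/a")
    print(trie.has_seen("http://example.com/a"), trie.is_crawled("http://example.com/a"))
```

Because URLs on the same site share long prefixes, they share most of their path through the trie, and both the "seen" and "crawled" lookups cost time proportional to the URL length rather than to the number of URLs stored.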
