Underappreciated Perl Modules–Tree::Trie

Before the Perl community settled on the term “hashes”, many called them “associative arrays”. This was a common term for an array that looked up elements by a string among dynamic languages in the old days, such as Tcl.

Although the hash algorithm is the popular choice in modern times, it wasn’t always this way. Binary search trees were also popular, but have the disadvantage that they have a worst-case time of O(n), and an average time of O(log(n)) (where n is the number of elements). Hashes lookup keys in constant time; they stay the same speed no matter how many elements you have.

There’s a caveat to hashes though: they have large constant factors, so a binary search tree tends to be faster for a small number of elements. They also don’t come back in any obvious order unless you make them that way by alternative means.

Hashes also have a problem where the value a string hashes to may collide with another string’s hash. When that happens, the value is appended to the list. On lookup, we then have to walk that list. This means that while hashes are generally close to constant time, they actually have a worst-case time of O(n), just like binary search trees. Since hashes are used so much on Perl, it tends to be very careful about colliding hash values, but it’s still a weakness in the algorithm.

This is where the trie comes in (usually pronounced “try”). A trie will lookup keys in O(m) time, where m is the length of the lookup string. Note that hashes actually have this limitation, too. It takes at least O(m) time to hash the string so we can look it up.

Oh, and you can get back the keys in sorted order by doing a pre-order traversal.

Oh, and you can get back a list of keys that match a prefix with practically no additional overhead.

All else being equal, a trie will be on par or better than the speed of hashes. However, in Perl, not all is equal. Perl’s hash implementation is handled by the language internally at the C level. Any trie library you use will go through subroutine/method lookups, which will be substantially slower. If we were rewritting Perl from scratch, maybe you could argue for replacing hashes with tries. But if you were rewritting Perl from scratch, that’s probably pretty low on the list of changes.

But never mind that. Sorted returns and prefix matching are still pretty cool if you need that sort of thing.

Tree::Trie looks like a good place to start experimenting with tries. To get behavior equivalent to hashes, you need to use the methods marked *_data. The other methods only store keys without a value.

Some example code

That’s all I have to say on tries. If you have suggestions for future underappreciated Perl modules, feel free to leave a comment below.

Article Global Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Eli Pets

Tagged , , , . Bookmark the permalink.

One Response to Underappreciated Perl Modules–Tree::Trie

  1. Joe Z says:

    Tries aren’t a panacea, though. They work well when the keys often share prefixes. Otherwise, the constant penalty for a trie lookup is typically much larger than for a hash, and the storage overhead is significantly greater.

    Tries are great with dictionary words, for example. You don’t have to flip through many pages of an English dictionary to realize that there’s quite a lot of redundancy in word prefixes across the language. So, if you’re building up an index of unique words in a document, a trie is often an excellent choice. Likewise for a catalog of identifiers in a program, since programmers often use common prefixes for related identifiers. IP addresses also naturally form prefix-oriented hierarchies.

    Tries work less well for more random strings, such as temporary file names, email addresses (unless the email database is long enough), sparse arrays, and so on. (Although, I imagine you could make an argument that email addresses would work better if you reverse the string first, since the strongest-correlated hierarchical elements are at the end (.com, .org, .net, domain names, etc.))

    The main space overhead/complication in a trie comes in the N-ary tree you need to store the trie. You need a way to represent a tree that can have as many as N children at each node, where N is the total number of valid characters. If you’re using 7-bit ASCII, that’s a branching factor of 128. If you’re using Unicode code-points… you need an extensible structure here. :-) I’m certain I’ve encountered (and since forgotten) clever ways to keep the representation compact, but I’m pretty certain it’s still more complicated than hash buckets and chaining.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">