Sunday, January 6, 2013

AVL vs. Red-Black: the conclusion

    I accidentally deleted my original blog post about this, and I can't recover it despite trying all sorts of different tricks. So, here it is reposted. Sorry for any broken links.

    I have spent possibly too long thinking about binary search trees, AVL trees, and red-black trees. I have finally backed up my original theory with mathematical proofs and have even gone further and realized more aspects of the behavior of AVL and red-black trees.


Let us first begin with binary search trees:

    The time it takes to search a binary search tree is proportional to the height of the tree [1]. For a complete binary search tree, this is about log2(size) [1]; more precisely, it is log2(size+1)-1. When inserting data randomly, the performance of the tree stays close to O(log2(size)) [2]. So, randomly inserted data keeps the tree more balanced and therefore faster for searching and for the other operations that depend on searching.

    When inserting data, we must first search down to a leaf of the tree to figure out where the data is to be inserted. The insertion itself is simple, but because it depends on searching, it has the same cost as a search until we get into the specializations that do more work (AVL and red-black trees).

    When erasing data, we must first find the node to erase. If that node has two children, we swap it with a node that has at most one child, and erase it after the swap. In a balanced tree, where the left height of a subtree is close to the right height, a node with at most one child has very few descendants, which means it is close to a leaf of the tree. Therefore, erasing performs the same as inserting: it requires a search down to a leaf of the tree followed by a simple erase operation.
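
    To make this concrete, here is a minimal unbalanced binary search tree in C++ (a sketch of my own for illustration; the names are mine and none of this code comes from the cited sources). Search, insert, and erase all perform the same walk from the root down toward a leaf:

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Search walks one level per comparison, so its cost is the height.
Node* search(Node* root, int key) {
    while (root && root->key != key)
        root = (key < root->key) ? root->left : root->right;
    return root;
}

// Insertion is the same walk, plus attaching one new leaf.
Node* insert(Node* root, int key) {
    if (!root) return new Node{key};
    if (key < root->key) root->left = insert(root->left, key);
    else                 root->right = insert(root->right, key);
    return root;
}

// Erasure: a node with two children is first "swapped" with its
// in-order successor, which has at most one child, so the real
// removal always happens near the bottom of the tree.
Node* erase(Node* root, int key) {
    if (!root) return nullptr;
    if (key < root->key)      root->left = erase(root->left, key);
    else if (key > root->key) root->right = erase(root->right, key);
    else if (root->left && root->right) {
        Node* succ = root->right;
        while (succ->left) succ = succ->left;
        root->key = succ->key;                    // the swap step
        root->right = erase(root->right, succ->key);
    } else {
        Node* child = root->left ? root->left : root->right;
        delete root;
        return child;
    }
    return root;
}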

    One problem comes when inserting data in order. Ordered data causes the tree to become skewed [2]. The worst-case height of the tree is the size of the tree [1][2], and this happens exactly when the tree is skewed by ordered insertions. AVL and red-black trees employ balancing techniques to keep this worst case from happening.
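
    Assuming the Node and insert from the sketch above are in scope, the skew is easy to demonstrate: every ordered insertion takes the right branch, so the "tree" degenerates into a chain whose height equals its size.

#include <cstdio>

int height(Node* n) {
    if (!n) return 0;
    int hl = height(n->left), hr = height(n->right);
    return 1 + (hl > hr ? hl : hr);
}

int main() {
    Node* root = nullptr;
    for (int k = 1; k <= 1000; ++k)   // keys arrive already sorted
        root = insert(root, k);
    std::printf("height = %d\n", height(root));   // prints 1000
}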


Red-black trees:

    When inserting random data into a red-black tree, since it is a specialization of a binary search tree, the results will be mostly balanced. When a rebalance is needed, at most two rotations will need to be done [6]. The worst-case number of operations is log2(size) [4][5], probably due to traversing up the tree to find the place at which to rebalance [6][7] (and the code from [5]). However, the amortized number of operations is constant [4][6][8].
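
    To make the two costs visible, here is a sketch of my own of the standard insert fix-up loop, in the style of the usual tutorials (compare [6][7][10]) but not taken from them. After an ordinary binary-search-tree insertion of a red node, the recoloring case climbs two levels per iteration, which is where the worst-case log2(size) comes from; either rotation case ends the loop, which is why at most two rotations ever happen per insertion:

enum Color { RED, BLACK };

struct RbNode {
    int key;
    Color color = RED;
    RbNode* kid[2] = {nullptr, nullptr};   // kid[0] = left, kid[1] = right
    RbNode* parent = nullptr;
};

struct RbTree { RbNode* root = nullptr; };

// Rotate the subtree rooted at x; dir == 0 is a counterclockwise
// (left) rotation, dir == 1 is its mirror image.
void rotate(RbTree& t, RbNode* x, int dir) {
    RbNode* y = x->kid[1 - dir];           // the child that rises
    x->kid[1 - dir] = y->kid[dir];         // x inherits y's inner subtree
    if (y->kid[dir]) y->kid[dir]->parent = x;
    y->parent = x->parent;
    if (!x->parent) t.root = y;
    else x->parent->kid[x == x->parent->kid[1]] = y;
    y->kid[dir] = x;
    x->parent = y;
}

void insert_fixup(RbTree& t, RbNode* z) {
    while (z->parent && z->parent->color == RED) {
        RbNode* p = z->parent;
        RbNode* g = p->parent;             // exists, since the root is black
        int side = (p == g->kid[1]);
        RbNode* uncle = g->kid[1 - side];
        if (uncle && uncle->color == RED) {
            p->color = BLACK;              // recolor and continue two
            uncle->color = BLACK;          // levels up -- this climb is
            g->color = RED;                // the log2(size) worst case
            z = g;
        } else {
            if (z == p->kid[1 - side]) {   // inner child: first rotation
                rotate(t, p, side);
                z = p;
                p = z->parent;
            }
            p->color = BLACK;              // outer child: second rotation,
            g->color = RED;
            rotate(t, g, 1 - side);
            break;                         // no violations remain
        }
    }
    t.root->color = BLACK;
}

// Ordinary leaf insertion, followed by the fix-up above.
void insert(RbTree& t, int key) {
    RbNode** link = &t.root;
    RbNode* parent = nullptr;
    while (*link) {
        parent = *link;
        link = &parent->kid[key > parent->key];
    }
    *link = new RbNode{key};
    (*link)->parent = parent;
    insert_fixup(t, *link);
}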

    When erasing random data from a red-black tree, since it is a specialization of a binary search tree, the results will be mostly balanced. When a rotation is needed, at most three rotations will be done [6]. As with insertion, the amortized number of operations is constant [6][8].

    When inserting data in order into a red-black tree, more rebalances will need to be done. As mentioned above, binary search trees become skewed with ordered data. The sources cited above for random data do not say that balancing after an insertion performs any differently for ordered data; it is still amortized constant time.

    When erasing data in order, again, the sources above do not mention any performance difference for ordered data. It is still amortized constant time.

    As mentioned above, working with random data creates a tree that is approximately balanced. That said, red-black trees are less rigidly balanced than AVL trees [6][9] (and implied by [4]). If rotations are needed while building the tree, the AVL tree will be more balanced and have a smaller height. Because of this, searching a red-black tree can be slower than searching an AVL tree. However, since random data approaches a balanced tree, as mentioned above, the difference will be slight: both heights stay close to log2(size).

    When working with ordered data, the red-black tree becomes skewed. When erasing in order, the left side becomes lighter than the right side. It is not necessary to find an equation for exactly how much lighter. When comparing with the more rigidly balanced AVL tree, it is enough to show that, when erasing in order, the left side of the tree is lighter than the right side; this is more pronounced for red-black trees because they are less rigidly balanced (stated above).

    When inserting ordered data, it seems as though the red-black tree may fall behind the AVL tree, so a proof is needed for this case. One paragraph will not adequately explain this, and few sources even mention it.


Proof of structure for inserting ascending data:

Let s be the size of a tree
Let sl be the size of its left subtree
Let sr be the size of its right subtree
Let hl be the height of its left subtree
Let hr be the height of its right subtree
Let h be the height of a subtree

    Two adjacent nodes in a red-black tree can contribute only one level to the black height, by one of the nodes being red. This is often used to make balancing simpler and can be seen in most tutorials [6][7][10].

    When a red-black tree gets out of balance, either color changes are made or a rotation occurs. When rotating counterclockwise (left) in a binary search tree, the node at which the rotation occurs receives, as its new right subtree, the left subtree of its former right child [12].
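
    As a sketch of just this step (illustrative code of my own, not taken from [12]), a counterclockwise rotation on a plain binary search tree looks like this:

struct BstNode {
    int key;
    BstNode* left;
    BstNode* right;
};

// Counterclockwise (left) rotation about 'root': the former right
// child rises, and the old root receives that child's left subtree
// as its new right subtree -- the hand-off described above.
BstNode* rotate_left(BstNode* root) {
    BstNode* pivot = root->right;   // must exist for a left rotation
    root->right = pivot->left;      // pivot's left subtree moves over
    pivot->left = root;             // old root becomes pivot's left child
    return pivot;                   // pivot is the new subtree root
}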

    That said, when inserting in order, the tree becomes skewed to the right, as mentioned earlier for binary search trees. The balancing that takes place will set the parent's sibling to black if it is red, or rotate counterclockwise if that sibling is already black. This takes advantage of the fact that red nodes can be used to minimize the number of rotations that need to be done, and it can be seen in the instructions of most tutorials.

    The structure of the tree just before a rotation needs to be found, so that the range of sizes between rotations, the fraction of time the tree spends skewed before a rotation, and the level at which the rotation occurs can all be deduced.

    At the smallest subtree, there is a triangle shape with colors R-B-R before rotation. When another node is inserted on the right, the leftmost node of the subtree is changed to black and the parent of the inserted node is changed to black. The parent of both is changed to red to maintain the black height of the tree. If the outer edge of the entire tree alternates red and black, as happens when inserting in order so that fewer rotations are needed, a red violation occurs and the parent of the subtree must be changed to black. Because of this, that parent's sibling must be changed to black and its parent to red (to keep the black height down). This case propagates all the way up the tree, changing everything but the outer right edge of the tree to all-black nodes. The tree becomes heavier by one level because, once the root is reached, no more nodes can be changed to red, and a rotation must occur.

    In a red-black tree, the right subtree of a subtree must have the same black height as the left subtree. Since the outer edge when inserting in order alternates red and black, the problem must be divided into two cases. For a red node on the outer edge, its left child must be black. Because of the all-black children mentioned above after rotations, that black node's children are also black. Its size is 2^(black_height-1)-1, or 2^(h-1)-1.

    If a subtree on the right side of the tree has a root of black, then its child may be either red or black per the red-black tree rules. However, since it must have the same black height of its sibling, and the rotations change its left node to red, its left node cannot be changed to black without making its parent unbalanced. The nodes below its red left child, however, will all be black as deduced above. The size of the left subtree of a black node is thus 2^(black_height+1-1)-1, or 2^(h-1)-1 since h=h_b+1 in this case.

    Because the outer edge of a red-black tree after inserting in order alternates red and black, starting at red just below the root and ending at red at the rightmost node (which has an imaginary black child), both the number of red nodes and the number of black nodes equal the black height. Thus, h_b = hr/2, and the subtrees from the bottom up to hr/2 should be counted. Since hr does not count the imaginary black node, h_b = (hr+1)/2, and the upper bound of where to count should be (hr+1)/2.

    Combining the realizations of the past three paragraphs, where r(h) is the size hanging to the left of a red outer node and b(h) is the size hanging to the left of a black outer node, the size of the right subtree of the root is r(1)+1 + b(2)+1 + r(2)+1 + ... + b((hr+1)/2)+1 + r((hr+1)/2)+1. Notice that b(1) is not in the equation; that is because it refers to an imaginary leaf node.

    The black height of the left subtree of the root is log2(sl+1) since all nodes are black and it is a complete tree due to the rotations and parts of the algorithm mentioned above.

    The relationship between the height of the left subtree (hl) and that of the right subtree (hr) is hr = 2*hl+1. This is because the height of the left subtree is one less than its black height, while the height of the right subtree is one less than twice its black height: just before a rotation, the right edge begins with a red node and ends with a red node.

Given:
hl = log2(sl+1)
hr = 2*hl+1
sr = s - sl - 1
sr = r(1)+1 + b(2)+1+r(2)+1 ... + b((hr+1)/2)+1+r((hr+1)/2)+1
r(h) = b(h) = 2^(h-1)-1

Want:
hl in terms of s
hr in terms of s

hl = log2(sl+1)
2^hl = sl + 1
2^hl - 1 = sl
sl = 2^hl - 1

sr = r(1)+1 + b(2)+1+r(2)+1 ... + b((hr+1)/2)+1+r((hr+1)/2)+1
sr = r(1)+1 + 2*(r(2)+1) ... + 2*(r((hr+1)/2)+1)
sr = 2^(1-1)-1+1 + 2*(2^(2-1)-1+1) ... + 2*(2^((hr+1)/2-1)-1+1)
sr = 2^(1-1) + 2*2^(2-1) ... + 2*2^((hr+1)/2-1)
sr = 2^0 + 2^2 ... + 2^((hr+1)/2)
sr = 1 + 2^2 ... + 2^((hr+1)/2)
sr = 1 + (2^((hr+1)/2+1) - 4)      (geometric series: 2^2 + ... + 2^k = 2^(k+1) - 4)
sr = -3 + 2^((hr+1)/2+1)

sr = s - sl - 1
sr + sl + 1 = s
s = sr + sl + 1

s ~= (-3 + 2^((hr+1)/2+1)) + (2^hl - 1) + 1
s ~= -3 + 2^((hr+1)/2+1) + 2^hl

s ~= -3 + 2^((2*hl+1+1)/2+1) + 2^hl
s ~= -3 + 2^((2*hl+2)/2+1) + 2^hl
s ~= -3 + 2^(hl + 1 + 1) + 2^hl
s ~= -3 + 2^(hl + 2) + 2^hl
s + 3 ~= 2^(hl + 2) + 2^hl
log2(s+3) ~= log2( 2^(hl+2) + 2^hl )
log2(s+3) ~= log2( 2^hl * (2^2 + 1) )
log2(s+3) ~= hl + log2(5)
log2(s+3) ~= hl + 2.32 ~= hl + 2
hl ~= log2(s+3) - 2

hr = 2*hl+1
hr ~= 2 * (log2(s+3)-2) + 1
hr ~= 2*log2(s+3) - 4 + 1
hr ~= 2*log2(s+3) - 3

Therefore,
hl ~= log2(s+3) - 2
hr ~= 2*log2(s+3) - 3
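
    As a quick sanity check of my own on the algebra, the two closed forms can be evaluated directly; these values reappear in the table below:

#include <cmath>
#include <cstdio>

// Evaluate hl ~= log2(s+3) - 2 and hr ~= 2*log2(s+3) - 3
// at a few of the sizes used in the table below.
int main() {
    for (int s : {1, 5, 13, 29, 38}) {
        double hl = std::log2(s + 3.0) - 2.0;
        double hr = 2.0 * std::log2(s + 3.0) - 3.0;
        std::printf("s = %2d:  hl ~= %.2f  hr ~= %.2f\n", s, hl, hr);
    }
}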

    To double-check the math done above, a tree was built by inserting data in ascending order on an online red-black tree applet [11]. In the table below, s, sl, sr, hl, and hr were read off the applet's tree, while hl~ and hr~ were computed from the equations above.

 s   sl  sr  hl  hr  hl~   hr~
 1    0   0   1   1  0     1
 2    0   1   1   2  0.32  1.64
 3    1   1   2   3  0.58  2.17
 4    1   2   2   3  0.81  2.61
 5    1   3   2   3  1     3
 6    1   4   2   4  1.17  3.34
 7    1   5   2   4  1.32  3.64
 8    3   4   3   4  1.46  3.98
 9    3   5   3   4  1.58  4.17
10    3   6   3   5  1.70  4.40
11    3   7   3   5  1.81  4.61
12    3   8   3   5  1.91  4.81
13    3   9   3   5  2     5
14    3  10   3   6  2.09  5.18
15    3  11   3   6  2.17  5.34
16    3  12   3   6  2.25  5.50
17    3  13   3   6  2.32  5.64
18    7  10   4   6  2.39  5.78
19    7  11   4   6  2.46  5.92
20    7  12   4   6  2.52  6.05
21    7  13   4   6  2.58  6.17
22    7  14   4   7  2.64  6.29
23    7  15   4   7  2.70  6.40
24    7  16   4   7  2.75  6.51
25    7  17   4   7  2.81  6.61
26    7  18   4   7  2.86  6.72
27    7  19   4   7  2.91  6.81
28    7  20   4   7  2.95  6.91
29    7  21   4   7  3     7
30    7  22   4   8  3.04  7.09
31    7  23   4   8  3.09  7.17
32    7  24   4   8  3.13  7.26
33    7  25   4   8  3.17  7.34
34    7  26   4   8  3.21  7.42
35    7  27   4   8  3.25  7.50
36    7  28   4   8  3.29  7.57
37    7  29   4   8  3.32  7.64
38   15  22   5   8  3.36  7.72

    Notice that the equations above give slightly lower values, but the values are close. Since we are comparing the speed of red-black trees to that of AVL trees, lower values will not hurt the red-black tree's timings. As long as the computed values are close and not greater than the actual values, the comparison will be valid if the AVL tree beats the computed values.


AVL trees:

    When inserting random data into an AVL tree, since it is a specialization of a binary search tree, the results will be mostly balanced. When a rebalance is needed, at most two rotations will need to be done [13][14]. However, most sources agree that the time to rebalance the tree is logarithmic, O(log2(size)) [15][16]; one source states that it is constant time [17]. Even though at most two rotations need to be done, the point at which to rotate must still be found. The same is true for a red-black tree (as mentioned earlier), but the AVL tree is more rigidly balanced (also mentioned earlier). Perhaps rebalancing after an insertion is not agreed to be amortized constant time for AVL trees, as it is for red-black trees, because AVL trees go to greater lengths to keep the tree balanced. For the comparisons, it will be assumed that AVL trees run in logarithmic time for rebalancing. In either case, red-black trees rebalance more quickly.

    When erasing random data from an AVL tree, since it is a specialization of a binary search tree, the results will be mostly balanced. When rebalancing is needed, up to log2(size) rotations can take place [13][17], and the number of operations to rebalance the tree after an erase is up to log2(size) [13][15][16].

    The time for both inserting and erasing random data is therefore proportional to log2(size). It may be slightly faster for insertion, though. For simplicity's sake, and because random data is not the primary concern being investigated, that fact will be ignored for the comparisons later.
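
    For concreteness, here is a recursive AVL insertion sketch (again my own illustration, not code from the cited sources). It makes the logarithmic rebalancing cost assumed above visible: even when no rotation is performed, every node along the search path has its height recomputed on the way back up.

#include <algorithm>

struct AvlNode {
    int key;
    int height = 1;            // height counted in nodes
    AvlNode* left = nullptr;
    AvlNode* right = nullptr;
};

int height(AvlNode* n) { return n ? n->height : 0; }

void update(AvlNode* n) {
    n->height = 1 + std::max(height(n->left), height(n->right));
}

AvlNode* rotate_right(AvlNode* y) {
    AvlNode* x = y->left;
    y->left = x->right;
    x->right = y;
    update(y);
    update(x);
    return x;
}

AvlNode* rotate_left(AvlNode* x) {
    AvlNode* y = x->right;
    x->right = y->left;
    y->left = x;
    update(x);
    update(y);
    return y;
}

// Rebalance one node: at most two rotations here, but the height
// updates run at every node on the search path (the log factor).
AvlNode* rebalance(AvlNode* n) {
    update(n);
    int bf = height(n->left) - height(n->right);
    if (bf > 1) {                                  // left-heavy
        if (height(n->left->left) < height(n->left->right))
            n->left = rotate_left(n->left);        // left-right case
        return rotate_right(n);
    }
    if (bf < -1) {                                 // right-heavy
        if (height(n->right->right) < height(n->right->left))
            n->right = rotate_right(n->right);     // right-left case
        return rotate_left(n);
    }
    return n;
}

AvlNode* insert(AvlNode* root, int key) {
    if (!root) return new AvlNode{key};
    if (key < root->key) root->left = insert(root->left, key);
    else                 root->right = insert(root->right, key);
    return rebalance(root);        // runs at every node on the path
}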

    When working with ordered data, the AVL tree becomes skewed like any binary search tree. The worst-case skew for an AVL tree yields a height of approximately 1.44*log2(size) [3][14][15][18]. However, this worst case is not produced by inserting in order, so ordered insertion needs to be investigated further.


Proof of structure for inserting ascending data:

Let s be the size of a subtree
Let sl be the size of its left subtree
Let sr be the size of its right subtree
Let h be the height of a tree
Let hl be the height of its left subtree
Let hr be the height of its right subtree

    When inserting the first two values, no rotations are needed. After another insertion, the tree needs to be rotated to regain the height it had beforehand. When a fourth value is inserted, the subtree is not unbalanced, but its height is one more than it was before. If it has a parent, a rotation will be needed there, since initially the height was 1 and now it is 3. Since h=3, hl=1, and hr=2, after the rotation at the parent (whose hl is 1, following the pattern), the parent's hl becomes 2 and its hr becomes 2. The tree becomes complete.

    The logic propagates up the tree. After balancing, the tree is nearly complete. Indeed, if the size is one less than a power of two, the AVL tree is a complete tree.

    Since the tree is complete, every size that is one less than a power of two has equal hl and hr and equal sl and sr. Thus, the search time is equal to that of a perfectly balanced tree, or log2(s).

    The time to rebalance after an insertion is still near log2(s). Thus, the total time to insert a value in ascending order is 2*log2(s). This will be used in the comparisons.
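
    This claim can be checked directly, assuming the AVL insert and height helpers sketched earlier are in scope: inserting 2^k - 1 keys in ascending order should produce a complete tree.

#include <cstdio>

int main() {
    AvlNode* root = nullptr;
    for (int k = 1; k <= 127; ++k)      // 127 = 2^7 - 1 keys, in order
        root = insert(root, k);
    // A complete tree of 127 nodes has height 7 = log2(128).
    std::printf("height = %d\n", height(root));
}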


Comparisons of red-black and AVL trees:

    As mentioned earlier, the worst-case search time of any binary search tree is its height. Also, inserting or erasing a node requires a traversal to the bottom of the tree, and so its cost tracks the average height of the tree before the insertion or deletion occurs.

    In a red-black tree, rebalancing after inserting or erasing data runs in amortized constant time. This is true for random data and is likely true for ordered data as well.

    In an AVL tree, rebalancing after inserting or erasing data runs in logarithmic time related to the height of the tree. For insertion, fewer operations are needed, but most sources still state that it is logarithmic.

    To compare inserting ordered data between the trees, the time it takes to search for the rightmost node needs to be found, and then that needs to be added to the time it takes to rebalance the tree.

    To compare erasing ordered data between the trees, the time it takes to search for the leftmost node needs to be found, and then that needs to be added to the time it takes to rebalance the tree.

    For inserting random data, both trees will find the insertion point in close to log2(size) operations (the AVL tree will likely do slightly fewer operations than the red-black tree). Rebalancing after insertion is amortized constant time in red-black trees and logarithmic in AVL trees. Thus, the time it takes to insert an element is about log2(size)+C in a red-black tree and close to 2*log2(size) in an AVL tree.

    For erasing random data, both trees will find the node to erase in close to log2(size) operations (again with the AVL tree doing slightly fewer operations than the red-black tree). Rebalancing after erasing is amortized constant time for red-black trees and logarithmic in AVL trees. Just like with insertion, red-black trees delete random data in about log2(size)+C operations, and AVL trees in about 2*log2(size).
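
    To put numbers on the random case: at size 2^20 (about one million elements), log2(size) is 20, so the red-black tree spends roughly 20 + C operations per insertion or deletion while the AVL tree spends roughly 2*20 = 40.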

    For inserting ordered data, it was found that red-black trees become skewed to the right. The rightmost node has a height greater than 2*log2(size+3)-3. Rebalancing after an insertion takes amortized constant time, though. Thus, the time it takes a red-black tree to insert ordered data is 2*log2(size+3)-3+C.

    For inserting ordered data into an AVL tree, it was found that a complete tree is formed. The total time to insert data was also found to be almost 2*log2(size). This is more efficient than for red-black trees.

    For erasing ordered data, because red-black trees are less rigidly balanced than AVL trees, the leftmost node can be closer to the root of the tree. Searching for that node is faster in the red-black tree than in the AVL tree, and because red-black trees are also faster at rebalancing after a deletion, deleting data in order is faster in a red-black tree.


Conclusions:

    The algorithms for red-black trees are more efficient at inserting and erasing data, except in the case of inserting ordered data in which AVL trees are slightly more efficient. Searching is close to the same for both trees when working with random data, but AVL trees come out ahead. For sorted data, AVL trees are also more efficient (possibly unless searching for the lowest value when erasing in order).

    The main purpose was to prove that ordered insertion was more efficient in AVL trees, and that has been shown to be true.


Implementation concerns:

    The algorithms for red-black trees (compare [5][6][7][10][11][15]) are more complex than those for AVL trees (compare [3][13][14][15][16][18]). Because of this, and because the log2(size) on which the algorithmic complexities are based grows so slowly (it is only 24 for 16,777,216), implementations of AVL trees may outperform red-black trees on computers that do not have enough memory to test larger sizes. Indeed, my benchmarks [19] show AVL trees outperforming red-black trees, and this is because my computer does not have the RAM to test sizes large enough for the logarithmic edge of red-black trees to take effect.
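
    For reference, the kind of micro-benchmark I mean looks like the sketch below (my own, not from the benchmark post). std::set is commonly implemented as a red-black tree in the major standard libraries; an AVL implementation would be timed the same way at the same sizes.

#include <chrono>
#include <cstdio>
#include <set>

// Time ordered insertion into std::set (commonly a red-black tree).
int main() {
    for (int n : {100000, 400000, 900000}) {
        std::set<int> tree;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i)
            tree.insert(i);                      // ascending keys
        auto stop = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<
            std::chrono::milliseconds>(stop - start).count();
        std::printf("n = %d: %lld ms\n", n, ms);
    }
}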


Footnotes and references: (references were accessed May 8, 2012)

1 - http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/binarySearchTree.htm
2 - http://eternallyconfuzzled.com/tuts/datastructures/jsw_tut_bst1.aspx
3 - http://en.wikipedia.org/wiki/AVL_tree
4 - http://attractivechaos.wordpress.com/2008/10/02/comparison-of-binary-search-trees/
5 - http://www.cs.auckland.ac.nz/~jmor159/PLDS210/red_black.html
6 - http://en.wikipedia.org/wiki/Red%E2%80%93black_tree
7 - http://nathanbelue.blogspot.com/2012/04/red-black-trees.html
8 - http://opendatastructures.org/versions/edition-0.1d/ods-java/node46.html
9 - http://xlinux.nist.gov/dads/HTML/redblack.html
10 - http://eternallyconfuzzled.com/tuts/datastructures/jsw_tut_rbtree.aspx
11 - http://gauss.ececs.uc.edu/RedBlack/redblack.html
12 - http://en.wikipedia.org/wiki/Tree_rotation
13 - http://eternallyconfuzzled.com/tuts/datastructures/jsw_tut_avl.aspx
14 - http://cis.stvincent.edu/carlsond/swdesign/avltrees/avltrees.html
15 - http://www.cs.oberlin.edu/~jdonalds/151/lecture17.html
16 - http://www.cs.auckland.ac.nz/~jmor159/PLDS210/AVL.html
17 - http://www.daniweb.com/software-development/cpp/threads/230593/time-complexity-avl-tree
18 - http://pages.cs.wisc.edu/~siff/CS367/Notes/AVLs/
19 - http://nathanbelue.blogspot.com/2012/05/red-black-versus-avl.html

5 comments:

  1. a very very good Article , Thanks a lot .
    Mehdi Valeh / Iran / University Of Tehran .

    ReplyDelete
  2. I like this article and will bookmark it. I think you should definitely look at other studies and more closely at implementation.

    One study shows red black trees performing better which might be because of implementation but also because of the data. Once you go beyond time complexity implementation is a whole new can of worms which can vary from platform to platform. To be fair the difference is rarely enough to write home about though.

    You can also consider things such as whether you keep a parent link, a linked list in order, BTree (2 items is a special case because you would naturally unroll rather than have the over head of iteration), etc. There are a lot of subtle aspects of the trees when it comes to implementation. Between platforms there can be slight differences that might favour one tree over the other.

    On the rotate topic, a lot of people talk about it only needing two rotations but this is an abstraction. In implementation you would not rotate twice but act on the data structures directly. The cost between one or two rotations is roughly the same. It might even be faster sometimes for two rotations which is actually a straight swap.

    ReplyDelete
    Replies
    1. That's a good point regarding implementation.

      I did run some benchmarks for up to 900,000 elements. In general, the AVL trees were faster at insertion (when compared to a similar red-black tree). However, the fastest implementation of the two that I found was actually the red-black tree in the linux source code!

      So yea, implementation plays a pretty big role.

      Delete
  3. You say: "Rebalancing after insertion is amortized constant time in red-black (RB) trees and logarithmic in AVL trees." More precisely: in RB trees rebalancing after update (insertion or deletion) is amortized constant time and logarithmic in the worst case; in AVL trees it is amortized constant time and logarithmic in the worst case for insertions and average constant time and logarithmic in the worst case for deletions. Average constant because the probability of the necessity to climb a level towards the root is < q < 1 which sums up to a bounded number.
    What I was unable to find out and do not understand: Why is there such a big difference between search after random update and search after update in order ? Do you use different search routines ? I understand that updates in order kind of skew the tree, e.g. insert ascendingly makes the right half of the tree heavier because every insertion hits the right half. This should have the consequence that the distance from root to leaves on the right side (= the side of insertion) is greater than expected from log2(size). Moreover, I would expect this skew being bigger with RB than with AVL, because RB minimal trees (see OEIS A027383) are less balanced than Fibonacci trees. In all, search after insert ascendingly should perform worse than search after insert randomly with both, RB and AVL, a result which is just the opposite of your results. What am I missing ?

    ReplyDelete
    Replies
    1. That is correct. Random data is better than ordered data. I'm not sure I understand where the discrepancy is. For random insertions, searches are close to log2(size) operations. Ordered insertions is close to the same for AVL, but it is "2*log2(size+3)-3" for RB.

      It may be worth noting that I pit the worst case for AVL against ordered insertion with RB. It's not a fair comparison, favoring RB. Since AVL still came out ahead, it must be better with ordered insertion too.

      If you can clarify what numbers of mine you're looking at, I'll take another look.

      Delete