Description
#2786 added the new tsk_tree_position_t
class which takes care of tracking the edges that need to go in and out of a tree in order to transform it into a target tree. This works really well for the basic next
and prev
operations, and there is an implementation of seek_forward
, which gives a straightforward way of transforming a tree into any other tree in the sequence, touching the minimum number of edges. It's worth looking at some of the key bits of code here:
# The range of edges we need consider for removal starts
# at the current right index and ends at the first edge
# where the right coordinate is equal to the new tree's
# left coordinate.
j = right_current_index
self.out_range.start = j
# TODO This could be done with binary search
while j < M and right_coords[right_order[j]] <= left:
j += 1
self.out_range.stop = j
# The range of edges we need to consider for the new tree
# must have right coordinate > left
j = left_current_index
while j < M and right_coords[left_order[j]] <= left:
j += 1
self.in_range.start = j
# TODO this could be done with a binary search
while j < M and left_coords[left_order[j]] <= left:
j += 1
self.in_range.stop = j
self.interval.left = left
self.interval.right = breakpoints[index + 1]
Here we fill in the in_range
and out_range
ranges of indexes into the left and right edge sortings which must be examined. We can be sure that we don't need to look at any edges outside of these ranges. Then, on the client side, we do something like this:
old_left, old_right = self.tree_pos.interval
self.tree_pos.seek_forward(index)
left, right = self.tree_pos.interval
for j in range(self.tree_pos.out_range.start, self.tree_pos.out_range.stop):
e = self.tree_pos.out_range.order[j]
e_left = self.ts.edges_left[e]
# We only need to remove an edge if it's in the current tree, which
# can only happen if the edge's left coord is < the current tree's
# right coordinate.
if e_left < old_right:
c = self.ts.edges_child[e]
assert self.parent[c] != -1
self.parent[c] = -1
assert e_left < left
for j in range(self.tree_pos.in_range.start, self.tree_pos.in_range.stop):
e = self.tree_pos.in_range.order[j]
if self.ts.edges_left[e] <= left < self.ts.edges_right[e]:
c = self.ts.edges_child[e]
p = self.ts.edges_parent[e]
self.parent[c] = p
I think this works quite well, and should correspond to something that works well in practise.
HOWEVER there is a problem: the edges that are removed and inserted are not necessarily in time-sorted order. This is a basic assumption that we lean on for incremental algorithms because it guarantees postorder-like behaviour, where we do the minimum amount of work in order to propagate information up the tree. In pathological cases, I think this could lead to O(tree height^2) performance on incremental algorithms that propagate information up the tree (like sample counting).
Unfortunately, this also applies to the current implementation of seek_from_null. Since we see great performance for this in benchmarks, I guess this means that either we've turned off sample counting for these benchmarks (which I doubt), or we have the possibility for some extra performance. See #2661 for more details.
I think the simplest thing to do initially would be to implement seek_from_null
using the tree_pos
based approach outlined above, and to defer propagating sample counts until after all the edges have been inserted. We can then do it using a standard postorder algorithm (which is as good as you're going to do anyway, if you insert the edges in the right time-order way).
How we approach the problem for the general seeking around the tree case, is more subtle. I guess we could either
- Live with the edges going in and out in unsorted order, because it's probably going to be rare that we happen to do them in an order that's genuinely bad
- Collect the edges we're examining and sort them so they are in order (but it's not obvious that this is actually better than the worst case?)