Skip to content

Efficient seeking to trees. #2661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 21, 2023
Merged

Conversation

molpopgen
Copy link
Member

@molpopgen molpopgen commented Dec 12, 2022

Adds tree indexing to tsk_treeseq_t.
The indexes allow efficient advancing to arbitrary trees.

Fixes #2659

@codecov
Copy link

codecov bot commented Dec 12, 2022

Codecov Report

Merging #2661 (865caf5) into main (a8e9f67) will increase coverage by 0.01%.
The diff coverage is 92.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2661      +/-   ##
==========================================
+ Coverage   89.82%   89.83%   +0.01%     
==========================================
  Files          29       30       +1     
  Lines       28514    28612      +98     
  Branches     5561     5587      +26     
==========================================
+ Hits        25613    25704      +91     
- Misses       1653     1654       +1     
- Partials     1248     1254       +6     
Flag Coverage Δ
c-tests 86.24% <93.58%> (+0.04%) ⬆️
lwt-tests 80.13% <ø> (ø)
python-c-tests 67.17% <81.25%> (+0.04%) ⬆️
python-tests 99.05% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/_tskitmodule.c 88.35% <86.66%> (+<0.01%) ⬆️
c/tskit/trees.c 90.60% <93.58%> (+0.06%) ⬆️
python/tskit/trees.py 98.76% <100.00%> (-0.01%) ⬇️

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a8e9f67...865caf5. Read the comment docs.

@molpopgen molpopgen marked this pull request as ready for review December 12, 2022 19:05
@molpopgen
Copy link
Member Author

test_trees segfaults when updating samples lists. I'm having trouble tracking down what I've broken.

@molpopgen
Copy link
Member Author

Note to self: I may be not considering the direction of iteration during setup.

@jeromekelleher
Copy link
Member

Ahhh, now I get it!

In big O terms, this PR seeks to a random tree in O(num_edges + num_samples), since we iterate over all edges in order to pull out those that intersect with the tree's interval, and then build that tree. We don't incur a height term here, because the tree gets built from the bottom up, so there's never any propagating of stuff from a newly attached subtree upwards to root.

But, using the standard code I think the complexity is O(num_edges + num_samples + num_trees * height) (The first tree costs num_samples to build, in worst case we iterate over all edges and in worst case at each tree we do O(height) work.) which isn't that much different. I think you could probably come up with examples where the current linear scan does quite a bit better, since it's rare we visit all edges in the table, this happens for every seek here.

The O(num_edges) cost is the problem, and is what we discussed in #684. Without something that's O(log(num_edges) + num_samples) for seeking to an arbitrary tree, I'm not all that enthusiastic about the extra complexity, as I think it'll be quite hard to figure out how to tune the current approach.

What sort of performance difference are you observing in your Rust experiments @molpopgen?

@molpopgen
Copy link
Member Author

molpopgen commented Dec 13, 2022

I see a general 5x-ish improvement in tree access times unless the trees being sought are those at the edges of the tree sequence.

Edited: 10x is the max gain. It will be closer to 5x on average.

@molpopgen
Copy link
Member Author

@jeromekelleher -- this is a benchmark result that I uploaded to the tskit Slack way back when.

benchmark.pdf

@molpopgen
Copy link
Member Author

molpopgen commented Dec 13, 2022

@jeromekelleher -- here is another one from Slack, comparing to the current seek method for a larger tree sequence. Circles are the current tskit, purple diamonds are my early experiments. This shows that the benefit does vary depending on what the tree index is, but the new method is better except at the edges.

Also, the recent Python prototype is similar in being about an order of magnitude faster for an arbitrary seek (again, except for the trees at the ends).

with_seek

@molpopgen
Copy link
Member Author

molpopgen commented Dec 13, 2022

I just revisited my rust repo, and here is tskit/rust relative performance on a large tree sequence with 10k diploid samples. The time is the time seek to the i-th tree, measured every 5e3 trees.

relative_perf

@molpopgen
Copy link
Member Author

@jeromekelleher -- I'll save on uploading more images, but I seem to be iterating towards something w/my rust prototype that is always faster than the current implementation, at least for the first half of the trees ;)

@petrelharp
Copy link
Contributor

Perhaps you know this, but I got curious, and the problem we want to solve here is the "interval stabbing problem"; here is an approach that is O(num_edges * log(num_edges). (it's O(n) but that's because they assume endpoints are O(n) integers so they can sort by bucket sort; we'd need to sort by left and right endpoint, which we already have in the edge insertion and removal orders?)

@jeromekelleher
Copy link
Member

Thanks for the plots @molpopgen, these are really helpful. I've no doubt we could do better at the "starting from NULL seek to the kth tree" problem (which is what we're timing here I think, since that's interesting from the parallel computation viewpoint). We probably don't need to store the additional three num_trees arrays to do this, we just need to skip "adding in" edges that aren't in the tree we're seeking to. If we had a specialised seek_from_null function that we called internally for this, I think it would work quite well.

@jeromekelleher
Copy link
Member

jeromekelleher commented Dec 14, 2022

Something like this:

    def seek(self, position):
        """
        Seek to the tree at the position.
        """
        sequence_length = self.sequence_length
        M = self.edges_left.shape[0]
        in_order = self.edge_insertion_order
        out_order = self.edge_removal_order
        edges_left = self.edges_left
        edges_right = self.edges_right
        edges_parent = self.edges_parent
        edges_child = self.edges_child

        j = 0
        k = 0
        left = 0
        right = -1

        while j < M or left <= sequence_length:
            while k < M and edges_right[out_order[k]] == left:
                k += 1
            while j < M and edges_left[in_order[j]] == left:
                if edges_left[in_order[j]] <= position < edges_right[in_order[j]]:
                    p = edges_parent[in_order[j]]
                    c = edges_child[in_order[j]]
                    self.insert_edge(p, c)
                j += 1
            right = sequence_length
            if j < M:
                right = min(right, edges_left[in_order[j]])
            if k < M:
                right = min(right, edges_right[out_order[k]])

            if left <= position < right:
                # We're in the target tree
                break
            left = right
       # In the C code we'd now make sure to set the various counters so that we can
       # advance in either direction based on j and k

A quick numba version I just knocked up on some arbitrary simulations was 2X faster than the ts.at(ts.sequence_length / 2). I'd b willing to bet that a proper C version would be very similar to yours, without the extra memory overhead.

@molpopgen
Copy link
Member Author

molpopgen commented Dec 14, 2022

Here are some raw timings. I went nuts and rewrote all this in rust, but I was smarter this time and just made tsk_tree_insert_edge public, which save a lot of hassle.

The two methods can be made very close. The indexing approach has what looks like a 2x advantage over what @jeromekelleher posted.

benchmarks

Some details:

  1. Each point is a seek to a tree index.
  2. To do this, we are using the indexes generated by methods like in this PR to find the position for tskit to seek or for @jeromekelleher's method.

@molpopgen
Copy link
Member Author

I have managed to get rid of the positive slope in my last graphic for the indexed method. Working on it for the "jk" approach but I may be out of smart pills/time for the day.

@molpopgen
Copy link
Member Author

Okay, I have a working prototype and will do a do-over on this PR. I managed to speed up @jeromekelleher's method by quite a bit -- you can do the same thing w/less logic. The scale of this graph makes it hard to see, but the jk method is at least 25% faster than what I posted yesterday.

benchmarks

@molpopgen molpopgen changed the title add tsk_tree_index_t Efficient seeking to trees. Dec 15, 2022
@molpopgen
Copy link
Member Author

@jeromekelleher -- I pushed a completely new version. It'd be helpful for you to have a look. I haven't dug into figure out why the failing tests fail.

@molpopgen
Copy link
Member Author

Note to self: the final test is failing b/c something is broken with respect to advancing trees. I can reproduce this in my other prototype.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, but I don't think it's correct at the moment. Seeking from either end won't make any difference I think, so I'd just go in one direction to make things simpler.

You need to reason about the edge_removal index also, that's currently missing here.

@molpopgen
Copy link
Member Author

@jeromekelleher -- I'm quite puzzled now. If you look at the last commits, I am comparing what I think the tree is to what tskit generates. For some reason, sample tracking seems to not be working when seeking to a tree using the new method.

@molpopgen
Copy link
Member Author

@jeromekelleher -- I'm quite puzzled now. If you look at the last commits, I am comparing what I think the tree is to what tskit generates. For some reason, sample tracking seems to not be working when seeking to a tree using the new method.

I think that this is due to the semantics of setting up tracked samples -- the flag is not processed during tree initialization.

@molpopgen molpopgen force-pushed the indexed_tree_access branch 2 times, most recently from c7af485 to abcc4e8 Compare December 23, 2022 15:21
@molpopgen
Copy link
Member Author

I think this one is close to ready.

@jeromekelleher
Copy link
Member

Fixing that segfault may be fairly straightforward, it just wasn't obvious to me when I looked at it last night how it could happen - the code in question looked watertight to me.

@molpopgen molpopgen force-pushed the indexed_tree_access branch from 19d2b90 to c2a4b47 Compare April 5, 2023 21:16
@molpopgen
Copy link
Member Author

@jeromekelleher -- the valgrind issue and segfaults are due to undefined behavior introduced here (line 4636 in case it isn't clear from the link).

@jeromekelleher
Copy link
Member

Doh, was that my fault? Sorry!

@molpopgen
Copy link
Member Author

molpopgen commented Apr 6, 2023

So we're back to a situation where all the tests are passing, etc.. Shall I squash this one?

@jeromekelleher
Copy link
Member

Yes please!

@molpopgen molpopgen force-pushed the indexed_tree_access branch from a0dbc4e to 6763118 Compare April 6, 2023 10:07
@molpopgen
Copy link
Member Author

AFAICT, this looks good to go?

@jeromekelleher
Copy link
Member

Yep, looks good, just needs a few admin things like updating changelogs, linking to issues and some docs. Leave it with me.

I just added a commit to use the c seek_index implementation.

@molpopgen
Copy link
Member Author

"jupyter-book not found" 😆

@jeromekelleher
Copy link
Member

Opened #2740 to track updating the docs. This is ready to go now once we get CI working again.

@jeromekelleher
Copy link
Member

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Apr 21, 2023

rebase

✅ Branch has been successfully rebased

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Apr 21, 2023
@mergify mergify bot merged commit 7ddcc95 into tskit-dev:main Apr 21, 2023
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Apr 21, 2023
@molpopgen molpopgen deleted the indexed_tree_access branch April 21, 2023 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add tsk_tree_seek_index
3 participants