Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage collection in ibd_segments, link_ancestors and simplify #2461

Open
gtsambos opened this issue Aug 10, 2022 · 1 comment
Open

Garbage collection in ibd_segments, link_ancestors and simplify #2461

gtsambos opened this issue Aug 10, 2022 · 1 comment
Labels
enhancement New feature or request Performance This issue addresses performance, either runtime or memory

Comments

@gtsambos
Copy link
Member

All three of these methods rely on an object called ancestor_map (C codebase) or A (Python mockups) to store segments that descend from each ancestral node in the tree sequence. In the case of ibd_segments and link_ancestors, these can be a big memory hog (much bigger than the size of the tree sequences themselves). However, the segments corresponding to a given ancestral node are no longer needed once we've processed all of the edges with that node in the 'child' position of the edge table. By pruning away this garbage periodically, we could improve the memory usage of these methods (at the expense of some runtime).

The difficulty is in the freeing of memory -- since the ancestor_map objects have their memory allocated all at once, they also have to be freed all at once. It's difficult to imagine changing this without substantially complicating the existing code.

one alternative is to periodically create a pruned copy of the ancestor_map, and to then replace the original version with the pruned copy. I'm about to push up a PR that demonstrates this.

@gtsambos gtsambos added enhancement New feature or request Performance This issue addresses performance, either runtime or memory labels Aug 10, 2022
@jeromekelleher
Copy link
Member

One thing you could do is to use the same approach as msprime, and to use something like its object heap which is specifically designed for repeated reuse of the same memory.

It would actually be simpler in many ways to just malloc and free the segments as they are used (that is, free all the segments for a given node once you are done with them). Maybe it's worth mocking this up as a quick test and seeing how it performs (and whether we really do make significant memory gains from it)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Performance This issue addresses performance, either runtime or memory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants