Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved tree node cache functionality (2 of 3) #2099

Merged
merged 6 commits into from
Aug 7, 2024

Conversation

qiyunzhu
Copy link
Contributor

@qiyunzhu qiyunzhu commented Aug 4, 2024

This PR follows #2082 and partially addresses #2083. It improves the caching mechanism of TreeNode.

Previously, every node in a tree has three private attributes: _tip_cache, _non_tip_cache and _registered_caches. They are largely empty or duplicate. This is very inefficient, especially in memory. #2082 removed the necessity of the former two. The current PR removed the last one. Now they are attached to the root node only in a lazy mode.

Specifically, _tip_cache, _non_tip_cache are lookup tables which facilitate searching nodes by name. They are automatically invoked in find and find_all. Meanwhile, _registered_caches stores names of attributes assigned to each node. These attributes are assigned in a post order traversal using the cache_attr method.

The cache_attr method is very useful in saving tree traversal time for many applications. So far, it is only causally used in the majority_rule function to calculate a consensus tree given multiple trees. The current PR significantly expanded its functionality. Now, one can calculate node attributes such as: 1) number of nodes descending from each node, 2) total sum of branch lengths of each clade, 3) accumulative distances from all tips to each node. Calculating these properties is much faster with the improved cache_attr than calling alternative methods individually for multiple times. The docstring of this method has been enriched to provide these examples.

TODO: The last important thing is to make cache clearance optional in every single tree manipulation operation. Currently, even appending a single node requires traversing the entire tree to remove the caches. This is very inefficient. Some methods even requires many calls of the clearance operation. All of them need to be optimized to save unnecessary traversals.

At least the following methods need to be improved: append, extend, insert, remove, unroot, pop, shuffle. Each method is also used by other methods and they need to be optimized as well. For example, append is used in copy, bifurcate, unpack_by_func, from_taxonomy and from_linkage_matrix.


Please complete the following checklist:

  • I have read the contribution guidelines.

  • I have documented all public-facing changes in the changelog.

  • This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.

    • It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
  • This pull request does not include code, documentation, or other content derived from external source(s).

Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

@qiyunzhu qiyunzhu requested a review from mataton August 4, 2024 20:27
Copy link
Collaborator

@mataton mataton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only comments are minor changes to the documentation. The actual code looks good to me.


Cache a list of all descending tip names on each node. This faciliates the
assignment of taxon set under each clade in the tree. It resembles but is more
efficient than calling ``subset`` for multiple times.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove the word 'for' from this line.


Cache the number of nodes per clade. The function ``sum`` is used in place of
cache type such that the count will be accumulated. This resembles but is more
efficient than calling ``count`` for multiple times.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, 'for' may be removed.

7

Cache the sum of branch lengths per clade. This resembles but is more efficient
than calling ``descending_branch_length`` for multiple times. Note: the result
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well.

clade. This allows one to measure the depth of a clade from the surface (tips)
of a tree. One can further apply calculations like mean and standard deviation
to the results. This is more efficient than calling ``accumulate_to_ancestor``
for multiple times. Also note that the result includes the stem branch of each
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the last 'for' which may be removed.

Examples
--------
>>> from skbio.tree import TreeNode
>>> tree = TreeNode.read(["((a,b)c,(d,e)d,(f,g)c);"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be illustrative to include the tree ASCII art here.

@qiyunzhu
Copy link
Contributor Author

qiyunzhu commented Aug 5, 2024

@mataton Thank you! I have addressed your comments.

@qiyunzhu qiyunzhu merged commit a089ac0 into scikit-bio:main Aug 7, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants