-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved tree node cache functionality (2 of 3) #2099
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only comments are minor changes to the documentation. The actual code looks good to me.
skbio/tree/_tree.py
Outdated
|
||
Cache a list of all descending tip names on each node. This faciliates the | ||
assignment of taxon set under each clade in the tree. It resembles but is more | ||
efficient than calling ``subset`` for multiple times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can remove the word 'for' from this line.
skbio/tree/_tree.py
Outdated
|
||
Cache the number of nodes per clade. The function ``sum`` is used in place of | ||
cache type such that the count will be accumulated. This resembles but is more | ||
efficient than calling ``count`` for multiple times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, 'for' may be removed.
skbio/tree/_tree.py
Outdated
7 | ||
|
||
Cache the sum of branch lengths per clade. This resembles but is more efficient | ||
than calling ``descending_branch_length`` for multiple times. Note: the result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well.
skbio/tree/_tree.py
Outdated
clade. This allows one to measure the depth of a clade from the surface (tips) | ||
of a tree. One can further apply calculations like mean and standard deviation | ||
to the results. This is more efficient than calling ``accumulate_to_ancestor`` | ||
for multiple times. Also note that the result includes the stem branch of each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the last 'for' which may be removed.
Examples | ||
-------- | ||
>>> from skbio.tree import TreeNode | ||
>>> tree = TreeNode.read(["((a,b)c,(d,e)d,(f,g)c);"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be illustrative to include the tree ASCII art here.
@mataton Thank you! I have addressed your comments. |
This PR follows #2082 and partially addresses #2083. It improves the caching mechanism of
TreeNode
.Previously, every node in a tree has three private attributes:
_tip_cache
,_non_tip_cache
and_registered_caches
. They are largely empty or duplicate. This is very inefficient, especially in memory. #2082 removed the necessity of the former two. The current PR removed the last one. Now they are attached to the root node only in a lazy mode.Specifically,
_tip_cache
,_non_tip_cache
are lookup tables which facilitate searching nodes by name. They are automatically invoked infind
andfind_all
. Meanwhile,_registered_caches
stores names of attributes assigned to each node. These attributes are assigned in a post order traversal using thecache_attr
method.The
cache_attr
method is very useful in saving tree traversal time for many applications. So far, it is only causally used in themajority_rule
function to calculate a consensus tree given multiple trees. The current PR significantly expanded its functionality. Now, one can calculate node attributes such as: 1) number of nodes descending from each node, 2) total sum of branch lengths of each clade, 3) accumulative distances from all tips to each node. Calculating these properties is much faster with the improvedcache_attr
than calling alternative methods individually for multiple times. The docstring of this method has been enriched to provide these examples.TODO: The last important thing is to make cache clearance optional in every single tree manipulation operation. Currently, even appending a single node requires traversing the entire tree to remove the caches. This is very inefficient. Some methods even requires many calls of the clearance operation. All of them need to be optimized to save unnecessary traversals.
At least the following methods need to be improved:
append
,extend
,insert
,remove
,unroot
,pop
,shuffle
. Each method is also used by other methods and they need to be optimized as well. For example,append
is used incopy
,bifurcate
,unpack_by_func
,from_taxonomy
andfrom_linkage_matrix
.Please complete the following checklist:
I have read the contribution guidelines.
I have documented all public-facing changes in the changelog.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the
licenses
directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.This pull request does not include code, documentation, or other content derived from external source(s).
Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.