Improved tree node cache functionality (2 of 3) #2099

qiyunzhu · 2024-08-04T20:06:28Z

This PR follows #2082 and partially addresses #2083. It improves the caching mechanism of TreeNode.

Previously, every node in a tree has three private attributes: _tip_cache, _non_tip_cache and _registered_caches. They are largely empty or duplicate. This is very inefficient, especially in memory. #2082 removed the necessity of the former two. The current PR removed the last one. Now they are attached to the root node only in a lazy mode.

Specifically, _tip_cache, _non_tip_cache are lookup tables which facilitate searching nodes by name. They are automatically invoked in find and find_all. Meanwhile, _registered_caches stores names of attributes assigned to each node. These attributes are assigned in a post order traversal using the cache_attr method.

The cache_attr method is very useful in saving tree traversal time for many applications. So far, it is only causally used in the majority_rule function to calculate a consensus tree given multiple trees. The current PR significantly expanded its functionality. Now, one can calculate node attributes such as: 1) number of nodes descending from each node, 2) total sum of branch lengths of each clade, 3) accumulative distances from all tips to each node. Calculating these properties is much faster with the improved cache_attr than calling alternative methods individually for multiple times. The docstring of this method has been enriched to provide these examples.

TODO: The last important thing is to make cache clearance optional in every single tree manipulation operation. Currently, even appending a single node requires traversing the entire tree to remove the caches. This is very inefficient. Some methods even requires many calls of the clearance operation. All of them need to be optimized to save unnecessary traversals.

At least the following methods need to be improved: append, extend, insert, remove, unroot, pop, shuffle. Each method is also used by other methods and they need to be optimized as well. For example, append is used in copy, bifurcate, unpack_by_func, from_taxonomy and from_linkage_matrix.

Please complete the following checklist:

I have read the contribution guidelines.
I have documented all public-facing changes in the changelog.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.
- It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
This pull request does not include code, documentation, or other content derived from external source(s).

Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

mataton

My only comments are minor changes to the documentation. The actual code looks good to me.

mataton · 2024-08-05T17:51:48Z

skbio/tree/_tree.py

+
+        Cache a list of all descending tip names on each node. This faciliates the
+        assignment of taxon set under each clade in the tree. It resembles but is more
+        efficient than calling ``subset`` for multiple times.


Can remove the word 'for' from this line.

mataton · 2024-08-05T17:53:55Z

skbio/tree/_tree.py

+
+        Cache the number of nodes per clade. The function ``sum`` is used in place of
+        cache type such that the count will be accumulated. This resembles but is more
+        efficient than calling ``count`` for multiple times.


Same here, 'for' may be removed.

mataton · 2024-08-05T17:55:03Z

skbio/tree/_tree.py

+        7
+
+        Cache the sum of branch lengths per clade. This resembles but is more efficient
+        than calling ``descending_branch_length`` for multiple times. Note: the result


Here as well.

mataton · 2024-08-05T17:55:41Z

skbio/tree/_tree.py

+        clade. This allows one to measure the depth of a clade from the surface (tips)
+        of a tree. One can further apply calculations like mean and standard deviation
+        to the results. This is more efficient than calling ``accumulate_to_ancestor``
+        for multiple times. Also note that the result includes the stem branch of each


I think this is the last 'for' which may be removed.

mataton · 2024-08-05T18:31:18Z

skbio/tree/_tree.py

+        Examples
+        --------
+        >>> from skbio.tree import TreeNode
+        >>> tree = TreeNode.read(["((a,b)c,(d,e)d,(f,g)c);"])


May be illustrative to include the tree ASCII art here.

qiyunzhu · 2024-08-05T19:22:05Z

@mataton Thank you! I have addressed your comments.

qiyunzhu added 5 commits August 3, 2024 21:50

progress

4bf355f

resolved conflicts

ff360ea

merged conflicts

d5805f7

upgraded tree node cache (2 of 3)

b1f21f5

fixed typos

bf7bf88

qiyunzhu requested a review from mataton August 4, 2024 20:27

mataton approved these changes Aug 5, 2024

View reviewed changes

addressed Matt's comments

05ab075

qiyunzhu merged commit a089ac0 into scikit-bio:main Aug 7, 2024
39 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved tree node cache functionality (2 of 3) #2099

Improved tree node cache functionality (2 of 3) #2099

qiyunzhu commented Aug 4, 2024 •

edited

Loading

mataton left a comment

mataton Aug 5, 2024

mataton Aug 5, 2024

mataton Aug 5, 2024

mataton Aug 5, 2024

mataton Aug 5, 2024

qiyunzhu commented Aug 5, 2024

Improved tree node cache functionality (2 of 3) #2099

Improved tree node cache functionality (2 of 3) #2099

Conversation

qiyunzhu commented Aug 4, 2024 • edited Loading

mataton left a comment

Choose a reason for hiding this comment

mataton Aug 5, 2024

Choose a reason for hiding this comment

mataton Aug 5, 2024

Choose a reason for hiding this comment

mataton Aug 5, 2024

Choose a reason for hiding this comment

mataton Aug 5, 2024

Choose a reason for hiding this comment

mataton Aug 5, 2024

Choose a reason for hiding this comment

qiyunzhu commented Aug 5, 2024

qiyunzhu commented Aug 4, 2024 •

edited

Loading