Skip to content

Improve non-GCC implementation of statistics functions Quantile/Rank #77

Closed
@AdamGlustein

Description

@AdamGlustein

Is your feature request related to a problem? Please describe.
Due to the policy-based data structures (PBDS) library being gcc specific, I had to change the implementation of Quantile and Rank in cpp/cppnodes/statsimpl.cpp to not use an order statistics tree. Instead, they use a std::multiset to store sorted values, and iterate on that set in linear time to find each quantile/rank when triggered. This is much more inefficient than the O(log n) lookup in the PBDS tree.

Describe the solution you'd like
It would be helpful if someone implemented a balanced (red-black) order statistics tree (OST) natively in csp. The OST is a binary search tree with the additional invariant that each node contains the size of its own subtree (https://en.wikipedia.org/wiki/Order_statistic_tree). This allows for efficient (log n time) insert, find, delete, and index-based lookup.

Describe alternatives you've considered
There are actually 2 data structures that will achieve O(log n) insert, delete and index-based lookup

  1. The order statistics tree mentioned above
  2. An indexable skiplist. This is what pandas uses for their rolling quantiles. However, pandas has the advantage of static data, so they can size it up front knowing the maximum size of the data interval. In a time-windowed csp stats node, we do not know the maximum number of ticks which will fall in an interval, so sizing becomes a bit trickier (I think we'd have to resize/add links dynamically at powers of 2). Here is a link explaining the impl https://rhettinger.wordpress.com/2010/02/06/lost-knowledge/.

Additional context
This would be helpful for any non gcc users and would also make the code more readable since we're no longer using 2 different approaches for the same stats calculation.

Activity

added
good first issueGood issue for first-time contributors
type: featureIssues and PRs related to new features
lang: c++Issues and PRs related to the C++ codebase
on Feb 13, 2024
AdamGlustein

AdamGlustein commented on Feb 13, 2024

@AdamGlustein
CollaboratorAuthor

May be able to leverage the boost multindex datat structure for this:
https://www.boost.org/doc/libs/1_75_0/libs/multi_index/doc/tutorial/indices.html#rnk_indices

added a commit that references this issue on Jul 15, 2024
c2be3cf
added a commit that references this issue on Jul 20, 2024
c640f04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    good first issueGood issue for first-time contributorslang: c++Issues and PRs related to the C++ codebasetype: featureIssues and PRs related to new features

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Participants

    @AdamGlustein

    Issue actions

      Improve non-GCC implementation of statistics functions Quantile/Rank · Issue #77 · Point72/csp