Skip to content

Avoid associative (and identity) aggregations over the complete partition #176

@lukaseder

Description

@lukaseder

Associative and identity aggregations could be calculated based on the result of the aggregation of the previous value:

v   sum (-3, -1)  current calculation   better
-------------------------------------------------
1   e             e                     e
2   1             1                     e + 1
3   3             1 + 2                 1 + 2
4   6             1 + 2 + 3             3 + 3
5   9                 2 + 3 + 4         6 + 4 - 1
6   12                    3 + 4 + 5     9 + 5 - 2

As the partition is already sorted correctly, we can simply access the (cached) calculated value from lag()

Associative and identity operations currently supported are:

  • count() (optimisation not necessary, as count() is already O(1))
  • sum() (and all derived aggregations)
  • avg() (and all derived aggregations - not strictly associative, but can depend on sum() and count())
  • toList() (a potential optimisation would require an immutable linked list)
  • toString()

More limited optimisation is possible for associative-only aggregations, when using a Long.MIN_VALUE .. 0 frame. These include:

  • min()
  • max()
  • median() (would depend on cached toList())
  • percentile() (would depend on cached toList())
  • allMatch()
  • anyMatch()
  • noneMatch()

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions