Rolling window queries #2

ilya-biryukov · 2021-08-20T15:38:44Z

CubeStore extension to implement rolling window measures in CubeJS.
Tests are in the CubeStore repository.

The syntax is of the form:

SELECT dim, key1, key2, other,
       ROLLING(SUM(x) RANGE BETWEEN 7 PRECEDING AND UNBOUNDED FOLLOWING),
       ROLLING(AVG(x) RANGE BETWEEN 7 PRECEDING AND UNBOUNDED FOLLOWING)
FROM input
ROLLING_WINDOW
  DIMENSION dim
  PARTITION BY key1, key2
  FROM 0 TO 10 EVERY 2

Semantics are roughly:

- compute rolling window aggregations over the input data;
- the window "rolls over" the `DIMENSION` column, and only the values in the range defined by `FROM .. TO .. EVERY ..` are reported;
- each "group" defined by the columns in `PARTITION BY` is handled and reported independently.
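The semantics above can be modelled with a small reference function. This is only a sketch in plain Python, not the CubeStore implementation: the name `rolling_sum`, the row layout `(key, dim, x)`, the inclusive `TO` bound, the inclusive lower frame bound, and reporting `0` for an empty window are all assumptions made for illustration.

```python
from collections import defaultdict


def rolling_sum(rows, start, end, every, preceding):
    """Hypothetical model of
    ROLLING(SUM(x) RANGE BETWEEN <preceding> PRECEDING AND UNBOUNDED FOLLOWING).

    rows: iterable of (key, dim, x) tuples, where `key` plays the role of the
    PARTITION BY columns and `dim` the DIMENSION column.
    Returns {key: [(point, sum), ...]} with one entry per reported point.
    """
    # Each PARTITION BY group is handled independently.
    by_key = defaultdict(list)
    for key, dim, x in rows:
        by_key[key].append((dim, x))

    # Assumption: FROM .. TO .. EVERY .. includes the TO bound.
    points = range(start, end + 1, every)

    result = {}
    for key, values in by_key.items():
        result[key] = [
            # Assumption: the frame keeps every row with dim >= point - preceding.
            # An empty frame sums to 0 here; a real engine may report NULL instead.
            (p, sum(x for dim, x in values if dim >= p - preceding))
            for p in points
        ]
    return result


rows = [("a", 1, 10), ("a", 5, 20), ("a", 9, 30), ("b", 4, 1)]
out = rolling_sum(rows, start=0, end=10, every=2, preceding=2)
```

Only the points 0, 2, 4, 6, 8 and 10 appear in the output for each key, regardless of which `dim` values exist in the input.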

Current limitations:

- only integer ranges (timestamps and intervals are coming),
- only ranges with up to 10M points are supported, to avoid infinite loops or accidental DoS. This is still a fairly large limit and can lead to DoS given enough input data.
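To make the point limit concrete, here is a minimal sketch of how the number of points in a `FROM .. TO .. EVERY ..` range could be checked before any work is done. The function name `count_points`, the error messages, and the inclusive `TO` bound are assumptions; the actual check in CubeStore may look different.

```python
MAX_POINTS = 10_000_000  # the 10M limit quoted in the description


def count_points(start, end, every):
    """Return the number of points in the range, or raise if it is unusable.

    Hypothetical helper: assumes integer bounds and an inclusive TO bound.
    """
    if every <= 0:
        # A non-positive step would never reach the end of the range.
        raise ValueError("EVERY must be positive")
    if end < start:
        return 0
    n = (end - start) // every + 1
    if n > MAX_POINTS:
        raise ValueError(f"range has {n} points, limit is {MAX_POINTS}")
    return n
```

With the range from the syntax example, `count_points(0, 10, 2)` yields 6 points. Note that the output size is roughly the number of points times the number of partitions, which is why the limit can still be too generous for large inputs.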





github-actions bot added the datafusion label Aug 20, 2021

ilya-biryukov added 2 commits August 20, 2021 18:39

Timestamp dimensions for rolling window queries

168d932

This required the `date_add` implementation from CubeStore, which is now in the DataFusion repository. Also allow `PRECEDING` and `FOLLOWING` on both window frame bounds.
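A rough sketch of what these two changes mean for the semantics, in plain Python rather than DataFusion code. Stepping through a timestamp range needs date arithmetic (the role `date_add` plays), and a frame whose bounds may each be `PRECEDING` or `FOLLOWING` can be modelled with signed offsets. The names `timestamp_points` and `in_frame`, the inclusive bounds, and the use of fixed-length `timedelta` steps (no calendar months) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def timestamp_points(start, end, every):
    """Points of a FROM .. TO .. EVERY .. range over timestamps (TO assumed inclusive)."""
    if every <= timedelta(0):
        raise ValueError("EVERY must be positive")
    points = []
    current = start
    while current <= end:
        points.append(current)
        current = current + every  # stands in for date_add(current, every)
    return points


def in_frame(value, point, lower, upper):
    """Check whether `value` falls into the frame around `point`.

    `lower` and `upper` are signed offsets: negative means PRECEDING,
    positive means FOLLOWING, and None means UNBOUNDED.
    """
    lower_ok = lower is None or value >= point + lower
    upper_ok = upper is None or value <= point + upper
    return lower_ok and upper_ok


points = timestamp_points(datetime(2021, 8, 1), datetime(2021, 8, 7), timedelta(days=2))

# Models RANGE BETWEEN 3 days PRECEDING AND 1 day PRECEDING around 2021-08-05.
day = timedelta(days=1)
inside = in_frame(datetime(2021, 8, 2), datetime(2021, 8, 5), -3 * day, -1 * day)
outside = in_frame(datetime(2021, 8, 5), datetime(2021, 8, 5), -3 * day, -1 * day)
```

Here both frame bounds are `PRECEDING`, so the current point itself is excluded from its own window.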

ilya-biryukov force-pushed the cs-rolling branch from f4c2f19 to 168d932 Compare August 20, 2021 15:40

ilya-biryukov merged commit a365e19 into cube Aug 20, 2021

MazterQyou pushed a commit that referenced this pull request Feb 17, 2023

Optimize regex_replace for scalar patterns (apache#3614)

15c19c3

* Optimize `regex_replace` for scalar patterns * Change the hot-path on `regexp_replace` to only variadic source (#2)
