Skip to content

Conversation

@LantaoJin
Copy link
Member

@LantaoJin LantaoJin commented Apr 27, 2025

Description

Support eventstats command with Calcite

The eventstats command actually equals to window function in SQL

source=t | eventstats avg(a), count() by b

equals to

SELECT avg(a) OVER (PARTITION BY b), count(*) OVER (PARTITION BY b) FROM t

And the default window frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

In this PR, the supported window functions are

  • max
  • min
  • sum
  • count
  • avg
  • var_samp
  • var_pop
  • stddev_samp
  • stddev_pop

We will support more window functions in follow-up PRs since they are not supported in PPL-Spark

   : ROW_NUMBER
   | RANK
   | DENSE_RANK
   | PERCENT_RANK
   | CUME_DIST
   | FIRST
   | LAST
   | NTH
   | NTILE

Related Issues

Resolves #3563

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin added PPL Piped processing language calcite calcite migration releated labels Apr 27, 2025
Signed-off-by: Lantao Jin <ltjin@amazon.com>

private final UnresolvedExpression windowFunction;

private final List<UnresolvedExpression> windowFunctionList;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to add window boundary? For example, I want to execute a ppl equal to a sql like
select AVG(SUM(total_amount)) OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3day (this is one of the moving avg, which we plan to support for t2visbuilder).

Copy link
Member Author

@LantaoJin LantaoJin Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's easy to support it. but not this PR of eventstats command. We can extend this syntax after PPL lang unified. e.g support boundary when we implement streamstats command.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a WindowFrame for further using.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
* ``eventstats``: Useful when you need to enrich events with statistical context for further analysis or filtering. Can be used mid-search to add statistics that can be used in subsequent commands.


Version
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!
Could we highlight version info? for instance, in title, evenstats (since 3.1.0). or maybe using version controled doc? Any thoughts? @dai-chen @qianheng-aws

Copy link
Member Author

@LantaoJin LantaoJin Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

highlighted in description and docs/user/ppl/index.rst


Example::

PPL> source=accounts | eventstats sum(age) by gender;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an example explain how to handle missing value?

Comment on lines 396 to 399
if (BuiltinFunctionName.AVG == agg.get()) {

} else {
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +56 to +59
rows("John", "Canada", "Ontario", 4, 2023, 25, 4, 36.25, 20, 70),
rows("Jake", "USA", "California", 4, 2023, 70, 4, 36.25, 20, 70),
rows("Jane", "Canada", "Quebec", 4, 2023, 20, 4, 36.25, 20, 70),
rows("Hello", "USA", "New York", 4, 2023, 30, 4, 36.25, 20, 70));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in scheam, 2nd column is age, but in result, 2nd column is country?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data order is the actually schema order. Let me change to verifySchemaInOrder

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin requested review from dai-chen, penghuo and xinyual May 1, 2025 13:21
penghuo
penghuo previously approved these changes May 2, 2025
@penghuo
Copy link
Collaborator

penghuo commented May 2, 2025

@dai-chen please take another look.


static RexNode makeOver(
CalcitePlanContext context,
String name,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] Why not just pass in BuiltinFunctionName here?

Copy link
Member Author

@LantaoJin LantaoJin May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

RexWindowBound lowerBound = convert(context, windowFrame.getLower());
RexWindowBound upperBound = convert(context, windowFrame.getUpper());
switch (functionName) {
// There is no "avg" AggImplementor in Calcite, we have to change avg window
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a Calcite rule AggregateReduceFunctionsRule, it will do such transformation of rewriting avg to sum/count. Could we just use avg here and leverage that rule for transformation? It should introduce more refined handling for cases like NULL values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for informing this, let me check how to apply this rule.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After checking AggregateReduceFunctionsRule code, seems it could be applied by logical Aggregate, rather than logical Window. The converting made by SqlNode level (parser). We can open a follow-up ticket about self-dev rule for window.

}
}

public void pushWindowPartitions(List<RexNode> partition) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems never used for these 3 APIs. And why should the window related context stored in this global context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will delete them. It is useless code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.map(
groupCtx ->
(UnresolvedExpression)
new Alias(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] Why do we always need to wrap byClause with Alias here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be a followup and it needs much code refactor for stats, not sure the original reason.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
qianheng-aws
qianheng-aws previously approved these changes May 6, 2025
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin requested review from penghuo and qianheng-aws May 6, 2025 13:17
@LantaoJin LantaoJin merged commit e51fcd9 into opensearch-project:main May 7, 2025
22 checks passed
penghuo pushed a commit that referenced this pull request Jun 16, 2025
* Support eventstats command with Calcite

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* add doc

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* fix IT

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* address comments

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Fix conflicts

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* address comments

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Support variance functions

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* fix UT

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* update doc

Signed-off-by: Lantao Jin <ltjin@amazon.com>

---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

calcite calcite migration releated PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Support eventstats command with Calcite

5 participants