Skip to content

[RFC] Implement streamstats command in PPL #4207

@ishaoxy

Description

@ishaoxy

Problem Statement

Now OpenSearch PPL lacks the implementation of streamstats. This streamstats command is conceptually different from the traditional stats command because it performs cumulative or rolling statistical calculations in a streaming manner as events are processed, rather than computing final aggregates after the entire dataset is scanned.

While OpenSearch PPL already supports aggregation through the stats and eventstats command, it does not currently provide streaming aggregations or window-based cumulative metrics similar to streamstats.

Differences between stats, eventstats and streamstats

All of these commands can be used to generate aggregations such as average, sum, and maximum. The key difference is that stats in PPL aggregates data globally or by groups in a single result row per group, while eventstats performs similar aggregations but retains the original events and appends the aggregated values as new fields. Whereas streamstats updates and outputs running calculations for every incoming event, making it especially useful for cumulative or time-series analyses.

The differences between these commands are summarized in the following table:

Feature stats eventstats streamstats
Transformation Behavior Converts all events into an aggregation result table, removing original event structure. Adds aggregation results as new fields to the original events while keeping the original structure. Calculates cumulative or sliding statistics event-by-event within the stream and adds them as new fields.
Output Format Outputs only aggregated values, does not retain original events. Retains original events and adds overall summary fields. Retains original events and adds running or window-based statistics fields.
Aggregation Scope Based on all events in the search (or groups defined by BY). Based on all relevant events, then the result is added back to each event in the group. Computes statistics incrementally in event order; supports window, time_window, or reset conditions to limit the scope.
Use Cases When only aggregated results are needed (e.g., total count, average). When aggregated statistics are needed alongside original event data. When cumulative or rolling statistics are required (e.g., running totals, moving averages).
Key Arguments BY <field> for grouping. Same as stats, supports BY and allnum. Additional options: window, time_window, reset_before, reset_after, reset_on_change, current, etc.
Performance Considerations Reduces data volume significantly. Higher memory usage because all events are retained; controlled by max_mem_usage_mb. Stream-based processing; memory usage depends on window size and reset strategy. Controlled by max_mem_usage_mb and maxresultrows.

Syntax

streamstats
    [reset_before="("<eval-expression>")"]
    [reset_after="("<eval-expression>")"]
    [current=<bool>]
    [window=<int>]
    [global=<bool>]
    <stats-agg-term>...
    [<by-clause>]

Required arguments

  • stats-agg-term

Optional arguments

  • by-clause
  • current
  • window
  • global
  • reset_after
  • reset_before

(more details will be in the "docs/user/ppl/cmd/streamstats.rst" TBD)

Usage Examples

Basic Streaming Calculation

-- Running count of consecutive error minutes
... | streamstats count(eval(errorCount>0)) as consecutiveErrorMinutes

Optional Arguments Usage

-- Disable current event contribution and group by host
... | streamstats current=f values(consecutiveErrorMinutes) as previousResults by host

-- Sliding window average over the last 10 events
... | streamstats window=10 avg(count) as avg_count

-- Conditional reset before calculation
... | streamstats count(eval(errorCount>0)) as consecutiveErrorMinutes by source reset_before="REDACTED"match(errorCount)

-- Conditional reset after calculation
... | streamstats reset_after=total_errors<10 c as running_count

-- Group-scoped sliding window with global=f
... | streamstats current=f window=2 global=f last(current_value) as prev_value

Implement RFC

  1. For the Boolean type of current, opensearch strictly controls the input to true/false, but a large number of use cases use “current = f”. Do we need to adapt it?

  2. How to implement arguments reset?…… TBD. dynamic offset/ session partition

  3. Strange behavior for argument global
    -According to the doc, the global argument only works when both the window and by are set. In this case, we should set global to false to achieve group-by-group windowing. However, if global is set to true (or does nothing, as it defaults to true), how should we interpret the query's behavior? Attached below are the documentation and my experimental query results, which leave me somewhat puzzled.

Image Image Image Image

Metadata

Metadata

Assignees

Labels

PPLPiped processing language

Type

No type

Projects

Status

Not Started

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions