Skip to content

[RFC] Add reverse Command to PPL (Calcite Engine) #3873

@selsong

Description

@selsong

Problem Statement
In many analytical and observability queries, users want to inspect data in reverse order, for example:

  • Viewing the most recent log lines last instead of first.
  • Extracting the last N rows in the dataset.
  • Reversing the effect of a sort without re-specifying fields.
    Currently, PPL does not support a way to reverse the order of results in a query pipeline. This limits query flexibility and forces users to use workarounds or rely on external sorting.

Current State

  • PPL supports sort, which orders records by specified fields, but does not offer a simple way to reverse record order without explicitly specifying sort criteria again
  • There is no current native support for a reverse operation in either Calcite or Spark engines

Long-Term Goals

  • Provide a simple, intuitive mechanism (reverse) for users to invert row order in a PPL pipeline.
  • Establish clear guidelines and patterns for lightweight, non-parametric commands in PPL.
  • Avoid introducing performance bottlenecks by enforcing pushdown optimization or limiting use cases on large datasets.

Proposal

Syntax

<source> | [commands] | reverse | [commands]
  • reverse takes no arguments.
  • It can appear anywhere in the pipeline including the end, and will reverse the order of rows produced up to that point.
  • Unlike sort, which allows field-based sorting, reverse simply inverts the row order as is.

Semantics

Expected Behavior:

  • Action: Reverses the display order of all rows in the current result set
  • Scope: Operates on the entire result set at the point where reverse appears in the pipeline
  • Data Preservation: Does not modify field values or schema

Alternative

  • sort - requires field: helpful but doesn't support reversing arbitrary pipeline outputs, and would require re-specifying field names.
  • reverse works even when no sort is specified, or when you're working with filtered, projected, or unsorted data.

Implementation Discussion
Default Implementation: To implement reverse, we insert a logical sort with a synthetic column based on ROW_NUMBER():

LogicalSort(reverse_row_num DESC)
  Project(..., reverse_row_num = ROW_NUMBER() OVER ())
    [upstream operator]

Pushdown Implementation: (TBD, will be in 2nd followup PR)
The optimization is triggered when the logical plan has this exact shape:

LogicalSort (reverse sort)
  CalciteLogicalIndexScan

Conditions for pushdown [not implemented yet]:

  • Reverse should be able to be pushed down when preceded by sort by flipping the direction of the sort from asc to desc or vice versa.
  • However, pushdown is disabled if a filter or limit appears before reverse, or if the reverse requires a local fallback implementation using ROW_NUMBER().

Example Pushdown Optimization:

curl -X POST "localhost:9200/_plugins/_ppl/_explain" \
-H "Content-Type: application/json" \
-d '{
  "query": "source = employees | reverse | head 5"
}'

If reverse appears directly after the source, it matches this optimized pattern:

Sort DESC (fetch=5)
  IndexScan

This enables pushdown of the reverse and head commands into the scan phase.

   LogicalProject(skills=[$0], performance_score=[$1], level=[$2], employee_id=[$3], name=[$4], hire_date=[$5], department=[$6], remote=[$7], salary=[$8], age=[$9], email=[$10])
  LogicalSort(sort0=[$11], dir0=[DESC], fetch=[5])
    CalciteLogicalIndexScan(table=[[OpenSearch, employees]])

Engine Support

  • Currently implemented only in Calcite engine, not supported in Spark engine
  • Once [FEATURE] Export PPL-Calcite Engine as Reusable Library (#3734) is completed, reverse can be reused in the Spark engine.

Performance

  • When reverse follows a sort on large datasets (ex: 20K) the sort is applied to all 20K values, then the reverse is applied to all 20K values properly, then on that final result OpenSearch returns a limited 10K values due to the automatically imposed limit, as if head 10000 was applied after the reverse.

Open Question: Semantics When sort Precedes reverse

Query:

curl -X POST "localhost:9200/_plugins/_ppl/_explain" \
-H "Content-Type: application/json" \
-d '{
  "query": "source = employees | sort salary | reverse"
}'

Logical Plan:

LogicalProject(...)
  LogicalSort(sort0=[$17], dir0=[DESC])
    LogicalProject(..., reverse_row_num=[ROW_NUMBER() OVER ()])
      LogicalSort(sort0=[$8], dir0=[ASC-nulls-first])
        CalciteLogicalIndexScan(table=[[OpenSearch, employees]])

Answer: The first result returned is the highest salary as expected, which is 20000 in this case.
source = employees | sort salary | reverse | head 1 will return the item with salary = 20000. We delay head or limit pushdown until after reverse, ensuring the entire result set is available to reverse when appropriate, avoiding truncated or misleading results.

Open Question

  • Should reverse pushdown be disabled or restricted when preceded by sort or multiple sorts to avoid partial results and incorrect output?
  • Should usage of reverse be limited to queries bounded by head or similar limits?
  • Should a configuration flag exist to disable or warn about reverse usage on large unbounded result sets?

Metadata

Metadata

Assignees

No one assigned

    Labels

    PPLPiped processing language

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions