-
Notifications
You must be signed in to change notification settings - Fork 181
Description
Problem Statement
In many analytical and observability queries, users want to inspect data in reverse order, for example:
- Viewing the most recent log lines last instead of first.
- Extracting the last N rows in the dataset.
- Reversing the effect of a sort without re-specifying fields.
Currently, PPL does not support a way to reverse the order of results in a query pipeline. This limits query flexibility and forces users to use workarounds or rely on external sorting.
Current State
- PPL supports
sort, which orders records by specified fields, but does not offer a simple way to reverse record order without explicitly specifying sort criteria again - There is no current native support for a reverse operation in either Calcite or Spark engines
Long-Term Goals
- Provide a simple, intuitive mechanism (reverse) for users to invert row order in a PPL pipeline.
- Establish clear guidelines and patterns for lightweight, non-parametric commands in PPL.
- Avoid introducing performance bottlenecks by enforcing pushdown optimization or limiting use cases on large datasets.
Proposal
Syntax
<source> | [commands] | reverse | [commands]
reversetakes no arguments.- It can appear anywhere in the pipeline including the end, and will reverse the order of rows produced up to that point.
- Unlike
sort, which allows field-based sorting,reversesimply inverts the row order as is.
Semantics
Expected Behavior:
- Action: Reverses the display order of all rows in the current result set
- Scope: Operates on the entire result set at the point where
reverseappears in the pipeline - Data Preservation: Does not modify field values or schema
Alternative
sort -requires field: helpful but doesn't support reversing arbitrary pipeline outputs, and would require re-specifying field names.reverseworks even when no sort is specified, or when you're working with filtered, projected, or unsorted data.
Implementation Discussion
Default Implementation: To implement reverse, we insert a logical sort with a synthetic column based on ROW_NUMBER():
LogicalSort(reverse_row_num DESC)
Project(..., reverse_row_num = ROW_NUMBER() OVER ())
[upstream operator]
Pushdown Implementation: (TBD, will be in 2nd followup PR)
The optimization is triggered when the logical plan has this exact shape:
LogicalSort (reverse sort)
CalciteLogicalIndexScan
Conditions for pushdown [not implemented yet]:
- Reverse should be able to be pushed down when preceded by sort by flipping the direction of the sort from asc to desc or vice versa.
- However, pushdown is disabled if a filter or limit appears before reverse, or if the reverse requires a local fallback implementation using ROW_NUMBER().
Example Pushdown Optimization:
curl -X POST "localhost:9200/_plugins/_ppl/_explain" \
-H "Content-Type: application/json" \
-d '{
"query": "source = employees | reverse | head 5"
}'
If reverse appears directly after the source, it matches this optimized pattern:
Sort DESC (fetch=5)
IndexScan
This enables pushdown of the reverse and head commands into the scan phase.
LogicalProject(skills=[$0], performance_score=[$1], level=[$2], employee_id=[$3], name=[$4], hire_date=[$5], department=[$6], remote=[$7], salary=[$8], age=[$9], email=[$10])
LogicalSort(sort0=[$11], dir0=[DESC], fetch=[5])
CalciteLogicalIndexScan(table=[[OpenSearch, employees]])
Engine Support
- Currently implemented only in Calcite engine, not supported in Spark engine
- Once [FEATURE] Export PPL-Calcite Engine as Reusable Library (#3734) is completed,
reversecan be reused in the Spark engine.
Performance
- When reverse follows a sort on large datasets (ex: 20K) the sort is applied to all 20K values, then the reverse is applied to all 20K values properly, then on that final result OpenSearch returns a limited 10K values due to the automatically imposed limit, as if head 10000 was applied after the reverse.
Open Question: Semantics When sort Precedes reverse
Query:
curl -X POST "localhost:9200/_plugins/_ppl/_explain" \
-H "Content-Type: application/json" \
-d '{
"query": "source = employees | sort salary | reverse"
}'
Logical Plan:
LogicalProject(...)
LogicalSort(sort0=[$17], dir0=[DESC])
LogicalProject(..., reverse_row_num=[ROW_NUMBER() OVER ()])
LogicalSort(sort0=[$8], dir0=[ASC-nulls-first])
CalciteLogicalIndexScan(table=[[OpenSearch, employees]])
Answer: The first result returned is the highest salary as expected, which is 20000 in this case.
source = employees | sort salary | reverse | head 1 will return the item with salary = 20000. We delay head or limit pushdown until after reverse, ensuring the entire result set is available to reverse when appropriate, avoiding truncated or misleading results.
Open Question
- Should
reversepushdown be disabled or restricted when preceded bysortor multiple sorts to avoid partial results and incorrect output? - Should usage of
reversebe limited to queries bounded byheador similar limits? - Should a configuration flag exist to disable or warn about
reverseusage on large unbounded result sets?
Metadata
Metadata
Assignees
Labels
Type
Projects
Status