[SPARK-49919][SQL] Add special limits support for return content as JSON dataset #48407
What changes were proposed in this pull request?
`CollectLimitExec` is used when a logical `Limit` and/or `Offset` operation is the final operator. Compared with `GlobalLimitExec`, it avoids shuffling data to a single output partition.
However, when the dataset is collected as a Dataset of JSON strings, `CollectLimitExec` and `TakeOrderedAndProjectExec` cannot be applied because the `SpecialLimits` strategy does not work as expected.
For example, consider a simple select-limit query.
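A minimal Scala sketch for comparing the physical plans locally. It is not the example from this PR: the local `SparkSession`, the object name `ToJsonLimitPlan`, and the `id` column produced by `spark.range` are assumptions made for illustration. The sketch only prints the plans for inspection and asserts nothing about them.

```scala
import org.apache.spark.sql.SparkSession

object ToJsonLimitPlan {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect physical plans.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("tojson-limit-plan")
      .getOrCreate()

    // Hypothetical input: 1000 rows spread over 4 partitions.
    val df = spark.range(0, 1000, 1, 4).toDF("id")

    // select-limit, with and without toJSON
    df.limit(10).explain()
    df.limit(10).toJSON.explain()

    // select-sort-limit, with and without toJSON
    df.orderBy("id").limit(10).explain()
    df.orderBy("id").limit(10).toJSON.explain()

    spark.stop()
  }
}
```

When reading the output, a shuffle would typically show up as an `Exchange` node in the printed plan.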
When the `toJSON` method is added, the plan changes unexpectedly and introduces a shuffle.
Why are the changes needed?
Without this patch, a simple "select limit" or "select sort limit" query introduces a shuffle when its result is returned as a JSON dataset, because neither `CollectLimitExec` nor `TakeOrderedAndProjectExec` can be applied. Because `Dataset.toJSON` is a fundamental API, this leads to poor performance in many scenarios.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT added.
Was this patch authored or co-authored using generative AI tooling?
No