Tracking issue: push down computation in distributed query #1108

Rachelint · 2023-07-26T09:34:10Z

Describe This Problem

Currently we support only a rough form of distributed SQL query, implemented by hooking in at the table scan level. As a result, actual computation such as aggregation can't be pushed down...

So I plan to refactor it and support distributed query at the plan level, so that more computation can be pushed down.
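
To make the motivation concrete, here is a minimal Python sketch (not CeresDB code; the data and the function names `scan_level_sum` and `pushed_down_sum` are hypothetical) that compares how many rows have to be shipped when only the scan is distributed with how many are shipped when a sum is computed next to the data.

```python
# Hypothetical data: each inner list holds the (group key, value) rows of one sub table.
partitions = [
    [("host-a", 1.0), ("host-b", 2.0), ("host-a", 3.0)],
    [("host-a", 4.0), ("host-b", 5.0)],
    [("host-b", 6.0), ("host-b", 7.0), ("host-a", 8.0)],
]


def scan_level_sum(partitions):
    """Only the scan is distributed: every raw row is shipped, then summed centrally."""
    shipped = [row for part in partitions for row in part]
    totals = {}
    for key, value in shipped:
        totals[key] = totals.get(key, 0.0) + value
    return totals, len(shipped)


def pushed_down_sum(partitions):
    """The sum is pushed down: each sub table ships one partial row per group."""
    shipped = []
    for part in partitions:
        partial = {}
        for key, value in part:
            partial[key] = partial.get(key, 0.0) + value
        shipped.extend(partial.items())
    # The central node only merges the partial sums.
    totals = {}
    for key, value in shipped:
        totals[key] = totals.get(key, 0.0) + value
    return totals, len(shipped)


scan_totals, scan_rows = scan_level_sum(partitions)
push_totals, push_rows = pushed_down_sum(partitions)
print(scan_totals == push_totals, scan_rows, push_rows)
```

With this toy data both strategies agree on the result, while the pushed-down variant ships 6 rows instead of 8. The gap grows with the number of rows per group in each sub table.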

Proposal

1. Background
Existing implementations fall into two approaches:

Generate an explicit distributed logical plan, then generate the distributed physical plan from it, like Doris.
No explicit distributed logical plan (perhaps not possible because there is no schema info?); generate the distributed physical plan directly, like TiDB and DataFusion.

As I see it, the two approaches are almost the same. Having an explicit distributed logical plan is the clearer way, but the difference is mostly a matter of code organization.

The real question is whether we should depend on DataFusion to do this. Doing it ourselves may be more controllable, but it may require designing the complete physical plan generation process.

I think we should try to reuse the logic in DataFusion first.

2. General
The work can be broken down as follows:

Generate the distributed physical plan from the original one; I think we can do this by referring to TiDB.
Support querying by physical plan in RemoteEngine.

3. Two roles of node in the proposal
My proposal is designed as follows:

Scheduler node (responsible for invoking the query, dispatching sub queries to executor nodes, and computing the final result).
Executor node (where the sub tables reside; responsible for computing the sub results).

4. Process

The scheduler node generates the initial physical plan of the partitioned table. In this initial plan, the TableScan node is just a placeholder (it can't actually be executed) carrying the information needed to generate the executable plan later, so I name it UnresolvePartitionedScan.
The scheduler node traverses the initial physical plan, finds the sub plans that can be pushed down, and generates the sub plans for remote execution (using the information in UnresolvePartitionedScan). Like UnresolvePartitionedScan, these sub plans can't be executed until they have been sent to and rewritten on the executor nodes, so I name them UnresolveSubScans.
The scheduler node sends the sub plans to the executor nodes and waits for the results; at this point UnresolveSubScan is converted to ResolvePartitionedScan.
The executor nodes receive the sub plans and convert each UnresolveSubScan to a ResolveSubScan using the carried information and the local catalog.
The executor nodes execute the converted sub plans and return the results.
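
The plan rewriting in the steps above can be sketched roughly as follows. This is an illustrative Python model, not the actual implementation: the `Node` class, the `PUSHABLE` set, the helper functions, and the sub table names `t_0`/`t_1` are all hypothetical, plans are restricted to a single child per node, and it assumes that ResolvePartitionedScan replaces the placeholder scan together with the pushed-down nodes on the scheduler side. Only the four scan names come from the proposal.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Assumption: node kinds considered safe to run on an executor node in this sketch.
PUSHABLE = {"Filter", "Projection"}


@dataclass(frozen=True)
class Node:
    kind: str                           # e.g. "Sort", "Filter", or one of the scan kinds
    child: Optional["Node"] = None      # linear plans only, to keep the sketch short
    sub_tables: Tuple[str, ...] = ()    # carried by UnresolvePartitionedScan
    table: str = ""                     # carried by the sub scans
    sub_plans: Tuple["Node", ...] = ()  # carried by ResolvePartitionedScan


def is_pushable_chain(node: Node) -> bool:
    """True if `node` and everything below it can run on an executor node."""
    if node.kind == "UnresolvePartitionedScan":
        return True
    return node.kind in PUSHABLE and node.child is not None and is_pushable_chain(node.child)


def build_sub_plan(node: Node, sub_table: str) -> Node:
    """Copy a pushable chain, ending in an UnresolveSubScan for one sub table."""
    if node.kind == "UnresolvePartitionedScan":
        return Node("UnresolveSubScan", table=sub_table)
    return replace(node, child=build_sub_plan(node.child, sub_table))


def resolve_on_scheduler(node: Node) -> Node:
    """Scheduler side: swap the largest pushable sub plan for a ResolvePartitionedScan."""
    if is_pushable_chain(node):
        scan = node
        while scan.kind != "UnresolvePartitionedScan":
            scan = scan.child
        subs = tuple(build_sub_plan(node, t) for t in scan.sub_tables)
        return Node("ResolvePartitionedScan", sub_plans=subs)
    if node.child is None:
        return node
    return replace(node, child=resolve_on_scheduler(node.child))


def resolve_on_executor(node: Node, local_catalog: set) -> Node:
    """Executor side: turn UnresolveSubScan into ResolveSubScan via the local catalog."""
    if node.kind == "UnresolveSubScan":
        if node.table not in local_catalog:
            raise LookupError(f"sub table {node.table!r} not found locally")
        return replace(node, kind="ResolveSubScan")
    if node.child is None:
        return node
    return replace(node, child=resolve_on_executor(node.child, local_catalog))


def render(node: Optional[Node]) -> str:
    """Render a linear plan as 'Parent -> Child' for inspection."""
    parts = []
    while node is not None:
        parts.append(node.kind)
        node = node.child
    return " -> ".join(parts)


# Initial plan with a placeholder scan, then the scheduler and executor rewrites.
initial = Node("Sort", Node("Filter", Node("UnresolvePartitionedScan", sub_tables=("t_0", "t_1"))))
scheduler_plan = resolve_on_scheduler(initial)
executor_plan = resolve_on_executor(scheduler_plan.child.sub_plans[0], {"t_0"})
print(render(initial))
print(render(scheduler_plan))
print(render(executor_plan))
```

In this model the `Sort` stays on the scheduler node because it is not in `PUSHABLE`, while the `Filter` travels with each sub plan to the executor node that owns the sub table.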

Additional Context

No response


## Rationale
Part of #1108

The new distributed query framework has been implemented; this PR adds support for aggregate push down.

## Detailed Changes
+ Push down the aggregate node when resolving the partitioned scan.
+ Support switching between the new and old distributed query through HTTP.

## Test Plan
Tested by existing tests.
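
As a general illustration of why an aggregate has to be split when it is pushed down, here is a small Python sketch of two-phase aggregation for an average. It is not taken from the PR: `partial_avg`, `final_avg`, and the sample data are hypothetical, and it only assumes the common partial/final scheme in which each executor returns mergeable state rather than a finished value.

```python
from collections import defaultdict


def partial_avg(rows):
    """Executor side: keep a mergeable (sum, count) state per group."""
    state = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        state[key][0] += value
        state[key][1] += 1
    return {key: (s, c) for key, (s, c) in state.items()}


def final_avg(partials):
    """Scheduler side: merge the partial states, then finish the average."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: s / c for key, (s, c) in merged.items()}


# Hypothetical sub tables with unequal row counts.
sub_tables = [
    [("cpu", 10.0), ("cpu", 20.0), ("cpu", 30.0)],
    [("cpu", 100.0)],
]

result = final_avg(partial_avg(rows) for rows in sub_tables)

# Averaging the per-sub-table averages gives a different (wrong) answer.
naive = sum(sum(v for _, v in rows) / len(rows) for rows in sub_tables) / len(sub_tables)
print(result, naive)
```

Here the merged result is 40.0 for `cpu`, whereas averaging the two local averages would give 60.0, which is why the pushed-down part returns sums and counts instead of final values.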

jiacai2050 · 2023-09-30T08:18:02Z

The initial version has been completed; more detailed optimizations will be tracked in a new issue.

Rachelint added the feature New feature or request label Jul 26, 2023

Rachelint changed the title ~~Support distributed sql query in plan level~~ Tracking issue: push down computation in distributed query Jul 28, 2023

Rachelint added the tracking issue Issue tracks progress for something label Jul 28, 2023

Rachelint self-assigned this Jul 28, 2023

Rachelint pinned this issue Jul 28, 2023

Rachelint mentioned this issue Jul 28, 2023

Support remote scan in physical plan level #1112

Closed

4 tasks

jiacai2050 added A-analytic-engine Area: Analytic Engine A-query-engine Area: Query engine and removed A-analytic-engine Area: Analytic Engine labels Aug 2, 2023

Rachelint mentioned this issue Sep 27, 2023

feat: support aggr push down in distributed query #1232

Merged

jiacai2050 closed this as completed Sep 30, 2023

tanruixiang unpinned this issue Oct 23, 2023
