Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue: push down computation in distributed query #1108

Closed
Rachelint opened this issue Jul 26, 2023 · 1 comment
Closed

Tracking issue: push down computation in distributed query #1108

Rachelint opened this issue Jul 26, 2023 · 1 comment
Assignees
Labels
A-query-engine Area: Query engine feature New feature or request tracking issue Issue tracks progress for something

Comments

@Rachelint
Copy link
Contributor

Rachelint commented Jul 26, 2023

Describe This Problem

Now, we support the rough disrtibuted sql query by hooking in table scan level, that leading actual computation such as aggregated can't be pushed down...

So, I plan to refactor it, and support distributed query in plan level for pushing down more things.

Proposal

1. Background
The exist implementations can be divided into two ways:

  • Generate explicit distibuted logical plan, and generate distributed physical plan after, like Drios
  • No explicit distributed loigcal plan(can't do it because no schema info?), and generate distributed physical plan directly, like TiDB and Datafusion.

As I see, they are almost same, the more clear way is to have the explicit distributed logical plan but it is the problem about code organization.

The real problem is should we depend on datafusion to do this? If we do it ourself, it may be more controllable? But it may need to design the complete physical plan generating process.

I think we should try to reuse the logic in datafusion first.

2. General
Works can be broken down as following:

  • Generate distributed physical plan according to the original, I think we make it refering to TiDB.
  • Support querying by physical plan in RemoteEngine.

3. Two role of node in proposal
My proposal is designed as folliowing:

  • Scheduler node(responsible for invoking the query, dispatching sub query to executor node, and computing the final result).
  • Executor node(where sub table in, responsible for computing the sub result).

4. Process

  • Scheduler node generates the initial physical plan of partitioned table. In this initial physical plan, the TableScan node is just a placeholder(can't execute actually) with some information for generating later executable plan, so I name it UnresolvePartitionedScan.
  • Scheduler node traverses the initial physical plan, finds the sub plan can be pushed down, and generate the sub plans for remote executing(using the information in UnresolvePartitionedScan). The sub plans are unable to execute like UnresolvePartitionedScan before being sent to and be rewriting in the executor nodes, so I name them UnresolveSubScans.
  • Scheduler node sends the sub plans to executor nodes and wait result, and UnresolveSubScan is converted to ResolvePartitionedScan now.
  • Executor nodes receive the sub plans, and converts the UnresolveSubScan to ResolveSubScan using the carried information and catalog in local.
  • Executor nodes execute the converted sub plans and return the results.

aaa

Additional Context

No response

@Rachelint Rachelint added the feature New feature or request label Jul 26, 2023
@Rachelint Rachelint changed the title Support distributed sql query in plan level Tracking issue: push down computation in distributed query Jul 28, 2023
@Rachelint Rachelint added the tracking issue Issue tracks progress for something label Jul 28, 2023
@Rachelint Rachelint self-assigned this Jul 28, 2023
@Rachelint Rachelint pinned this issue Jul 28, 2023
@jiacai2050 jiacai2050 added A-analytic-engine Area: Analytic Engine A-query-engine Area: Query engine and removed A-analytic-engine Area: Analytic Engine labels Aug 2, 2023
jiacai2050 pushed a commit that referenced this issue Sep 30, 2023
## Rationale
Part of #1108 
New distributed query framework have been impl, we support aggregate
push down in this pr.

## Detailed Changes
+ push down the aggregate node when resolving partitioned scan.
+ support to switch new/old distributed query through http.

## Test Plan
Test by exist tests.
@jiacai2050
Copy link
Contributor

Initial version has been completed, more detailed optimizes will be tracked in new issue.

@tanruixiang tanruixiang unpinned this issue Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-query-engine Area: Query engine feature New feature or request tracking issue Issue tracks progress for something
Projects
None yet
Development

No branches or pull requests

2 participants