Skip to content
This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

sezruby
Copy link
Collaborator

@sezruby sezruby commented Sep 11, 2020

What changes were proposed in this pull request?

For full context on the "why" for this PR, please see the main issue: #150

Introducing a new BucketUnion operator that is useful for implementing hybrid scan, a technique that we propose that can leverage the index and appended data without the need to re-shuffle the index, thus preserving the benefits of the index.

As part of this, the following classes are being implemented:

  1. BucketUnion: Logical Plan operator; Used during logical plan optimization when the newly appended data needs to be union-ed with the data being read from the index.
  2. BucketUnionExec: SparkPlan (Physical operator); Calls into BucketUnionRDD
  3. BucketUnionRDD: RDD operator
  4. BucketUnionRDDPartition: Partition
  5. BucketUnionStrategy: SparkStrategy that is used when the Logical Plan is being converted to SparkPlan; More specifically, converts BucketUnion to BucketUnionExec

Note: You can find more detailed information about Bucketing optimization in Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle

Why are the changes needed?

Spark does not support Union using PartitionSpecification, but just PartitionerAwareUnionRDD operation which does not retain outputPartitioning of result. Therefore, we define a new Union operation (being called the BucketUnion) which works when the following conditions are satisfied:

  • input RDDs must have the same number of partitions.
  • input RDDs must have the same partitioning keys.
  • input RDDs must have the same column schema.

Unfortunately, since there is no explicit API to check Partitioning keys in RDD, we have to assure that on the caller side. Therefore, BucketUnionRDD is Hyperspace internal use only.

BucketUnion can be used to merge index data & newly appended data without losing the bucketing specification (outputPartitioning).

Does this PR introduce any user-facing change?

Existing experience:
When the underlying data changes, Hyperspace decides to not use the index anymore.

New experience:
When the underlying data changes and hybrid scan is enabled, Hyperspace utilizes the index to the extent possible and performs a linear scan on the new data.

How was this patch tested?

BucketUnionTest

@imback82 imback82 added the enhancement New feature or request label Sep 11, 2020
@imback82 imback82 added this to the 0.3.0 milestone Sep 11, 2020
@rapoth rapoth modified the milestones: 0.3.0, 0.4.0 Sep 11, 2020
Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments, but LGTM, thanks @sezruby!

@imback82
Copy link
Contributor

@apoorvedave1 @rapoth @pirz Can you review as well?

@imback82
Copy link
Contributor

@sezruby Could you update the title to be a bit more descriptive? (it's used as a commit message).

@rapoth rapoth changed the title Add BucketUnion Hybrid scan operator for leveraging index alongside newly appended data - BucketUnion Sep 11, 2020
apoorvedave1
apoorvedave1 previously approved these changes Sep 11, 2020
Copy link
Contributor

@apoorvedave1 apoorvedave1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor comments, otherwise LGTM, thanks @sezruby


test("BucketUnion require test") {
import spark.implicits._
val df1 = Seq((1, "name1"), (2, "name2")).toDF("id", "name")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these DF definitions are repeated per test; Do you think it is possible to define them once at the class level and initialize them in beforeAll? (similar to partitionedDataDF and nonPartitionedDataDF in CreateIndexTests.scala).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually prefer the current way. The dfs are simple enough and it makes it easier to read in this scope; I don't have to go back and forth to remember how it was defined.

Copy link
Contributor

@rapoth rapoth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for opening a short and consise PR! I really appreciate it!

Co-authored-by: Rahul Potharaju <rapoth@microsoft.com>
Co-authored-by: Apoorve Dave <66283785+apoorvedave1@users.noreply.github.com>
@sezruby sezruby dismissed stale reviews from apoorvedave1 via e2595df September 13, 2020 02:57
sezruby and others added 2 commits September 13, 2020 11:58
Co-authored-by: Rahul Potharaju <rapoth@microsoft.com>
@apoorvedave1 apoorvedave1 self-requested a review September 14, 2020 15:45
Copy link
Contributor

@apoorvedave1 apoorvedave1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 , thanks @sezruby

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @sezruby!

@imback82 imback82 merged commit 1c3b020 into microsoft:master Sep 14, 2020
@sezruby sezruby deleted the hybridscan_1bucket branch September 17, 2020 13:29
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants