Hybrid scan operator for leveraging index alongside newly appended data - BucketUnion #151

sezruby · 2020-09-11T02:31:33Z

What changes were proposed in this pull request?

For full context on the "why" for this PR, please see the main issue: #150

This PR introduces a new BucketUnion operator for implementing hybrid scan, a technique we propose that uses both the index and newly appended data without re-shuffling the index, thus preserving the benefits of the index.

As part of this, the following classes are being implemented:

BucketUnion: Logical Plan operator; Used during logical plan optimization when the newly appended data needs to be union-ed with the data being read from the index.
BucketUnionExec: SparkPlan (Physical operator); Calls into BucketUnionRDD
BucketUnionRDD: RDD operator
BucketUnionRDDPartition: Partition
BucketUnionStrategy: SparkStrategy that is used when the Logical Plan is being converted to SparkPlan; More specifically, converts BucketUnion to BucketUnionExec

Note: You can find more detailed information about Bucketing optimization in Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle

Why are the changes needed?

Spark does not support a union that honors a partitioning specification. The closest built-in operation, PartitionerAwareUnionRDD, does not retain the outputPartitioning of its result. We therefore define a new union operation, BucketUnion, which is valid when the following conditions are satisfied:

input RDDs must have the same number of partitions.
input RDDs must have the same partitioning keys.
input RDDs must have the same column schema.

Unfortunately, since there is no explicit API to check the partitioning keys of an RDD, the caller has to ensure that this condition holds. For that reason, BucketUnionRDD is for Hyperspace internal use only.
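The three conditions above can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the BucketUnionRDD implementation from this PR: `Bucketed` and `BucketUnionSketch` are hypothetical names made up for illustration, and the bucket keys are modeled as an explicit field so they can be checked, which a real RDD does not allow.

```scala
// Hypothetical stand-in for a bucketed dataset; not a Spark or Hyperspace type.
final case class Bucketed[A](
    schema: Seq[String],     // column names
    bucketKeys: Seq[String], // partitioning keys (explicit here, unlike in an RDD)
    buckets: Vector[Seq[A]]) // buckets(i) holds the rows of partition i

object BucketUnionSketch {
  // Concatenate the inputs bucket by bucket: bucket i of the result contains
  // only rows from bucket i of each input, so no row changes partition and
  // the bucketing of the result matches the bucketing of the inputs.
  def bucketUnion[A](inputs: Seq[Bucketed[A]]): Bucketed[A] = {
    require(inputs.nonEmpty, "at least one input is required")
    val head = inputs.head
    require(
      inputs.forall(_.buckets.length == head.buckets.length),
      "inputs must have the same number of partitions")
    require(
      inputs.forall(_.bucketKeys == head.bucketKeys),
      "inputs must have the same partitioning keys")
    require(
      inputs.forall(_.schema == head.schema),
      "inputs must have the same column schema")

    val merged = head.buckets.indices
      .map(i => inputs.flatMap(_.buckets(i)))
      .toVector
    head.copy(buckets = merged)
  }
}
```

Calling `BucketUnionSketch.bucketUnion(Seq(indexData, appendedData))` on two inputs with the same bucket count, keys, and schema returns one `Bucketed` value with the same number of buckets; any mismatch fails the corresponding `require`.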

BucketUnion can be used to merge index data & newly appended data without losing the bucketing specification (outputPartitioning).

Does this PR introduce any user-facing change?

Existing experience:
When the underlying data changes, Hyperspace decides to not use the index anymore.

New experience:
When the underlying data changes and hybrid scan is enabled, Hyperspace utilizes the index to the extent possible and performs a linear scan on the new data.
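The difference between the two experiences can be sketched as a small decision function. This is an illustration only, not Hyperspace's rule logic: `ScanPlan`, `HybridScanSketch`, and the `hybridScanEnabled` parameter are hypothetical names, and falling back to the source when files were deleted is an assumption, since this PR only discusses appended data.

```scala
// Hypothetical plan choices; not Hyperspace types.
sealed trait ScanPlan
case object IndexOnly extends ScanPlan
case object SourceOnly extends ScanPlan
final case class IndexPlusAppended(appendedFiles: Set[String]) extends ScanPlan

object HybridScanSketch {
  // indexedFiles: source files that were present when the index was built.
  // currentFiles: source files present at query time.
  def choosePlan(
      indexedFiles: Set[String],
      currentFiles: Set[String],
      hybridScanEnabled: Boolean): ScanPlan = {
    val appended = currentFiles -- indexedFiles
    val deleted = indexedFiles -- currentFiles
    if (appended.isEmpty && deleted.isEmpty) {
      IndexOnly // data unchanged: the index covers everything
    } else if (hybridScanEnabled && deleted.isEmpty) {
      IndexPlusAppended(appended) // index plus a linear scan of the new files
    } else {
      SourceOnly // existing experience: the index is not used
    }
  }
}
```

In the `IndexPlusAppended` case, the rows scanned from the appended files are what a BucketUnion would merge with the index data.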

How was this patch tested?

BucketUnionTest

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala

imback82

Few minor comments, but LGTM, thanks @sezruby!

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala

imback82 · 2020-09-11T15:29:48Z

@apoorvedave1 @rapoth @pirz Can you review as well?

imback82 · 2020-09-11T15:30:55Z

@sezruby Could you update the title to be a bit more descriptive? (it's used as a commit message).

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala

apoorvedave1

a few minor comments, otherwise LGTM, thanks @sezruby

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala

pirz · 2020-09-11T22:38:20Z

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala

+
+  test("BucketUnion require test") {
+    import spark.implicits._
+    val df1 = Seq((1, "name1"), (2, "name2")).toDF("id", "name")


Some of these DF definitions are repeated per test. Do you think it is possible to define them once at the class level and initialize them in beforeAll (similar to partitionedDataDF and nonPartitionedDataDF in CreateIndexTests.scala)?

I actually prefer the current way. The dfs are simple enough and it makes it easier to read in this scope; I don't have to go back and forth to remember how it was defined.

rapoth

Thanks a lot for opening a short and concise PR! I really appreciate it!

src/main/scala/com/microsoft/hyperspace/index/execution/BucketUnionExec.scala

src/main/scala/com/microsoft/hyperspace/index/plans/logical/BucketUnion.scala

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala

Co-authored-by: Rahul Potharaju <rapoth@microsoft.com> Co-authored-by: Apoorve Dave <66283785+apoorvedave1@users.noreply.github.com>

Co-authored-by: Rahul Potharaju <rapoth@microsoft.com>

apoorvedave1

LGTM 👍 , thanks @sezruby

imback82

LGTM, thanks @sezruby!

Add BucketUnion operator

5e69032

sezruby mentioned this pull request Sep 11, 2020

Hybrid Scan for File/Partition Mutable Datasets #150

Closed

7 tasks

imback82 assigned sezruby Sep 11, 2020

imback82 added the enhancement ("New feature or request") label Sep 11, 2020

imback82 added this to the 0.3.0 milestone Sep 11, 2020

Fix assert in BucketUnionExec

18bc7e5

imback82 reviewed Sep 11, 2020


Review commit

15c0a7e

rapoth modified the milestones: 0.3.0, 0.4.0 Sep 11, 2020

imback82 reviewed Sep 11, 2020


Merge branch 'master' into hybridscan_1bucket

90c0e8c

imback82 requested review from imback82, apoorvedave1, pirz and rapoth September 11, 2020 15:29

rapoth changed the title ~~Add BucketUnion~~ Hybrid scan operator for leveraging index alongside newly appended data - BucketUnion Sep 11, 2020

apoorvedave1 reviewed Sep 11, 2020

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala (outdated)

apoorvedave1 reviewed Sep 11, 2020

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala (outdated)

apoorvedave1 previously approved these changes Sep 11, 2020


pirz reviewed Sep 11, 2020

src/test/scala/com/microsoft/hyperspace/index/BucketUnionTest.scala (outdated)

pirz reviewed Sep 11, 2020


rapoth reviewed Sep 11, 2020


Apply suggestions from code review

e2595df

Co-authored-by: Rahul Potharaju <rapoth@microsoft.com> Co-authored-by: Apoorve Dave <66283785+apoorvedave1@users.noreply.github.com>

sezruby dismissed stale reviews from apoorvedave1 via e2595df September 13, 2020 02:57

sezruby and others added 2 commits September 13, 2020 11:58

Apply suggestions from code review

c1505c9

Co-authored-by: Rahul Potharaju <rapoth@microsoft.com>

Review commit

9830649

apoorvedave1 self-requested a review September 14, 2020 15:45

apoorvedave1 approved these changes Sep 14, 2020


imback82 approved these changes Sep 14, 2020


imback82 merged commit 1c3b020 into microsoft:master Sep 14, 2020

sezruby mentioned this pull request Sep 15, 2020

Modify logical plan to merge newly appended files and index data #165

Merged

sezruby deleted the hybridscan_1bucket branch September 17, 2020 13:29
