[SPARK-14317] [SQL] Cleanup hash join #12102

Closed
wants to merge 5 commits into from

Contributor

@davies davies commented Mar 31, 2016

What changes were proposed in this pull request?

This PR makes a few cleanups to HashedRelation and HashJoin:

  1. Merge HashedRelation and UniqueHashedRelation.
  2. Return an iterator from HashedRelation, so we do not need to create many UnsafeRow objects.
  3. Return a copy of HashedRelation for thread safety in BroadcastJoin, so we can reuse the UnsafeRow objects.
  4. Clean up HashJoin and share most of the code between BroadcastHashJoin and ShuffleHashJoin.
  5. Remove UniqueLongHashedRelation, which will be replaced by LongUnsafeMap (another PR).
  6. Update the benchmark; before this patch, the selectivity of the joins was too high.
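
As a rough illustration of items 2 and 3, here is a simplified sketch of the idea, not the code in this patch. The names `MutableRow`, `SketchRelation`, and `readOnlyCopy` are hypothetical, and plain `Long` values stand in for UnsafeRow data. Lookups return an iterator that reuses a single row object, and each thread takes its own copy so that the reused row is never shared across threads.

```scala
// Hypothetical, simplified model of the idea; NOT Spark's actual HashedRelation.

// A mutable row object that is reused across lookups (stands in for UnsafeRow).
final class MutableRow(var key: Long = 0L, var value: Long = 0L)

final class SketchRelation private (
    // Immutable data shared by every copy: key -> matching values.
    private val data: Map[Long, Array[Long]]) {

  // Per-instance scratch row. Every row handed out by get() is this object.
  private val reusedRow = new MutableRow()

  // Returns an iterator over the rows matching `key` (empty if there are none).
  // The caller must consume each row before advancing, because the same
  // MutableRow instance is overwritten on every call to next().
  def get(key: Long): Iterator[MutableRow] =
    data.get(key) match {
      case Some(values) =>
        values.iterator.map { v =>
          reusedRow.key = key
          reusedRow.value = v
          reusedRow
        }
      case None => Iterator.empty
    }

  // Cheap copy: shares the immutable data but owns a fresh scratch row,
  // so each thread can reuse its own row without racing with other threads.
  def readOnlyCopy(): SketchRelation = new SketchRelation(data)
}

object SketchRelation {
  // Builds a relation from (key, value) pairs; duplicate keys are allowed.
  def apply(rows: Seq[(Long, Long)]): SketchRelation =
    new SketchRelation(
      rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).toArray })
}
```

In this sketch, a task would call `readOnlyCopy()` once on the shared relation and then probe its own copy with `get(key)`, reading each returned row before calling `next()` again.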

How was this patch tested?

Existing tests.

@davies
Contributor Author

davies commented Mar 31, 2016

@SparkQA

SparkQA commented Apr 1, 2016

Test build #54674 has finished for PR 12102 at commit 52ed299.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 1, 2016

Test build #54683 has finished for PR 12102 at commit 37724be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Join w long codegen=true 275 / 352 76.2 13.1 19.4X
*/

runBenchmark("Join w long duplicated", N) {
Contributor

what does long duplicated mean? do you mean non-unique key?

Contributor Author

Yes

@SparkQA

SparkQA commented Apr 3, 2016

Test build #54795 has finished for PR 12102 at commit 0771d8b.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 3, 2016

Test build #54796 has finished for PR 12102 at commit e519a41.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Apr 4, 2016

LGTM

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
@davies
Contributor Author

davies commented Apr 4, 2016

Merging this into master, thanks!

@asfgit asfgit closed this in 7454253 Apr 4, 2016
@SparkQA

SparkQA commented Apr 4, 2016

Test build #54863 has finished for PR 12102 at commit 0bd71c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2016

Test build #54864 has finished for PR 12102 at commit 9690304.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies davies changed the title [SPARK-14137] [SQL] Cleanup hash join [SPARK-14317] [SQL] Cleanup hash join Apr 6, 2016
3 participants