SPARK-1380: Add sort-merge based cogroup/joins. #283


Closed
wants to merge 2 commits

Conversation

@ueshin (Member) commented Apr 1, 2014

I've written cogroup/joins based on the sort-merge algorithm.
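
Roughly, a sort-merge join walks two key-sorted inputs in lockstep and only buffers the values of the current key. Below is a minimal, hypothetical Scala sketch of that idea (not the code in this patch); `sortMergeJoin` and its signature are illustrative only, and it assumes both inputs are sorted by the same `Ordering[K]` and that the values for any single key fit in memory.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch, not the patch itself: inner-join two key-sorted
// iterators. Only the values of the current key are buffered, so output
// can be emitted as soon as a matching key group has been read on both sides.
def sortMergeJoin[K, V, W](
    leftSorted: Iterator[(K, V)],
    rightSorted: Iterator[(K, W)])(implicit ord: Ordering[K]): Iterator[(K, (V, W))] = {
  val left = leftSorted.buffered
  val right = rightSorted.buffered

  new Iterator[(K, (V, W))] {
    private var pending: Iterator[(K, (V, W))] = Iterator.empty

    // Advance both sides until they agree on a key, then stage that key's cross product.
    private def fill(): Unit = {
      while (!pending.hasNext && left.hasNext && right.hasNext) {
        val cmp = ord.compare(left.head._1, right.head._1)
        if (cmp < 0) left.next()        // left key is smaller: skip it
        else if (cmp > 0) right.next()  // right key is smaller: skip it
        else {
          val key = left.head._1
          val lVals = ArrayBuffer.empty[V]
          while (left.hasNext && ord.equiv(left.head._1, key)) lVals += left.next()._2
          val rVals = ArrayBuffer.empty[W]
          while (right.hasNext && ord.equiv(right.head._1, key)) rVals += right.next()._2
          pending = (for (v <- lVals; w <- rVals) yield (key, (v, w))).iterator
        }
      }
    }

    def hasNext: Boolean = { fill(); pending.hasNext }
    def next(): (K, (V, W)) = { fill(); pending.next() }
  }
}
```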

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished. All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13626/

@rxin (Contributor) commented Apr 1, 2014

Is there a specific use case you are trying to address that cannot be handled by the hash join?

@mridulm (Contributor) commented Apr 2, 2014

I have not done a detailed review, but this looks pretty expensive in terms of memory.
Is it assuming there is no skew for any key, and that the data in each partition can be held entirely in memory?
It would be good to document the constraints of this solution.

@ueshin (Member, Author) commented Apr 3, 2014

@rxin Thank you for your reply.

There are some cases where a merge join helps with optimization:

  1. If the data to be joined are already sorted by the join keys, a merge join can be performed more efficiently than a hash join. In my test case both algorithms ran at almost the same speed, but the merge join scaled better.
  2. A merge join over data sorted by the same keys can be pipelined (each output tuple can be produced as soon as the matching input tuples arrive), even for an N-way join (N > 2); see the toy illustration below. A hash join blocks while building its hash table before any output is produced.

I think it is useful to let users choose how to optimize their processing.
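
As a toy illustration of the pipelining point, reusing the hypothetical `sortMergeJoin` sketch from the earlier comment (so purely illustrative): the first joined pair can be observed without draining either input.

```scala
// Both toy inputs are already sorted by key.
val left  = Iterator((1, "a"), (2, "b"), (3, "c"))
val right = Iterator((1, "x"), (3, "y"), (4, "z"))

val joined = sortMergeJoin(left, right)
// The first pair is emitted as soon as key 1 matches on both sides,
// without reading the rest of either iterator.
println(joined.next())  // (1,(a,x))
```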

@ueshin (Member, Author) commented Apr 3, 2014

@mridulm Thank you for your reply.

There are two points to mention about memory:

  • Before the shuffle
    If the data are already sorted, no extra memory is needed because no sort is required; if they are not sorted, the merge join needs some memory to sort the data within each partition.
  • After the shuffle
    While fetching data, the merge join needs at most the same amount of memory as a hash join, and no more after that, because it can produce output immediately from its input. A hash join needs additional memory to build its hash table (see the simplified sketch below for the contrast).
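
For contrast, here is a simplified, hypothetical sketch of a hash-based join (not Spark's actual implementation): it has to materialize one entire side in a hash table before producing any output, which is the extra memory referred to above.

```scala
import scala.collection.mutable

// Simplified sketch for contrast, not Spark's hash join: one side is
// fully materialized in a hash table before any output is produced.
def hashJoin[K, V, W](
    left: Iterator[(K, V)],
    right: Iterator[(K, W)]): Iterator[(K, (V, W))] = {
  // Build phase: the whole left side is held in memory.
  val table = mutable.HashMap.empty[K, mutable.ArrayBuffer[V]]
  for ((k, v) <- left) {
    table.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V]) += v
  }
  // Probe phase: output can only start once the table is complete.
  right.flatMap { case (k, w) =>
    table.getOrElse(k, mutable.ArrayBuffer.empty[V]).iterator.map(v => (k, (v, w)))
  }
}
```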

@nchammas (Contributor)
@pwendell @rxin @mateiz What is the status of this PR? It looks pretty substantial, but it hasn't been updated in a while.

@pwendell (Contributor)
I'd suggest we close this issue for now and go to the JIRA to discuss whether the feature is needed and how high of a priority it is.

@asfgit closed this in f73b56f on Nov 10, 2014