SPARK-1380: Add sort-merge based cogroup/joins. #283


Closed
wants to merge 2 commits

Conversation

@ueshin (Member) commented Apr 1, 2014

I've written cogroup/joins based on the sort-merge algorithm.
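
Roughly, a sort-merge join walks two key-sorted inputs in lockstep and only buffers the values of the current key. Below is a minimal, hypothetical Scala sketch of that idea (not the code in this patch); `sortMergeJoin` and its signature are illustrative only, and it assumes both inputs are sorted by the same `Ordering[K]` and that the values for any single key fit in memory.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch, not the patch itself: inner-join two key-sorted
// iterators. Only the values of the current key are buffered, so output
// can be emitted as soon as a matching key group has been read on both sides.
def sortMergeJoin[K, V, W](
    leftSorted: Iterator[(K, V)],
    rightSorted: Iterator[(K, W)])(implicit ord: Ordering[K]): Iterator[(K, (V, W))] = {
  val left = leftSorted.buffered
  val right = rightSorted.buffered

  new Iterator[(K, (V, W))] {
    private var pending: Iterator[(K, (V, W))] = Iterator.empty

    // Advance both sides until they agree on a key, then stage that key's cross product.
    private def fill(): Unit = {
      while (!pending.hasNext && left.hasNext && right.hasNext) {
        val cmp = ord.compare(left.head._1, right.head._1)
        if (cmp < 0) left.next()        // left key is smaller: skip it
        else if (cmp > 0) right.next()  // right key is smaller: skip it
        else {
          val key = left.head._1
          val lVals = ArrayBuffer.empty[V]
          while (left.hasNext && ord.equiv(left.head._1, key)) lVals += left.next()._2
          val rVals = ArrayBuffer.empty[W]
          while (right.hasNext && ord.equiv(right.head._1, key)) rVals += right.next()._2
          pending = (for (v <- lVals; w <- rVals) yield (key, (v, w))).iterator
        }
      }
    }

    def hasNext: Boolean = { fill(); pending.hasNext }
    def next(): (K, (V, W)) = { fill(); pending.next() }
  }
}
```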

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished. All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13626/

@rxin (Contributor) commented Apr 1, 2014

Is there a specific use case you are trying to address that cannot be handled by the hash join?

@mridulm (Contributor) commented Apr 2, 2014

I have not done a detailed review, but this looks pretty expensive in terms of memory.
Is it assuming there is no skew for any key, and that the data in each partition can be held entirely in memory?
It would be good to document the constraints of this solution.

@ueshin (Member, Author) commented Apr 3, 2014

@rxin Thank you for your reply.

There are some cases where a merge join helps with optimization:

  1. If the data to be joined are already sorted by the join keys, a merge join can be performed more efficiently than a hash join. In my test case both algorithms ran at almost the same speed, but the merge join scaled better.
  2. A merge join over data sorted by the same keys can be pipelined (each output tuple can be produced as soon as the matching input tuples arrive), even for an N-way join (N > 2); see the toy illustration below. A hash join blocks while building its hash table before any output is produced.

I think it is useful to let users choose how to optimize their processing.
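
As a toy illustration of the pipelining point, reusing the hypothetical `sortMergeJoin` sketch from the earlier comment (so purely illustrative): the first joined pair can be observed without draining either input.

```scala
// Both toy inputs are already sorted by key.
val left  = Iterator((1, "a"), (2, "b"), (3, "c"))
val right = Iterator((1, "x"), (3, "y"), (4, "z"))

val joined = sortMergeJoin(left, right)
// The first pair is emitted as soon as key 1 matches on both sides,
// without reading the rest of either iterator.
println(joined.next())  // (1,(a,x))
```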

@ueshin (Member, Author) commented Apr 3, 2014

@mridulm Thank you for your reply.

There are two points to mention about memory:

  • Before the shuffle
    If the data are already sorted, no extra memory is needed because no sort is required; if they are not sorted, the merge join needs some memory to sort the data within each partition.
  • After the shuffle
    While fetching data, the merge join needs at most the same amount of memory as a hash join, and no more after that, because it can produce output immediately from its input. A hash join needs additional memory to build its hash table (see the simplified sketch below for the contrast).
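
For contrast, here is a simplified, hypothetical sketch of a hash-based join (not Spark's actual implementation): it has to materialize one entire side in a hash table before producing any output, which is the extra memory referred to above.

```scala
import scala.collection.mutable

// Simplified sketch for contrast, not Spark's hash join: one side is
// fully materialized in a hash table before any output is produced.
def hashJoin[K, V, W](
    left: Iterator[(K, V)],
    right: Iterator[(K, W)]): Iterator[(K, (V, W))] = {
  // Build phase: the whole left side is held in memory.
  val table = mutable.HashMap.empty[K, mutable.ArrayBuffer[V]]
  for ((k, v) <- left) {
    table.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V]) += v
  }
  // Probe phase: output can only start once the table is complete.
  right.flatMap { case (k, w) =>
    table.getOrElse(k, mutable.ArrayBuffer.empty[V]).iterator.map(v => (k, (v, w)))
  }
}
```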

@nchammas (Contributor)
@pwendell @rxin @mateiz What is the status of this PR? It looks pretty substantial, but it hasn't been updated in a while.

@pwendell (Contributor)
I'd suggest we close this issue for now and go to the JIRA to discuss whether the feature is needed and how high of a priority it is.

@asfgit closed this in f73b56f on Nov 10, 2014