
[SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd #2369


Closed
wants to merge 1 commit into from

Conversation

davies (Contributor) commented Sep 12, 2014

Currently, SchemaRDD._jschema_rdd is a Scala SchemaRDD, so Scala API methods such as coalesce() and repartition() cannot easily be called from Python: there is no way to supply the implicit parameter ord. Since _jrdd is a JavaRDD, _jschema_rdd should likewise be a JavaSchemaRDD.

This patch changes _jschema_rdd to a JavaSchemaRDD and adds an assert for it. If a method is missing from JavaSchemaRDD, it is invoked through _jschema_rdd.baseSchemaRDD().xxx().

BTW, do we need JavaSQLContext?
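The fallback described above (calling through _jschema_rdd.baseSchemaRDD().xxx() when a method is missing from JavaSchemaRDD) amounts to a delegation pattern. A minimal pure-Python sketch of that pattern, with class names mimicking the Spark ones but bodies that are purely illustrative:

```python
# Illustrative sketch only, not the actual PySpark code: prefer methods
# on the Java-friendly wrapper, and fall back to the underlying Scala
# object for anything the wrapper lacks.

class BaseSchemaRDD:
    """Stands in for the Scala SchemaRDD."""
    def coalesce(self, n, shuffle=False):
        return f"coalesced to {n} partitions (shuffle={shuffle})"

class JavaSchemaRDD:
    """Stands in for the Java wrapper; only some methods are mirrored."""
    def __init__(self, base):
        self._base = base
    def repartition(self, n):
        return f"repartitioned to {n} partitions"
    def baseSchemaRDD(self):
        return self._base

class SchemaRDD:
    def __init__(self, jschema_rdd):
        # The assert added by the patch: _jschema_rdd must be the wrapper.
        assert isinstance(jschema_rdd, JavaSchemaRDD), \
            "_jschema_rdd must be a JavaSchemaRDD"
        self._jschema_rdd = jschema_rdd

    def __getattr__(self, name):
        # Try the wrapper first, then fall back to baseSchemaRDD().xxx().
        try:
            return getattr(self._jschema_rdd, name)
        except AttributeError:
            return getattr(self._jschema_rdd.baseSchemaRDD(), name)

srdd = SchemaRDD(JavaSchemaRDD(BaseSchemaRDD()))
print(srdd.repartition(3))     # found directly on the wrapper
print(srdd.coalesce(2, True))  # falls back to baseSchemaRDD()
```

The real implementation dispatches through Py4J rather than `__getattr__`, but the lookup order is the same: wrapper first, Scala object as the fallback.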

SparkQA commented Sep 12, 2014

QA tests have started for PR 2369 at commit abee159.

  • This patch merges cleanly.

SparkQA commented Sep 12, 2014

QA tests have finished for PR 2369 at commit abee159.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

JoshRosen (Contributor):

This looks good to me. There's some ongoing discussion on the JIRA over whether this should be included in 1.1.1.

JoshRosen (Contributor):

I think this is clearly a bug, not a missing feature, since SchemaRDD instances expose a public method that always throws an exception when called. I'd like to merge this into master and branch-1.1.

asfgit closed this in 885d162 on Sep 13, 2014
JoshRosen (Contributor):

Backported into branch-1.1 (a couple of minor merge conflicts, but only in tests.py; I fixed them by hand).


srdd = srdd.coalesce(2, True)
srdd = srdd.repartition(3)
srdd = srdd.distinct()
Contributor:

@davies Shouldn't we also test srdd.distinct(n) since that was the missing functionality documented in SPARK-3500?

Contributor:

Fair point, although if srdd.distinct() works then srdd.distinct(n) should also work due to how distinct() and this fix were implemented.

Contributor:

OK, I'll take your word for it but just point out that one of the reported issues in SPARK-3500 was specifically that distinct() worked but distinct(n) didn't. Since that is a possible failure mode, it probably makes sense to have a test for each.

Contributor:

Actually, it looks like we don't support distinct(n) in PySpark (the original ticket dealt with distinct() and coalesce() simply not working). Let's open a separate JIRA for that.

Contributor:

Yeah, a separate issue makes sense (I actually suggested that in the JIRA ticket). But to clarify, the ticket was originally about coalesce() not working, then repartition() and distinct(n) were added on.

distinct() with no parameters was always working. There was no question about that.

Contributor (Author):

distinct(n) is a missing API; we can fix it in a separate issue or defer it.
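For context on what distinct(n) would add: a distinct that accepts a partition count is typically implemented by bucketing elements across partitions and deduplicating within each bucket. A minimal pure-Python sketch of that idea (not the PySpark API; the function and parameter names here are hypothetical):

```python
# Illustrative sketch of threading an optional numPartitions argument
# through a distinct() implementation, using plain Python collections
# in place of RDD partitions.
from collections import defaultdict

def distinct(items, num_partitions=None):
    # Bucket each element by hash, mimicking a shuffle into
    # num_partitions partitions (default: a single partition).
    n = num_partitions or 1
    buckets = defaultdict(list)
    for x in items:
        buckets[hash(x) % n].append(x)
    # Deduplicate within each bucket, preserving first-seen order.
    out = []
    for part in buckets.values():
        seen = set()
        for x in part:
            if x not in seen:
                seen.add(x)
                out.append(x)
    return out

print(sorted(distinct([1, 2, 2, 3, 3, 3], num_partitions=2)))  # [1, 2, 3]
```

On real RDDs the equivalent is a map to key-value pairs, a reduceByKey with the requested partition count, and a map back, which is why exposing the extra parameter is a small change.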

davies deleted the fix_schemardd branch on September 15, 2014

4 participants