-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd #2369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
QA tests have started for PR 2369 at commit
|
QA tests have finished for PR 2369 at commit
|
This looks good to me. There's some ongoing discussion on the JIRA over whether this should be included in 1.1.1. |
I think this is clearly a bug, not a missing feature, since SchemaRDD instances expose a public method that always throws an exception when called. I'd like to merge this into |
Backported into |
|
||
srdd = srdd.coalesce(2, True) | ||
srdd = srdd.repartition(3) | ||
srdd = srdd.distinct() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davies Shouldn't we also test srdd.distinct(n)
since that was the missing functionality documented in SPARK-3500?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, although if srdd.distinct()
works then srdd.distinct(n)
should also work due to how distinct() and this fix were implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll take your word for it but just point out that one of the reported issues in SPARK-3500 was specifically that distinct()
worked but distinct(n)
didn't. Since that is a possible failure mode, it probably makes sense to have a test for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it looks like we don't support distinct(n)
in PySpark (the original ticket dealt with distinct()
and coalesce()
simply not working). Let's open a separate JIRA for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a separate issue makes sense (I actually suggested that in the JIRA ticket). But to clarify, the ticket was originally about coalesce()
not working, then repartition()
and distinct(n)
were added on.
distinct()
with no parameters was always working. There was no question about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
distinct(n) is a missing API, we could fix it in another issue or delay it later.
Currently, SchemaRDD._jschema_rdd is SchemaRDD, the Scala API (coalesce(), repartition()) can not been called in Python easily, there is no way to specify the implicit parameter
ord
. The _jrdd is an JavaRDD, so _jschema_rdd should also be JavaSchemaRDD.In this patch, change _schema_rdd to JavaSchemaRDD, also added an assert for it. If some methods are missing from JavaSchemaRDD, then it's called by _schema_rdd.baseSchemaRDD().xxx().
BTW, Do we need JavaSQLContext?