
[SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message #4779


Closed
wants to merge 3 commits into apache:master from tdas:kafka-python-fix

Conversation

@tdas (Contributor) commented Feb 26, 2015

The problem with SPARK-6027, in short, is that JARs like the kafka-assembly JAR do not work in Python because the added JAR is not visible to the classloader used by Py4J. Py4J uses Class.forName(), which does not consult the thread's context classloader, but the added JARs are only visible through that context classloader. So this patch uses the context class loader to create the KafkaUtils DStream object. This works whether the Kafka libraries are added with --jars spark-streaming-kafka-assembly.jar or with --packages spark-streaming-kafka.

Also improves the error message.
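Roughly, the workaround looks like this (a sketch only, assuming a StreamingContext `ssc` and arguments `jparam`, `jtopics`, `jlevel` already converted for Py4J):

```python
# Sketch of the workaround: load the helper through the JVM thread's
# context classloader, which can see JARs added via --jars / --packages,
# instead of relying on Py4J's Class.forName() lookup.
helper_class = ssc._jvm.java.lang.Thread.currentThread() \
    .getContextClassLoader() \
    .loadClass("org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper")
helper = helper_class.newInstance()
jstream = helper.createStream(ssc._jssc, jparam, jtopics, jlevel)
```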

/cc @davies


SparkQA commented Feb 26, 2015

Test build #27980 has started for PR 4779 at commit 7b88be8.

  • This patch merges cleanly.


SparkQA commented Feb 26, 2015

Test build #27980 has finished for PR 4779 at commit 7b88be8.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KafkaUtilsPythonHelper

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27980/
Test FAILed.

@jerryshao (Contributor)

Hi @tdas, a stupid question: why is the previous way of using --driver-class-path not good? If we use --jars, we have to create a helper function for Python for each case, so I'm a little confused.

@@ -63,20 +63,32 @@ def createStream(ssc, zkQuorum, groupId, topics, kafkaParams={},
jparam = MapConverter().convert(kafkaParams, ssc.sparkContext._gateway._gateway_client)
Contributor:
remove java_import()

@tdas (Contributor, Author) commented Feb 26, 2015

@jerryshao the previous way required --driver-class-path as well as --jars (or --driver-class-path and --executor-class-path). The existing message was incorrect; @davies and I missed it in the original PR because we probably never tested in distributed mode. Instead, --jars is supposed to be a single solution for adding any JAR to both the driver and the executors. It works for Scala/Java but does not work correctly for Python, hence this workaround.
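For example, after this change a PySpark user should only need something like `bin/spark-submit --jars spark-streaming-kafka-assembly.jar app.py`, where app.py is a hypothetical script along these lines (host, group, and topic names are made up):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaExample")
ssc = StreamingContext(sc, 2)  # 2-second batch interval

# topics maps each Kafka topic name to the number of consumer threads
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "example-group",
                                 {"example-topic": 1})
stream.pprint()

ssc.start()
ssc.awaitTermination()
```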


SparkQA commented Feb 26, 2015

Test build #27984 has started for PR 4779 at commit c1fdf35.

  • This patch merges cleanly.

@davies (Contributor) commented Feb 26, 2015

LGTM, the helper class actually simplifies things.


SparkQA commented Feb 26, 2015

Test build #27987 has started for PR 4779 at commit fb16b04.

  • This patch merges cleanly.


SparkQA commented Feb 26, 2015

Test build #27984 has finished for PR 4779 at commit c1fdf35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KafkaUtilsPythonHelper

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27984/
Test PASSed.


SparkQA commented Feb 26, 2015

Test build #27987 has finished for PR 4779 at commit fb16b04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KafkaUtilsPythonHelper

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27987/
Test PASSed.

* classOf[KafkaUtilsPythonHelper].newInstance(), and the createStream()
* takes care of known parameters instead of passing them from Python
*/
private[kafka]
Contributor:

Does it still work if this is strictly private? Just wondering.

Contributor Author:

I am not sure whether I would be able to call it from Python if this were strictly private. I haven't tested it, though. I don't think that's a big deal, as long as it's hidden from public view.

@tdas tdas changed the title [SPARK-6027][SPARK-5546] Fixed --jar not working for KafkaUtils and improved error message [SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message Feb 26, 2015
@andrewor14 (Contributor)

LGTM. As discussed with @tdas offline, we're gonna change it to private. This is going into master and 1.3.

@asfgit asfgit closed this in aa63f63 Feb 26, 2015
asfgit pushed a commit that referenced this pull request Feb 26, 2015
[SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message


Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4779 from tdas/kafka-python-fix and squashes the following commits:

fb16b04 [Tathagata Das] Removed import
c1fdf35 [Tathagata Das] Fixed long line and improved documentation
7b88be8 [Tathagata Das] Fixed --jar not working for KafkaUtils and improved error message

(cherry picked from commit aa63f63)
Signed-off-by: Andrew Or <andrew@databricks.com>
asfgit pushed a commit that referenced this pull request Mar 14, 2016
…oading issue

This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: py4j/py4j@0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
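In other words, once Py4J itself resolves classes through the context classloader, the loadClass workaround from this PR becomes unnecessary; a rough sketch of the simplified access pattern (assuming the helper class name stays the same):

```python
# With Py4J >= 0.9.2, reflection honors the context classloader, so a
# helper added via --jars / --packages is reachable directly through the
# JVM view, without an explicit loadClass():
helper = ssc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
jstream = helper.createStream(ssc._jssc, jparam, jtopics, jlevel)
```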
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Mar 17, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016