[SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message #4779
Conversation
Test build #27980 has started for PR 4779 at commit …
Test build #27980 has finished for PR 4779 at commit …
Test FAILed.
Hi @tdas, a stupid question: why did the previous way of adding the jars not work?
@@ -63,20 +63,32 @@ def createStream(ssc, zkQuorum, groupId, topics, kafkaParams={},
jparam = MapConverter().convert(kafkaParams, ssc.sparkContext._gateway._gateway_client)
remove java_import()
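For context, here is a minimal, hypothetical sketch of the conversion on the diff line above; it assumes a live StreamingContext `ssc`, and `MapConverter` comes from Py4J's collection helpers:

```python
# Hypothetical sketch of the dict -> java.util.Map conversion shown in the
# diff above; assumes a live StreamingContext `ssc`.
from py4j.java_collections import MapConverter

kafka_params = {"zookeeper.connect": "localhost:2181", "group.id": "demo-group"}

# Py4J needs an explicit converter (plus the gateway client) to turn a
# Python dict into a java.util.Map that the JVM side can consume.
jparam = MapConverter().convert(kafka_params,
                                ssc.sparkContext._gateway._gateway_client)
```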
@jerryshao the previous way required --driver-class-path as well as --jars (or --driver-class-path and --executor-class-path). The existing message was incorrect; @davies and I missed it in the original PR because we probably never tested in distributed mode. Instead, --jars is supposed to be a single solution for adding any jar to both driver and executor. It works for Scala/Java but does not work correctly for Python, hence this workaround.
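To make the workaround concrete, here is a minimal sketch of the approach the commit message below describes: loading the helper through the JVM thread's context class loader rather than relying on Py4J's `Class.forName()` lookup. The class name matches the helper added in this PR; the surrounding function is hypothetical, not the PR's exact code.

```python
# Sketch of the context-class-loader workaround (illustrative). Py4J resolves
# class names with Class.forName(), which consults the system class loader;
# jars added via --jars/--packages are only visible on the thread's context
# class loader, so we load the helper class through that instead.

def _load_kafka_helper(ssc):
    """Instantiate KafkaUtilsPythonHelper through the context class loader."""
    loader = ssc._jvm.java.lang.Thread.currentThread().getContextClassLoader()
    helper_class = loader.loadClass(
        "org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper")
    return helper_class.newInstance()
```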
Test build #27984 has started for PR 4779 at commit …
LGTM, the helper class actually simplifies things.
Test build #27987 has started for PR 4779 at commit …
Test build #27984 has finished for PR 4779 at commit …
Test PASSed.
Test build #27987 has finished for PR 4779 at commit …
Test PASSed.
* classOf[KafkaUtilsPythonHelper].newInstance(), and the createStream()
* takes care of known parameters instead of passing them from Python
*/
private[kafka]
Does it still work if this is strictly private? Just wondering.
I am not sure if I will be able to call it from Python if this is strictly private. I haven't tested it though. I don't think that's a big deal, as long as it's hidden from public view.
LGTM. As discussed with @tdas offline, we're going to change it to …
[SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message

The problem with SPARK-6027, in short, is that JARs like the kafka-assembly.jar do not work in Python because the added JAR is not visible in the class loader used by Py4J. Py4J uses Class.forName(), which does not use the system class loader, but the JARs are only visible in the thread's context class loader. So this fix uses the context class loader to create the KafkaUtils DStream object. This works both when the Kafka libraries are added with --jars spark-streaming-kafka-assembly.jar and when they are added with --packages spark-streaming-kafka.

Also improves the error message.

@davies

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4779 from tdas/kafka-python-fix and squashes the following commits:

fb16b04 [Tathagata Das] Removed import
c1fdf35 [Tathagata Das] Fixed long line and improved documentation
7b88be8 [Tathagata Das] Fixed --jar not working for KafkaUtils and improved error message

(cherry picked from commit aa63f63)
Signed-off-by: Andrew Or <andrew@databricks.com>
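The "improved error message" half of the title can be sketched the same way: if loading the helper fails because the Kafka assembly is missing from the classpath, tell the user how to add it. A hedged sketch follows; the exception handling and the message wording are illustrative assumptions, not the PR's exact text.

```python
# Illustrative sketch of the improved error reporting; the message text and
# the surrounding function are assumptions, not the PR's exact code.
from py4j.protocol import Py4JJavaError

def _create_stream_with_helper(ssc, jparam, jtopics, jlevel):
    try:
        loader = ssc._jvm.java.lang.Thread.currentThread().getContextClassLoader()
        helper_class = loader.loadClass(
            "org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper")
        return helper_class.newInstance().createStream(
            ssc._jssc, jparam, jtopics, jlevel)
    except Py4JJavaError as e:
        # A ClassNotFoundException almost always means the Kafka assembly jar
        # was never added, so point the user at --jars / --packages directly.
        if "ClassNotFoundException" in str(e.java_exception):
            print("Spark Streaming's Kafka libraries not found in class path. "
                  "Add them with --jars spark-streaming-kafka-assembly.jar "
                  "or with --packages.")
        raise
```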
…oading issue

This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: py4j/py4j@0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
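In other words, once Py4J itself resolves classes via the context class loader, the reflective workaround from this PR can be dropped. A hedged before/after sketch (class name as in this PR; assumes a live `ssc`):

```python
# Before (Py4J < 0.9.2): load reflectively via the context class loader,
# because plain attribute access went through Class.forName() and missed
# jars added with --jars/--packages.
loader = ssc._jvm.java.lang.Thread.currentThread().getContextClassLoader()
helper = loader.loadClass(
    "org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper").newInstance()

# After (Py4J >= 0.9.2): plain attribute access works, since Py4J now
# consults the current thread's context class loader itself.
helper = ssc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
```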