
[SPARK-5979][SPARK-6031][SPARK-6032][SPARK-6047] Refactoring for --packages -> Move to SparkSubmitDriverBootstrapper #4754

Closed · wants to merge 18 commits

Conversation

@brkyvz (Contributor) commented Feb 25, 2015

This PR is an umbrella PR for three JIRAs. Here are the explanations:

  • SPARK-5979: All dependencies with the groupId org.apache.spark passed through --packages were being excluded from the dependency tree on the assumption that they would be in the assembly jar. This is not the case, so the exclusion rules had to be defined more explicitly (see the sketch after this list).
  • SPARK-6031: When using pyspark or running a Python program through spark-submit, py4j was not picking up the dynamically loaded jars on the driver. Moving the code for --packages to SparkSubmitDriverBootstrapper solves this; the issue still remains for --jars, however.
  • SPARK-6032: Ivy prints a whole lot of logs to System.out while retrieving dependencies. The logging has been moved to System.err.
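
For SPARK-5979, the fix is to exclude only the modules actually shipped in the assembly jar, artifact by artifact, instead of blanket-excluding the whole org.apache.spark groupId. Below is a minimal sketch of that approach against Ivy's exclusion API; the helper name and the module list are illustrative, not the exact patch.

    import org.apache.ivy.core.module.descriptor.{DefaultExcludeRule, DefaultModuleDescriptor}
    import org.apache.ivy.core.module.id.{ArtifactId, ModuleId}
    import org.apache.ivy.plugins.matcher.GlobPatternMatcher

    // Exclude only components known to live in the assembly jar. Artifacts such
    // as spark-streaming-kafka must NOT be excluded -- that is exactly what the
    // old blanket groupId exclusion got wrong.
    def addSparkExclusions(md: DefaultModuleDescriptor, ivyConfName: String): Unit = {
      val componentsInAssembly = Seq("spark-core_2.10", "spark-sql_2.10", "spark-streaming_2.10")
      componentsInAssembly.foreach { artifact =>
        val id = new ArtifactId(new ModuleId("org.apache.spark", artifact), "*", "*", "*")
        val rule = new DefaultExcludeRule(id, GlobPatternMatcher.INSTANCE, null)
        rule.addConfiguration(ivyConfName)
        md.addExcludeRule(rule)
      }
    }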

@tdas Would you care to try this? I think it should solve your problem.

@SparkQA commented Feb 25, 2015

Test build #27927 has finished for PR 4754 at commit 941c65e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CheckAnalysis

@tdas (Contributor) commented Feb 25, 2015

@pwendell Please take a look at this. I think you reviewed the original PR of this feature.

@pwendell (Contributor) commented

LGTM

@brkyvz (Contributor, Author) commented Feb 25, 2015

@tdas I added a hack to include the jars on --driver-extra-classpath. Can you try your test now?
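
For reference, a rough sketch of what that hack amounts to. Here sysProps stands for the mutable property map SparkSubmit builds and resolvedJars for the comma-separated output of the Ivy resolution step; both names are assumptions. The spark.driver.extraClassPath entry in the verbose log below shows this property being set.

    // Append the resolved jars to spark.driver.extraClassPath so the driver
    // JVM launched for pyspark can actually see them.
    val sep = sys.props("path.separator")
    val resolvedClasspath = resolvedJars.split(",").mkString(sep)
    sysProps("spark.driver.extraClassPath") = sysProps.get("spark.driver.extraClassPath") match {
      case Some(existing) => existing + sep + resolvedClasspath
      case None => resolvedClasspath
    }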

@SparkQA commented Feb 25, 2015

Test build #27942 has finished for PR 4754 at commit e3ca1b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 25, 2015

Test build #27943 has finished for PR 4754 at commit 5191f3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) commented Feb 25, 2015

I tested. Still not working. I enabled verbose logging on spark-submit and saw this:

[tdas @ Zion spark2] bin/spark-submit --verbose --master local[4] --repositories https://repository.apache.org/content/repositories/orgapachespark-1069/ --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0 examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
Using properties file: null
Parsed arguments:
  master                  local[4]
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               null
  primaryResource         file:/Users/tdas/Projects/Spark/spark2/examples/src/main/python/streaming/kafka_wordcount.py
  name                    kafka_wordcount.py
  childArgs               [localhost:2181 test]
  jars                    null
  packages                org.apache.spark:spark-streaming-kafka_2.10:1.3.0
  repositories            https://repository.apache.org/content/repositories/orgapachespark-1069/
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file null:



Ivy Default Cache set to: /Users/tdas/.ivy2/cache
The jars for the packages stored in: /Users/tdas/.ivy2/jars
https://repository.apache.org/content/repositories/orgapachespark-1069/ added as a remote repository with the name: repo-1
:: loading settings :: url = jar:file:/Users/tdas/Projects/Spark/spark2/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-kafka_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.apache.spark#spark-streaming-kafka_2.10;1.3.0 in repo-1
    found org.apache.kafka#kafka_2.10;0.8.1.1 in list
    found com.yammer.metrics#metrics-core;2.2.0 in list
    found org.slf4j#slf4j-api;1.7.10 in list
    found org.xerial.snappy#snappy-java;1.1.1.6 in list
    found com.101tec#zkclient;0.3 in list
    found log4j#log4j;1.2.17 in list
    found org.spark-project.spark#unused;1.0.0 in list
:: resolution report :: resolve 370ms :: artifacts dl 17ms
    :: modules in use:
    com.101tec#zkclient;0.3 from list in [default]
    com.yammer.metrics#metrics-core;2.2.0 from list in [default]
    log4j#log4j;1.2.17 from list in [default]
    org.apache.kafka#kafka_2.10;0.8.1.1 from list in [default]
    org.apache.spark#spark-streaming-kafka_2.10;1.3.0 from repo-1 in [default]
    org.slf4j#slf4j-api;1.7.10 from list in [default]
    org.spark-project.spark#unused;1.0.0 from list in [default]
    org.xerial.snappy#snappy-java;1.1.1.6 from list in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   8   |   0   |   0   |   0   ||   8   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 8 already retrieved (0kB/7ms)
Main class:
org.apache.spark.deploy.PythonRunner
Arguments:
file:/Users/tdas/Projects/Spark/spark2/examples/src/main/python/streaming/kafka_wordcount.py
/Users/tdas/.ivy2/jars/spark-streaming-kafka_2.10.jar,/Users/tdas/.ivy2/jars/kafka_2.10.jar,/Users/tdas/.ivy2/jars/unused.jar,/Users/tdas/.ivy2/jars/metrics-core.jar,/Users/tdas/.ivy2/jars/snappy-java.jar,/Users/tdas/.ivy2/jars/zkclient.jar,/Users/tdas/.ivy2/jars/slf4j-api.jar,/Users/tdas/.ivy2/jars/log4j.jar
localhost:2181
test
System properties:
SPARK_SUBMIT -> true
spark.submit.pyFiles -> /Users/tdas/.ivy2/jars/spark-streaming-kafka_2.10.jar,/Users/tdas/.ivy2/jars/kafka_2.10.jar,/Users/tdas/.ivy2/jars/unused.jar,/Users/tdas/.ivy2/jars/metrics-core.jar,/Users/tdas/.ivy2/jars/snappy-java.jar,/Users/tdas/.ivy2/jars/zkclient.jar,/Users/tdas/.ivy2/jars/slf4j-api.jar,/Users/tdas/.ivy2/jars/log4j.jar
spark.files -> file:/Users/tdas/Projects/Spark/spark2/examples/src/main/python/streaming/kafka_wordcount.py,file:/Users/tdas/.ivy2/jars/spark-streaming-kafka_2.10.jar,file:/Users/tdas/.ivy2/jars/kafka_2.10.jar,file:/Users/tdas/.ivy2/jars/unused.jar,file:/Users/tdas/.ivy2/jars/metrics-core.jar,file:/Users/tdas/.ivy2/jars/snappy-java.jar,file:/Users/tdas/.ivy2/jars/zkclient.jar,file:/Users/tdas/.ivy2/jars/slf4j-api.jar,file:/Users/tdas/.ivy2/jars/log4j.jar
spark.app.name -> kafka_wordcount.py
spark.jars -> file:/Users/tdas/.ivy2/jars/spark-streaming-kafka_2.10.jar,file:/Users/tdas/.ivy2/jars/kafka_2.10.jar,file:/Users/tdas/.ivy2/jars/unused.jar,file:/Users/tdas/.ivy2/jars/metrics-core.jar,file:/Users/tdas/.ivy2/jars/snappy-java.jar,file:/Users/tdas/.ivy2/jars/zkclient.jar,file:/Users/tdas/.ivy2/jars/slf4j-api.jar,file:/Users/tdas/.ivy2/jars/log4j.jar
spark.master -> local[4]
spark.driver.extraClassPath -> /Users/tdas/.ivy2/jars/spark-streaming-kafka_2.10.jar,/Users/tdas/.ivy2/jars/kafka_2.10.jar,/Users/tdas/.ivy2/jars/unused.jar,/Users/tdas/.ivy2/jars/metrics-core.jar,/Users/tdas/.ivy2/jars/snappy-java.jar,/Users/tdas/.ivy2/jars/zkclient.jar,/Users/tdas/.ivy2/jars/slf4j-api.jar,/Users/tdas/.ivy2/jars/log4j.jar
Classpath elements:
/Users/tdas/.ivy2/jars/spark-streaming-kafka_2.10.jar
/Users/tdas/.ivy2/jars/kafka_2.10.jar
/Users/tdas/.ivy2/jars/unused.jar
/Users/tdas/.ivy2/jars/metrics-core.jar
/Users/tdas/.ivy2/jars/snappy-java.jar
/Users/tdas/.ivy2/jars/zkclient.jar
/Users/tdas/.ivy2/jars/slf4j-api.jar
/Users/tdas/.ivy2/jars/log4j.jar

So I can see that the relevant jars are being added to the classpath elements, but pyspark is still unable to find org.apache.spark.streaming.kafka.KafkaUtils (from /Users/tdas/.ivy2/jars/spark-streaming-kafka_2.10.jar).

Let's debug this tomorrow morning.
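
One way to narrow this down (a hypothetical check, run in a JVM started with the same classpath, e.g. a spark-shell launched with the same options) is to load the class reflectively. If this succeeds, the jar is visible to the JVM and the failure is on the py4j/pyspark side:

    // Throws ClassNotFoundException if spark-streaming-kafka_2.10.jar is not
    // actually on this JVM's classpath.
    Class.forName("org.apache.spark.streaming.kafka.KafkaUtils")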

@tdas (Contributor) commented Feb 25, 2015

No, I verified the class does exist in the jar.
On Feb 25, 2015 10:07 AM, "Burak Yavuz" notifications@github.com wrote:

nvm, it should be in spark-streaming-kafka_2.10.jar

@SparkQA commented Feb 25, 2015

Test build #27963 has finished for PR 4754 at commit d9e3cf0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Feb 26, 2015

@tdas @pwendell @andrewor14
This is ready for code review. I moved the resolve method to SparkSubmitDriverBootstrapper. In case the bootstrapper is not called, the resolution takes place inside SparkSubmit as before. This should work (tested it with TD's example plus my own package examples).
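
A rough sketch of that control flow; the bootstrapper-detection flag and the variable names are illustrative, while SparkSubmitUtils.resolveMavenCoordinates is the resolver being moved:

    // If SparkSubmitDriverBootstrapper already resolved --packages, reuse its
    // result; otherwise fall back to resolving inside SparkSubmit, as before.
    val resolvedJars: String =
      if (packagesAlreadyResolved) {   // hypothetical flag set by the bootstrapper
        preResolvedPackages            // jar list handed down by the bootstrapper
      } else {
        SparkSubmitUtils.resolveMavenCoordinates(
          args.packages, Option(args.repositories), Option(args.ivyRepoPath))
      }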

@SparkQA commented Feb 26, 2015

Test build #27973 has finished for PR 4754 at commit 43c3cb2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) commented Feb 26, 2015

Jenkins, test this again.

@brkyvz (Contributor, Author) commented Feb 26, 2015

It might not be a flaky test. I might have broken some Yarn feature. I'm going to check once I get home.
On Feb 25, 2015 8:01 PM, "Tathagata Das" notifications@github.com wrote:

Jenkins, test this again.

@tdas (Contributor) commented Feb 26, 2015

Ohh... okay.

@SparkQA commented Feb 26, 2015

Test build #622 has finished for PR 4754 at commit 43c3cb2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Feb 26, 2015

@tdas The latest commit fixed the issue; feel free to test.

@brkyvz changed the title from "[SPARK-5979] Made --package exclusions more refined" to "[SPARK-5979][SPARK-6031][SPARK-6032] Refactoring for --packages -> Move to SparkSubmitDriverBootstrapper" on Feb 26, 2015
@SparkQA commented Feb 26, 2015

Test build #27989 has finished for PR 4754 at commit c73aabe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 26, 2015

Test build #27990 has finished for PR 4754 at commit 7f958c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 26, 2015

Test build #28007 has finished for PR 4754 at commit b7a9e93.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Feb 26, 2015

This passed locally. What the...
On Feb 26, 2015 8:39 AM, "UCB AMPLab" notifications@github.com wrote:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28007/
Test FAILed.



@andrewor14 (Contributor) commented

retest this please

@SparkQA commented Feb 26, 2015

Test build #28011 has finished for PR 4754 at commit b7a9e93.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 26, 2015

Test build #28015 has finished for PR 4754 at commit 994869e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 26, 2015

Test build #28019 has finished for PR 4754 at commit 44dbf67.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Feb 26, 2015

Flaky test this time... @tdas, can you have this retested please?

@srowen (Member) commented Feb 26, 2015

Jenkins, retest this please

@SparkQA commented Feb 27, 2015

Test build #28023 has finished for PR 4754 at commit 44dbf67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz changed the title from "[SPARK-5979][SPARK-6031][SPARK-6032] Refactoring for --packages -> Move to SparkSubmitDriverBootstrapper" to "[SPARK-5979][SPARK-6031][SPARK-6032][SPARK-6047] Refactoring for --packages -> Move to SparkSubmitDriverBootstrapper" on Feb 27, 2015
@brkyvz (Contributor, Author) commented Feb 27, 2015

@srowen Thank you!

@tdas (Contributor) commented Feb 27, 2015

@brkyvz I think you need to address a couple more JIRAs in this PR. 4 ain't enough ;)

newClasspath += sys.props("path.separator") +
  resolvedMavenCoordinates.mkString(sys.props("path.separator"))
submitArgs =
  Array("--packages-resolved", resolvedMavenCoordinates.mkString(",")) ++ submitArgs
Inline review comment (Contributor):

Can we thread this through using an environment variable _PACKAGES_RESOLVED? Having this as an extra flag forces you to make args here mutable, which is sort of strange.
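
For reference, a sketch of that suggestion; the name _PACKAGES_RESOLVED comes from the comment above, and the surrounding names are illustrative:

    // In SparkSubmitDriverBootstrapper: hand the resolved jars to the child
    // process through its environment, so submitArgs can stay immutable.
    val resolved = resolvedMavenCoordinates.mkString(",")
    val builder = new ProcessBuilder(javaCommand: _*)  // javaCommand: the child JVM command line
    builder.environment().put("_PACKAGES_RESOLVED", resolved)

    // In SparkSubmit: read the variable back instead of parsing an extra flag.
    val preResolved: Option[String] = sys.env.get("_PACKAGES_RESOLVED")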

asfgit pushed a commit that referenced this pull request Feb 28, 2015

These are the safer parts of PR #4754:
 - SPARK-5979: All dependencies with the groupId `org.apache.spark` passed through `--packages`, were being excluded from the dependency tree on the assumption that they would be in the assembly jar. This is not the case, therefore the exclusion rules had to be defined more explicitly.
 - SPARK-6032: Ivy prints a whole lot of logs while retrieving dependencies. These were printed to `System.out`. Moved the logging to `System.err`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4802 from brkyvz/simple-streaming-fix and squashes the following commits:

e0f38cb [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into simple-streaming-fix
bad921c [Burak Yavuz] [SPARK-5979][SPARK-6032] Smaller safer fix

(cherry picked from commit 6d8e5fb)
Signed-off-by: Patrick Wendell <patrick@databricks.com>

asfgit pushed a commit that referenced this pull request Feb 28, 2015 (the original commit 6d8e5fb, with the same message as above).

@pwendell (Contributor) commented

@brkyvz let's close this issue for now and keep it in our back pocket. We can use it if we decide to put this in the 1.3 branch down the line.

@brkyvz closed this on Feb 28, 2015
@brkyvz deleted the streaming-dependency-fix branch on February 3, 2019