Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23654][BUILD] remove jets3t as a dependency of spark #21146

Closed

Conversation

steveloughran
Copy link
Contributor

@steveloughran steveloughran commented Apr 24, 2018

What changes were proposed in this pull request?

With the update of bouncy-castle JAR in Spark 2.3; jets3t doesn't work any more; hence the s3n
connector to S3 is dead. Only one person has noticed so far. The hadoop s3n connector is never going to be updated to work with a later version of jets3t, instead the code has just been cut from hadoop 3.x entirely..

This patch remove the declarations of jets3t from the POMs which include it (root and spark-core), so it is not being packaged up.

Of the transitive dependencies, one is pinned at the same version, the others removed.

  • javax.activation is kept at version 1.1.1. by an explicit import; without this it would be downgraded to version 1.0.
  • bcprov-jdk15on-1.58.jar, base64-2.3.8.jar, java-xmlbuilder-1.1.jar are all removed. They are not directly used in Spark.

How was this patch tested?

Seeing what jenkins has to say about the missing/changed dependencies.

@SparkQA
Copy link

SparkQA commented Apr 24, 2018

Test build #89800 has finished for PR 21146 at commit c011acc.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Copy link
Contributor Author

As promised, dependencies fail

diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/pr-deps/spark-deps-hadoop-2.6
index 32b2e4f..609eeb9 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/pr-deps/spark-deps-hadoop-2.6
@@ -1,7 +1,7 @@
 JavaEWAH-0.3.2.jar
 RoaringBitmap-0.5.11.jar
 ST4-4.0.4.jar
-activation-1.1.1.jar
+activation-1.1.jar
 aircompressor-0.8.jar
 antlr-2.7.7.jar
 antlr-runtime-3.4.jar
@@ -21,8 +21,6 @@ automaton-1.11-8.jar
 avro-1.7.7.jar
 avro-ipc-1.7.7.jar
 avro-mapred-1.7.7-hadoop2.jar
-base64-2.3.8.jar
-bcprov-jdk15on-1.58.jar
 bonecp-0.8.0.RELEASE.jar
 breeze-macros_2.11-0.13.2.jar
 breeze_2.11-0.13.2.jar
@@ -101,7 +99,6 @@ jackson-module-paranamer-2.7.9.jar
 jackson-module-scala_2.11-2.6.7.1.jar
 jackson-xc-1.9.13.jar
 janino-3.0.8.jar
-java-xmlbuilder-1.1.jar
 javassist-3.18.1-GA.jar
 javax.annotation-api-1.2.jar
 javax.inject-1.jar

@srowen
Copy link
Member

srowen commented Jul 2, 2018

Seems OK to remove if it's really not being used, except that it means it's not leaking onto the user classpath anymore, right? that is, this can potentially cause user code to stop working if this goes in 2.4, or am I missing something? Could go in for 3.0. The other dependency changes seem OK.

@steveloughran
Copy link
Contributor Author

It's not going to stop user code from working as the bouncy castle version on the classpath means that Jets3t doesn't actually work.

The fact that nobody has complained about this must count as a metric of how many people use jets3t :)

More of an issue is the fact that the httpclient in 2.3+ isn't compatible with the AWS SDK in Hadoop 2.8.x. ...

@steveloughran
Copy link
Contributor Author

The transitive dependencies are a separate issue. Jets3t pulls in 3 JARs which nothing else seems to need, but which transitively go onto the spark CP

downgraded

  • javax.activation : could/should be bumped up to 1.1.1 again;

missing

  • bcprov-jdk15on-1.58.jar ? really part of bouncy castle
  • base64-2.3.8.jar, There are so many base-64 encoders on the average classpath nobody will be short of that one
  • java-xmlbuilder-1.1.jar ? there are newer versions; removing it and letting recipients choose their own would be wiser

@srowen
Copy link
Member

srowen commented Jul 2, 2018

It seems reasonable to proceed to remove this, and just reflect the rest of the dep changes accordingly.

I wonder if updating bcprov actually makes it work or something?

Or, could one say that updating bouncy castle at some point was an error that caused this to stop working? if it was long ago, well, maybe no use in undoing it. But if it happened very recently maybe we need to revisit.

@steveloughran
Copy link
Contributor Author

There's usually good reasons for upgrading crypto stuff like bouncy castle; nothing to feel bad about.

How about I take this patch & add the explicit activation 1.1.1 ref to reinstate it, leave the rest out?

@srowen
Copy link
Member

srowen commented Jul 3, 2018

I guess leave a note about why the activation artifact is managed there. Safer to keep it I guess. That artifact will almost surely not change.

@steveloughran
Copy link
Contributor Author

BTW, the activation framework (primariy used for some mime type stuff) is still being developed, now on github at @javaee https://github.com/javaee/activation. At least it is being maintained

@steveloughran steveloughran changed the title [SPARK-23654][BUILD][WiP] remove jets3t as a dependency of spark [SPARK-23654][BUILD] remove jets3t as a dependency of spark Jul 4, 2018
@steveloughran
Copy link
Contributor Author

OK, I've reinstated javax.activation 1.1.1 as an export from spark core (over v 1.1), point to this JIRA in the comments, and updated the -deps lists to remove the others.

Removing jets3t is the right thing to do. It's never going to work again & while the AWS SDK is equally troublesome to keep up to date, the hadoop-aws's move to a shaded JAR removes transitive dependency conflict as a source of friction and pain.

pom.xml Outdated
@@ -141,7 +141,7 @@
<codahale.metrics.version>3.1.5</codahale.metrics.version>
<avro.version>1.7.7</avro.version>
<avro.mapred.classifier>hadoop2</avro.mapred.classifier>
<jets3t.version>0.9.4</jets3t.version>
<javax.activation.version>1.1.1</javax.activation.version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine; if it's a version only ever used once, I also find it OK to inline to avoid one level of indirection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll inline. It's not like its one of those dependencies that anyone doing their own build will ever want to change.

@@ -21,8 +21,6 @@ automaton-1.11-8.jar
avro-1.7.7.jar
avro-ipc-1.7.7.jar
avro-mapred-1.7.7-hadoop2.jar
base64-2.3.8.jar
bcprov-jdk15on-1.58.jar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so bouncy castle is also removed? Maybe I miss what bcprov is, but I think it's the "bouncycastle provider". If so, the license in licenses-binary/ and in LICENSE-binary should be removed. And there's an entry separately in the parent POM that brings this in on account of jets3t, it says. That can go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there's nothing else for bouncy castle then it's cleaner for classpaths if it goes, as for imports.

looking at Hadoop, it it pulls in bcprov-jdk16, but only for testing. and with out any explicit calls on bouncy castle APIs that I can see. Presumably its registered as a crypto

     <dependency>
       <groupId>org.bouncycastle</groupId>
       <artifactId>bcprov-jdk16</artifactId>
       <version>1.46</version>
       <scope>test</scope>
     </dependency>

If it is going, I should update this PR & JIRA name to make clear it is gone

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like it's already been removed, so the other 'remnants' should go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steveloughran I'll merge this if you can remove the bouncy castle license info and take one more pass at comments here.

@SparkQA
Copy link

SparkQA commented Jul 4, 2018

Test build #92614 has finished for PR 21146 at commit 965c73c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 11, 2018

@steveloughran since I wanted to expedite this -- it relates to ECCNs which I'm fixing up -- I opened #22081 WDYT?

@steveloughran steveloughran force-pushed the cloud/SPARK-23654-jets3t branch from 965c73c to 9adc0ec Compare August 13, 2018 18:11
@steveloughran
Copy link
Contributor Author

closing as #22081 supercedes it

@SparkQA
Copy link

SparkQA commented Aug 13, 2018

Test build #94702 has finished for PR 21146 at commit 9adc0ec.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@steveloughran
Copy link
Contributor Author

FYI, I just did a kinesis test run with this PR on a JVM with the unlimited JCE installed (explicitly verified by shasum of the relevant JARs); failure with cert errors.

success on second attempt after a Kinesis throttling exception
- retry success on second attempt after a Kinesis dependency exception
- retry failed after a shutdown exception
- retry failed after an invalid state exception
- retry failed after unexpected exception
- retry failed after exhausting all retries
WithAggregationKinesisBackedBlockRDDSuite:
[2018-08-13 22:01:12.872175] [0x000070000b6f9000] [info] [kinesis_producer.cc:79] Created pipeline for stream "KinesisTestUtils-3116048591482212471"
[2018-08-13 22:01:12.872918] [0x000070000b6f9000] [info] [shard_map.cc:83] Updating shard map for stream "KinesisTestUtils-3116048591482212471"
[2018-08-13 22:01:13.040026] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to monitoring.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:13.054034] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:15.943229] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:18.924473] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:21.919673] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:25.121685] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:27.925785] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:30.917030] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:33.960962] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:36.926987] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:39.912528] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
[2018-08-13 22:01:42.911743] [0x000070000b905000] [error] [http_client.cc:148] Failed to open connection to kinesis.us-west-2.amazonaws.com:443 : certificate verify failed
org.apache.spark.streaming.kinesis.WithAggregationKinesisBackedBlockRDDSuite *** ABORTED ***
  java.lang.IllegalArgumentException: requirement failed: Need data to be sent to multiple shards
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDDTests$$anonfun$beforeAll$1.apply$mcV$sp(KinesisBackedBlockRDDSuite.scala:47)
  at org.apache.spark.streaming.kinesis.KinesisFunSuite$class.runIfTestsEnabled(KinesisFunSuite.scala:41)
  at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDDTests.runIfTestsEnabled(KinesisBackedBlockRDDSuite.scala:25)
  at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDDTests.beforeAll(KinesisBackedBlockRDDSuite.scala:42)
  at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
  at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
  at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
  at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
  ...

the assert failure is possibly a followon from the previous problem

@steveloughran
Copy link
Contributor Author

And if you bump up the kinesis client and AWS SDK version to 1.11.271, those failures go away.

Run completed in 15 minutes, 28 seconds.
Total number of tests run: 59
Suites: completed 9, aborted 0
Tests: succeeded 59, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16:09 min
[INFO] Finished at: 2018-08-13T22:35:17-07:00
[INFO] ------------------------------------------------------------------------

This also updates the dev/deps dependency files so it is not expected

Change-Id: I15798710925d378de97523f7f89dbe5bd1cc8582
…e dependencies so the version it transitively exports is the same as the one it used to when jets3t was on the import graph. Remove the other JARs from the deps list.

Change-Id: If22bb1b26381e2e8a3df050a535d644f45d306fe
Change-Id: I1298d6c3063ece8bc86575d781a85ba5309cda39
…cense files

Change-Id: Iaf2e87ca57d46592551cfdbe6d5f8b419a92bcae
…k to match

Change-Id: Ibf4162fca33189086bef234b6752f403a06aa7b0
@steveloughran steveloughran force-pushed the cloud/SPARK-23654-jets3t branch from 853ee1c to 3ac1d63 Compare August 14, 2018 21:06
@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94770 has finished for PR 21146 at commit 3ac1d63.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94769 has finished for PR 21146 at commit 853ee1c.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #4258 has finished for PR 21146 at commit 3ac1d63.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Copy link
Contributor Author

Closing now we have a test run with the combination of: No jets3t, no bouncy castle, upgraded kinesis. all the kinesis tests now run

shaneknapp pushed a commit to shaneknapp/spark that referenced this pull request Aug 15, 2018
…ions

This PR has been superceded by apache#22081

## What changes were proposed in this pull request?

Increment the kinesis client, producer and transient AWS SDK versions to a more recent release.

This is to help with the move off bouncy castle of apache#21146 and apache#22081; the goal is that moving up to the new SDK will allow a JVM with unlimited JCE but without bouncy castle to work with Kinesis endpoints.

Why this specific set of artifacts? it syncs up with the 1.11.271 AWS SDK used by hadoop 3.0.3, hadoop-3.1. and hadoop 3.1.1; that's been stable for the uses there (s3, STS, dynamo).

## How was this patch tested?

Running all the external/kinesis-asl tests via maven with java 8.121 & unlimited JCE, without bouncy castle (apache#21146); default endpoint of us-west.2. Without this SDK update I was getting http cert validation errors, with it they went away.

# This PR is not ready without

* Jenkins test runs to see what it is happy with
* more testing: repeated runs, another endpoint
* looking at the new deprecation warnings and selectively addressing them (the AWS SDKs are pretty aggressive about deprecation, but sometimes they increase the complexity of the client code or block some codepaths off completely)

Closes apache#22099 from steveloughran/cloud/SPARK-25111-kinesis.

Authored-by: Steve Loughran <stevel@hortonworks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
asfgit pushed a commit that referenced this pull request Aug 16, 2018
## What changes were proposed in this pull request?

Remove jets3t dependency, and bouncy castle which it brings in; update licenses and deps
Note this just takes over #21146

## How was this patch tested?

Existing tests.

Closes #22081 from srowen/SPARK-23654.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@steveloughran steveloughran deleted the cloud/SPARK-23654-jets3t branch August 20, 2018 04:51
sumwale pushed a commit to TIBCOSoftware/snappy-spark that referenced this pull request Jun 10, 2022
Remove jets3t dependency, and bouncy castle which it brings in; update licenses and deps
Note this just takes over apache#21146

Existing tests.

Closes apache#22081 from srowen/SPARK-23654.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants