
[WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre #29326


Closed
wants to merge 6 commits

Conversation

viirya
Member

@viirya viirya commented Aug 2, 2020

What changes were proposed in this pull request?

This PR upgrades Guava to the newer 27.0-jre.

Why are the changes needed?

Guava 14.0.1 is pretty old and is among the Guava versions affected by CVE-2018-10237.

All newer Hadoop releases are built with a later Guava version, e.g. 27.0-jre, including Hadoop 3.1.3, 3.2.1, and 3.3.0.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the Jenkins tests.

@SparkQA

SparkQA commented Aug 2, 2020

Test build #126931 has finished for PR 29326 at commit 68375f0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Aug 2, 2020

retest this please

@SparkQA

SparkQA commented Aug 2, 2020

Test build #126935 has finished for PR 29326 at commit 68375f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 2, 2020

Is this duplicated by #29325 ?
Yes, this can only happen for Hadoop 3.2.1+, so it would at best be in the Hadoop 3.2 profile.

@viirya
Member Author

viirya commented Aug 3, 2020

The trouble is that hive-exec uses a method that became package-private in Guava 20, so there is an incompatibility with Guava versions newer than 19.0.

sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)

hive-exec doesn't shade Guava until https://issues.apache.org/jira/browse/HIVE-22126, which targets 4.0.0.
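
For reference, here is a minimal Java sketch of the binary incompatibility (the FetchOperator usage is paraphrased from the stack trace above; the Collections-based replacement is an assumption about how such code could avoid the removed method):

```java
import java.util.Collections;
import java.util.Iterator;

public class EmptyIteratorCompat {
    public static void main(String[] args) {
        // What hive-exec's FetchOperator effectively does. This compiles and runs
        // against Guava 14, but Iterators.emptyIterator() lost its public modifier
        // in Guava 20, so on newer Guava the JVM throws IllegalAccessError at the
        // call site even though the method still exists:
        //
        //   Iterator<Object> it = com.google.common.collect.Iterators.emptyIterator();

        // A Guava-version-agnostic replacement (JDK 7+), which is why recompiling
        // or shading Hive makes the problem go away:
        Iterator<Object> it = Collections.emptyIterator();
        System.out.println(it.hasNext()); // prints: false
    }
}
```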

This seems to be a dead end for upgrading Guava in Spark for now.

@viirya
Member Author

viirya commented Aug 3, 2020

Opened https://issues.apache.org/jira/browse/HIVE-23980 to see if the Hive folks have some ideas.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft August 3, 2020 21:44
@viirya
Member Author

viirya commented Aug 4, 2020

I did some tests. A few changes are required to pass the failing Hive tests:

  1. Shade Guava in hive-exec packaging, plus a few code changes to hive-common and hive-exec regarding Guava usage
  2. Don't use the core classifier for the Hive dependencies in Spark

But this only upgrades the Guava version used in Spark. The Hive dependencies still use the older Guava with the reported CVE.

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 7, 2020

Thank you for the assessment, @viirya. Is there an official plan for the Apache Hive 4.0.0 release?

Actually, this is the 3rd try, after mine and @HyukjinKwon's. So I was curious what has changed since then. At that time, we dropped the old PRs because it was hard to expect a shaded Apache Hive 2.3.8.

Apache Spark just migrated to Apache Hive 2.3. I don't think we can migrate to Apache Hive 4.0.0 in the next year.

cc @gatorsmile

@dongjoon-hyun
Member

BTW, shall we close this for now? You can reopen it later when it's ready.

@viirya
Member Author

viirya commented Aug 7, 2020

@dongjoon-hyun Thanks for the comment. Yeah, it doesn't make sense to upgrade to Hive 4 in the short or mid term. I'm working on upgrading to Guava 27 and shading Guava in Hive too. I hope it can be part of Hive 2.3.8.

I will close this for now. Once the work in Hive makes progress, I can reopen this. Thanks.

@viirya viirya closed this Aug 7, 2020
@dongjoon-hyun
Member

Thank you so much. Yes. I'm looking forward to seeing that~

@danielradulov

Hi guys, I am having problems with Guava on Spark 3.0.0 and 3.0.1 with Hadoop 3.2.1 and Hive 3.1.2.

I am using the Spark Operator developed by Google. Everything seems to work fine except when I try to use Spark integrated with the Hive Metastore. In that case I am facing the following error:

java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument

I have tried several workarounds, like replacing Guava, setting "spark.executor.userClassPathFirst": "true" and "spark.driver.userClassPathFirst": "true" in the client's Spark spec, and shading Guava with maven-shade-plugin, but unfortunately none of these alternatives works properly.

I hope you will be able to upgrade Guava in Spark soon.

Thanks.

@dongjoon-hyun
Member

Thank you for your opinion, @danielradulov. However, it's an Apache Hive issue across Apache Hadoop versions.

except when I try to use Spark integrated with the Hive Metastore.

Apache Hadoop 3.2.1 has a breaking Guava dependency change which breaks most downstream projects. IIRC, there is no official Apache Hive version that works on Apache Hadoop 3.2.1. You had better ask the Apache Hive community for support.

The Apache Spark community tried to upgrade to Apache Hadoop 3.2.1 (Sep. 2019) and gave up because of that.
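
For context, a sketch of the likely mechanism behind that NoSuchMethodError, assuming Hadoop 3.2.x classes compiled against Guava 27 running with Spark's Guava 14 on the classpath: Guava 20 added specialized non-varargs overloads of Preconditions.checkArgument, and javac resolves a call to the most specific overload available at compile time.

```java
import com.google.common.base.Preconditions;

public class CheckArgumentDemo {
    public static void main(String[] args) {
        int port = 8080;
        // Compiled against Guava 20+, this call resolves to a specialized
        // overload such as checkArgument(boolean, String, int), which was
        // added in Guava 20. Run the same class file with Guava 14 (which only
        // has the varargs form) and the JVM throws NoSuchMethodError at this
        // call site -- matching the report above.
        Preconditions.checkArgument(port > 0 && port <= 65535, "port out of range: %s", port);
        System.out.println("port validated: " + port);
    }
}
```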

@dbtsai
Member

dbtsai commented Sep 17, 2020

One question: in https://issues.apache.org/jira/browse/HADOOP-14284 it seems that Hadoop shades the Guava dependency, so why do we run into breaking changes when we upgrade to Hadoop 3.2.1 or Hadoop 3.3?

@viirya
Member Author

viirya commented Sep 17, 2020

Isn't HADOOP-14284 resolved as Invalid?

@dbtsai
Member

dbtsai commented Sep 17, 2020

@viirya you are right. My bad.

@viirya viirya reopened this Jun 29, 2021
@viirya
Member Author

viirya commented Jun 29, 2021

try this again.

@viirya viirya changed the title [WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre and Hadoop to 3.2.1 [WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre Jun 29, 2021
@viirya
Member Author

viirya commented Jun 29, 2021

retest this please

@SparkQA

SparkQA commented Jun 29, 2021

Test build #140400 has started for PR 29326 at commit 4e6da9c.


@viirya
Member Author

viirya commented Jun 30, 2021

retest this please

@dongjoon-hyun dongjoon-hyun left a comment
Member

27.0-jre was released in October 2018. I'm wondering if we still need to use the same version as Hadoop. Since Apache Hadoop shades its Guava dependency and Apache Spark doesn't use it, shall we try the latest one, 30.1.1-jre, instead?

All newer Hadoop releases are built with a later Guava version, e.g. 27.0-jre, including Hadoop 3.1.3, 3.2.1, and 3.3.0.

@viirya
Member Author

viirya commented Jun 30, 2021

I'm not against this point. I can change to the latest Guava and see what the CI tells us.

@dongjoon-hyun
Member

Thanks. Ya, let's try with the latest one.

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44945/

@SparkQA

SparkQA commented Jun 30, 2021

Test build #140430 has finished for PR 29326 at commit 4e6da9c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
  • class SparkIndexOpsMethods(Generic[IndexOpsLike], metaclass=ABCMeta):
  • class RollingAndExpanding(Generic[FrameLike], metaclass=ABCMeta):
  • class RollingLike(RollingAndExpanding[FrameLike]):
  • class Rolling(RollingLike[FrameLike]):
  • class RollingGroupby(RollingLike[FrameLike]):
  • class ExpandingLike(RollingAndExpanding[FrameLike]):
  • class Expanding(ExpandingLike[FrameLike]):
  • class ExpandingGroupby(ExpandingLike[FrameLike]):
  • sealed trait FieldPosition extends LeafExpression with Unevaluable
  • case class UnresolvedFieldPosition(
  • case class ResolvedFieldName(path: Seq[String], field: StructField) extends FieldName
  • case class ResolvedFieldPosition(position: ColumnPosition) extends FieldPosition
  • case class ArraysZip(children: Seq[Expression], names: Seq[Expression])
  • case class AlterTableAlterColumn(
  • case class ShowCreateTableExec(

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44948/

@SparkQA

SparkQA commented Jun 30, 2021

Test build #140434 has finished for PR 29326 at commit 74d5fbd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the SQL label Jul 1, 2021

@viirya
Member Author

viirya commented Jul 1, 2021

retest this please

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45001/

@SparkQA

SparkQA commented Jul 1, 2021

Test build #140501 has finished for PR 29326 at commit d5e8ff8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45011/

@viirya
Member Author

viirya commented Jul 2, 2021

Hmm, from the failed tests below:

org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite
org.apache.spark.sql.hive.HiveExternalCatalogSuite
org.apache.spark.sql.hive.StatisticsSuite

Since Guava 20, com.google.common.collect.Iterators.emptyIterator() is no longer public. But I don't get it: Hive 2.3.8/2.3.9 shades Guava in hive-exec, so why does it pick up the newer Guava upgraded here?

java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:831)

I think I've figured it out. Verifying it locally...

@viirya
Member Author

viirya commented Jul 3, 2021

Encountered some issues.

Although we can switch to hive-exec without the classifier (the shaded version) to get rid of the above Guava version issue, the shaded hive-exec bundles (without relocation) some dependencies, like commons-lang3, orc, and parquet, that are not the same versions as Spark's, so they conflict.

Because the shaded hive-exec jar already includes those dependencies' classes, dependency exclusions in the pom cannot exclude them.
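
One quick way to see the problem (a sketch; the jar path is a hypothetical local location, and the package prefixes are the suspects named above) is to list the class files the shaded hive-exec jar actually bundles. Maven exclusions operate on artifacts, not on class files inside an artifact, so anything printed here stays on the classpath no matter what the pom excludes:

```java
import java.util.jar.JarFile;

public class ListBundledClasses {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the shaded (no-classifier) hive-exec jar.
        try (JarFile jar = new JarFile("hive-exec-2.3.9.jar")) {
            jar.stream()
               .map(e -> e.getName())
               // Unrelocated third-party packages baked into the jar:
               .filter(n -> n.startsWith("org/apache/commons/lang3/")
                         || n.startsWith("org/apache/orc/")
                         || n.startsWith("org/apache/parquet/"))
               .limit(10)
               .forEach(System.out::println);
        }
    }
}
```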

So currently it seems we can only go back to Hive and ask for every included dependency to be shaded? Any other thoughts?

@sunchao
Member

sunchao commented Jul 6, 2021

Oh, I didn't even realize that Spark is using the hive-exec-core jar. Does that mean it doesn't take advantage of the Guava shading from Hive 2.3.8+ at all?

One idea is to have Spark use hadoop-shaded-guava, which is also on 30.1.1-jre. It also makes sure that Spark always uses the same Guava version as Hadoop.
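
For illustration, a minimal sketch of what consuming hadoop-shaded-guava would look like (assuming the org.apache.hadoop.thirdparty relocation prefix used by the hadoop-thirdparty artifacts): because the classes live under a relocated package, they cannot collide with whatever unshaded Guava is on the classpath.

```java
// Requires org.apache.hadoop.thirdparty:hadoop-shaded-guava on the classpath.
import org.apache.hadoop.thirdparty.com.google.common.base.Preconditions;

public class ShadedGuavaDemo {
    public static void main(String[] args) {
        // Same Guava API, but under the relocated package name, so it coexists
        // with any com.google.guava:guava version that Spark or Hive ships.
        Preconditions.checkArgument(args.length >= 0, "impossible: %s", args.length);
        System.out.println("relocated Guava resolves independently of classpath Guava");
    }
}
```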

@viirya
Member Author

viirya commented Jul 6, 2021

Oh, I didn't even realize that Spark is using the hive-exec-core jar. Does that mean it doesn't take advantage of the Guava shading from Hive 2.3.8+ at all?

Yea, I'm afraid that's true. If we want to completely isolate dependencies from Hive, we may need to relocate all the included (but not relocated) dependencies in hive-exec w/o the classifier.

One idea is to have Spark use hadoop-shaded-guava, which is also on 30.1.1-jre. It also makes sure that Spark always uses the same Guava version as Hadoop.

Even if Spark uses hadoop-shaded-guava, hive-exec still needs the older Guava if we cannot use the version w/o the classifier (due to the other dependencies, e.g. commons-lang3, orc, parquet...).

@sunchao
Member

sunchao commented Jul 6, 2021

Hmm, yea, you are right, but shading the other dependencies would require another Hive release, though.

Another thing we could try is to change IsolatedClientLoader#isSharedClass and have the Hive client use its own version of commons-lang3 etc., if there is a conflict.

@viirya
Member Author

viirya commented Jul 7, 2021

Hmm, I looked at isSharedClass; it looks like commons-lang3, orc, etc. are already non-shared classes.

@sunchao
Member

sunchao commented Jul 7, 2021

Yeah, that looks right. It seems that in the spark.sql.hive.metastore.jars=builtin case, the isolated client loader doesn't really provide an isolated classpath for Hive.
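
As a sketch of that distinction (the metastore version and jar directory below are hypothetical): with builtin, the Hive client classes come from Spark's own classpath and therefore see Spark's Guava; pointing the config at an external jar directory is what actually gives the client its own classpath.

```java
import org.apache.spark.sql.SparkSession;

public class HiveMetastoreJarsDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("metastore-jars-demo")
            .config("spark.sql.hive.metastore.version", "2.3.9")
            // The default "builtin" loads Hive classes from Spark's classpath,
            // so there is no real isolation from Spark's Guava. "path" (Spark
            // 3.1+) loads Hive and its dependencies, including its own Guava,
            // from a separate directory via the isolated client loader.
            .config("spark.sql.hive.metastore.jars", "path")
            .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive-2.3.9/lib/*")
            .enableHiveSupport()
            .getOrCreate();
        spark.sql("SHOW DATABASES").show();
    }
}
```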

@viirya
Member Author

viirya commented Sep 14, 2021

#33989 seems a promising direction. Closing this.
