
[WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre #29326


Closed
wants to merge 6 commits

Conversation

viirya
Member

@viirya viirya commented Aug 2, 2020

What changes were proposed in this pull request?

This PR upgrades Guava to the newer 27.0-jre.

Why are the changes needed?

Guava 14.0.1 is pretty old and is among the Guava versions affected by CVE-2018-10237.

All newer Hadoop releases are built with a later Guava version, e.g. 27.0-jre, including Hadoop 3.1.3, 3.2.1, and 3.3.0.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the Jenkins tests.

@SparkQA

SparkQA commented Aug 2, 2020

Test build #126931 has finished for PR 29326 at commit 68375f0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Aug 2, 2020

retest this please

@SparkQA

SparkQA commented Aug 2, 2020

Test build #126935 has finished for PR 29326 at commit 68375f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 2, 2020

Is this duplicated by #29325 ?
Yes, this can only happen for Hadoop 3.2.1+, so it would at best be in the Hadoop 3.2 profile.

@viirya
Member Author

viirya commented Aug 3, 2020

The trouble is that hive-exec uses a method that became package-private in Guava 20, so there is an incompatibility with Guava versions newer than 19.0.

sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)

hive-exec doesn't shade Guava until https://issues.apache.org/jira/browse/HIVE-22126, which targets 4.0.0.
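
For reference, here is a minimal Java sketch of the binary incompatibility (the FetchOperator usage is paraphrased from the stack trace above; the Collections-based replacement is an assumption about how such code could avoid the removed method):

```java
import java.util.Collections;
import java.util.Iterator;

public class EmptyIteratorCompat {
    public static void main(String[] args) {
        // What hive-exec's FetchOperator effectively does. This compiles and runs
        // against Guava 14, but Iterators.emptyIterator() lost its public modifier
        // in Guava 20, so on newer Guava the JVM throws IllegalAccessError at the
        // call site even though the method still exists:
        //
        //   Iterator<Object> it = com.google.common.collect.Iterators.emptyIterator();

        // A Guava-version-agnostic replacement (JDK 7+), which is why recompiling
        // or shading Hive makes the problem go away:
        Iterator<Object> it = Collections.emptyIterator();
        System.out.println(it.hasNext()); // prints: false
    }
}
```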

This seems to be a dead end for upgrading Guava in Spark for now.

@viirya
Member Author

viirya commented Aug 3, 2020

Opened https://issues.apache.org/jira/browse/HIVE-23980 to see if the Hive folks have some ideas.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft August 3, 2020 21:44
@viirya
Member Author

viirya commented Aug 4, 2020

I did some tests. A few changes are required to pass the failing Hive tests:

  1. Shade Guava in hive-exec packaging, plus a few code changes to hive-common and hive-exec regarding Guava usage
  2. Don't use the core classifier for the Hive dependencies in Spark

But this only upgrades the Guava version used in Spark. The Hive dependencies still use the older Guava with the reported CVE.

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 7, 2020

Thank you for the assessment, @viirya. Is there an official plan for the Apache Hive 4.0.0 release?

Actually, this is the 3rd try, after mine and @HyukjinKwon's. So I was curious what has changed since then. At that time, we dropped the old PRs because it was hard to expect a shaded Apache Hive 2.3.8.

Apache Spark just migrated to Apache Hive 2.3. I don't think we can migrate to Apache Hive 4.0.0 in the next year.

cc @gatorsmile

@dongjoon-hyun
Member

BTW, shall we close this for now? You can reopen it later when it's ready.

@viirya
Member Author

viirya commented Aug 7, 2020

@dongjoon-hyun Thanks for the comment. Yeah, it doesn't make sense to upgrade to Hive 4 in the short or mid term. I'm working on upgrading to Guava 27 and shading Guava in Hive too. I hope it can be part of Hive 2.3.8.

I will close this for now. Once the work in Hive makes progress, I can reopen this. Thanks.

@viirya viirya closed this Aug 7, 2020
@dongjoon-hyun
Member

Thank you so much. Yes. I'm looking forward to seeing that~

@danielradulov

Hi guys, I am having problems with Guava on Spark 3.0.0 and 3.0.1 with Hadoop 3.2.1 and Hive 3.1.2.

I am using the Spark Operator developed by Google. Everything seems to work fine except when I try to use Spark integrated with the Hive Metastore. In that case I am facing the following error:

java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument

I have tried several workarounds, like replacing Guava, setting "spark.executor.userClassPathFirst": "true" and "spark.driver.userClassPathFirst": "true" in the client's Spark spec, and shading Guava with maven-shade-plugin, but unfortunately none of these alternatives works properly.

I hope you will be able to upgrade Guava in Spark soon.

Thanks.

@dongjoon-hyun
Member

Thank you for your opinion, @danielradulov. However, it's an Apache Hive issue across Apache Hadoop versions.

except when I try to use Spark integrated with the Hive Metastore.

Apache Hadoop 3.2.1 has a breaking Guava dependency change which breaks most downstream projects. IIRC, there is no official Apache Hive version that works on Apache Hadoop 3.2.1. You had better ask the Apache Hive community for support.

The Apache Spark community tried to upgrade to Apache Hadoop 3.2.1 (Sep. 2019) and gave up because of that.
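
For context, a sketch of the likely mechanism behind that NoSuchMethodError, assuming Hadoop 3.2.x classes compiled against Guava 27 running with Spark's Guava 14 on the classpath: Guava 20 added specialized non-varargs overloads of Preconditions.checkArgument, and javac resolves a call to the most specific overload available at compile time.

```java
import com.google.common.base.Preconditions;

public class CheckArgumentDemo {
    public static void main(String[] args) {
        int port = 8080;
        // Compiled against Guava 20+, this call resolves to a specialized
        // overload such as checkArgument(boolean, String, int), which was
        // added in Guava 20. Run the same class file with Guava 14 (which only
        // has the varargs form) and the JVM throws NoSuchMethodError at this
        // call site -- matching the report above.
        Preconditions.checkArgument(port > 0 && port <= 65535, "port out of range: %s", port);
        System.out.println("port validated: " + port);
    }
}
```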

@dbtsai
Member

dbtsai commented Sep 17, 2020

One question: in https://issues.apache.org/jira/browse/HADOOP-14284 it seems that Hadoop shades the Guava dependency, so why do we run into breaking changes when we upgrade to Hadoop 3.2.1 or Hadoop 3.3?

@viirya
Member Author

viirya commented Sep 17, 2020

Isn't HADOOP-14284 resolved as Invalid?

@dbtsai
Member

dbtsai commented Sep 17, 2020

@viirya you are right. My bad.

@viirya viirya reopened this Jun 29, 2021
@viirya
Member Author

viirya commented Jun 29, 2021

try this again.

@viirya viirya changed the title [WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre and Hadoop to 3.2.1 [WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre Jun 29, 2021
@viirya
Member Author

viirya commented Jun 29, 2021

retest this please

@SparkQA

SparkQA commented Jun 29, 2021

Test build #140400 has started for PR 29326 at commit 4e6da9c.


@viirya
Member Author

viirya commented Jun 30, 2021

retest this please

@dongjoon-hyun dongjoon-hyun left a comment
Member

27.0-jre was released in October 2018. I'm wondering if we still need to use the same version as Hadoop. Since Apache Hadoop shades its Guava dependency and Apache Spark doesn't use it, shall we try the latest one, 30.1.1-jre, instead?

All newer Hadoop releases are built with a later Guava version, e.g. 27.0-jre, including Hadoop 3.1.3, 3.2.1, and 3.3.0.

@viirya
Member Author

viirya commented Jun 30, 2021

I'm not against this point. I can change to the latest Guava and see what the CI tells us.

@dongjoon-hyun
Member

Thanks. Ya, let's try with the latest one.

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44945/

@SparkQA

SparkQA commented Jun 30, 2021

Test build #140430 has finished for PR 29326 at commit 4e6da9c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
  • class SparkIndexOpsMethods(Generic[IndexOpsLike], metaclass=ABCMeta):
  • class RollingAndExpanding(Generic[FrameLike], metaclass=ABCMeta):
  • class RollingLike(RollingAndExpanding[FrameLike]):
  • class Rolling(RollingLike[FrameLike]):
  • class RollingGroupby(RollingLike[FrameLike]):
  • class ExpandingLike(RollingAndExpanding[FrameLike]):
  • class Expanding(ExpandingLike[FrameLike]):
  • class ExpandingGroupby(ExpandingLike[FrameLike]):
  • sealed trait FieldPosition extends LeafExpression with Unevaluable
  • case class UnresolvedFieldPosition(
  • case class ResolvedFieldName(path: Seq[String], field: StructField) extends FieldName
  • case class ResolvedFieldPosition(position: ColumnPosition) extends FieldPosition
  • case class ArraysZip(children: Seq[Expression], names: Seq[Expression])
  • case class AlterTableAlterColumn(
  • case class ShowCreateTableExec(

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44948/

@SparkQA

SparkQA commented Jun 30, 2021

Test build #140434 has finished for PR 29326 at commit 74d5fbd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the SQL label Jul 1, 2021

@viirya
Member Author

viirya commented Jul 1, 2021

retest this please

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45001/

@SparkQA

SparkQA commented Jul 1, 2021

Test build #140501 has finished for PR 29326 at commit d5e8ff8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45011/

@viirya
Member Author

viirya commented Jul 2, 2021

Hmm, from the failed tests below:

org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite
org.apache.spark.sql.hive.HiveExternalCatalogSuite
org.apache.spark.sql.hive.StatisticsSuite

Since Guava 20, com.google.common.collect.Iterators.emptyIterator() is no longer public. But I don't get it: Hive 2.3.8/2.3.9 shades Guava in hive-exec, so why does it pick up the newer Guava upgraded here?

java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:831)

I think I've figured it out. Verifying it locally...

@viirya
Member Author

viirya commented Jul 3, 2021

Encountered some issues.

Although we can switch to hive-exec without the classifier (the shaded version) to get rid of the above Guava version issue, the shaded hive-exec bundles (without relocation) some dependencies, like commons-lang3, orc, and parquet, that are not the same versions as Spark's, so they conflict.

Because the shaded hive-exec jar already includes those dependencies' classes, dependency exclusions in the pom cannot exclude them.
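
One quick way to see the problem (a sketch; the jar path is a hypothetical local location, and the package prefixes are the suspects named above) is to list the class files the shaded hive-exec jar actually bundles. Maven exclusions operate on artifacts, not on class files inside an artifact, so anything printed here stays on the classpath no matter what the pom excludes:

```java
import java.util.jar.JarFile;

public class ListBundledClasses {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the shaded (no-classifier) hive-exec jar.
        try (JarFile jar = new JarFile("hive-exec-2.3.9.jar")) {
            jar.stream()
               .map(e -> e.getName())
               // Unrelocated third-party packages baked into the jar:
               .filter(n -> n.startsWith("org/apache/commons/lang3/")
                         || n.startsWith("org/apache/orc/")
                         || n.startsWith("org/apache/parquet/"))
               .limit(10)
               .forEach(System.out::println);
        }
    }
}
```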

So currently it seems we can only go back to Hive and ask for every included dependency to be shaded? Any other thoughts?

@sunchao
Member

sunchao commented Jul 6, 2021

Oh, I didn't even realize that Spark is using the hive-exec-core jar. Does that mean it doesn't take advantage of the Guava shading from Hive 2.3.8+ at all?

One idea is to have Spark use hadoop-shaded-guava, which is also on 30.1.1-jre. It also makes sure that Spark always uses the same Guava version as Hadoop.
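
For illustration, a minimal sketch of what consuming hadoop-shaded-guava would look like (assuming the org.apache.hadoop.thirdparty relocation prefix used by the hadoop-thirdparty artifacts): because the classes live under a relocated package, they cannot collide with whatever unshaded Guava is on the classpath.

```java
// Requires org.apache.hadoop.thirdparty:hadoop-shaded-guava on the classpath.
import org.apache.hadoop.thirdparty.com.google.common.base.Preconditions;

public class ShadedGuavaDemo {
    public static void main(String[] args) {
        // Same Guava API, but under the relocated package name, so it coexists
        // with any com.google.guava:guava version that Spark or Hive ships.
        Preconditions.checkArgument(args.length >= 0, "impossible: %s", args.length);
        System.out.println("relocated Guava resolves independently of classpath Guava");
    }
}
```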

@viirya
Member Author

viirya commented Jul 6, 2021

Oh, I didn't even realize that Spark is using the hive-exec-core jar. Does that mean it doesn't take advantage of the Guava shading from Hive 2.3.8+ at all?

Yea, I'm afraid that's true. If we want to completely isolate dependencies from Hive, we may need to relocate all the included (but not relocated) dependencies in hive-exec w/o the classifier.

One idea is to have Spark use hadoop-shaded-guava, which is also on 30.1.1-jre. It also makes sure that Spark always uses the same Guava version as Hadoop.

Even if Spark uses hadoop-shaded-guava, hive-exec still needs the older Guava if we cannot use the version w/o the classifier (due to the other dependencies, e.g. commons-lang3, orc, parquet...).

@sunchao
Member

sunchao commented Jul 6, 2021

Hmm, yea, you are right, but shading the other dependencies would require another Hive release, though.

Another thing we could try is to change IsolatedClientLoader#isSharedClass and have the Hive client use its own version of commons-lang3 etc., if there is a conflict.

@viirya
Member Author

viirya commented Jul 7, 2021

Hmm, I looked at isSharedClass; it looks like commons-lang3, orc, etc. are already non-shared classes.

@sunchao
Member

sunchao commented Jul 7, 2021

Yeah, that looks right. It seems that in the spark.sql.hive.metastore.jars=builtin case, the isolated client loader doesn't really provide an isolated classpath for Hive.
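
As a sketch of that distinction (the metastore version and jar directory below are hypothetical): with builtin, the Hive client classes come from Spark's own classpath and therefore see Spark's Guava; pointing the config at an external jar directory is what actually gives the client its own classpath.

```java
import org.apache.spark.sql.SparkSession;

public class HiveMetastoreJarsDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("metastore-jars-demo")
            .config("spark.sql.hive.metastore.version", "2.3.9")
            // The default "builtin" loads Hive classes from Spark's classpath,
            // so there is no real isolation from Spark's Guava. "path" (Spark
            // 3.1+) loads Hive and its dependencies, including its own Guava,
            // from a separate directory via the isolated client loader.
            .config("spark.sql.hive.metastore.jars", "path")
            .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive-2.3.9/lib/*")
            .enableHiveSupport()
            .getOrCreate();
        spark.sql("SHOW DATABASES").show();
    }
}
```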

@viirya
Member Author

viirya commented Sep 14, 2021

#33989 seems a promising direction. Closing this.
