[WIP][SPARK-27733][CORE] Upgrade Avro to 1.9.2 #27609


Closed
wants to merge 3 commits

Conversation

iemejia
Member

@iemejia iemejia commented Feb 17, 2020

What changes were proposed in this pull request?

This PR upgrades Avro to 1.9.2.

Why are the changes needed?

Because Spark lags behind major improvements and cleanups in Avro, and the upgrade can also remove some extra dependencies (e.g. paranamer, and possibly the old versions of Jackson that have security vulnerabilities and are still present in Avro 1.8.x and 1.7.x).

Also, Parquet 1.11.0 requires Avro 1.9.x at runtime, so staying on the older Avro can cause further issues. For reference: #26804

Does this PR introduce any user-facing change?

Unknown

How was this patch tested?

Partially; some parts are still failing, so for the moment this is a WIP to gather feedback, especially on the situation with the transitive Hive dependencies. PTAL at the Jira ticket SPARK-27733 for more details on the dependency issues.

What has been done so far?

  • Upgrade the Avro version to 1.9.2
  • Add an explicit xz dependency to spark-avro (it is no longer pulled in
    transitively since Avro 1.9.x)
  • Remove avro-ipc, because it is a transitive dependency of avro-mapred and no
    code in Spark should be using it directly.
    The dependency of avro-mapred on the avro-ipc tests artifact was removed in
    Avro 1.8.x, so it was probably still there only for compatibility with old Hive
  • Remove avro.mapred.classifier, because it no longer exists in Avro 1.9.x
  • Remove paranamer, because it was dropped from Avro 1.9.x after the move to Java 8,
    and jackson-module-paranamer no longer appears in the code base. (Note: I could not
    get rid of the paranamer entries in dev/deps because they come in transitively
    from jackson-module-scala_2.11.)
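
The build changes above would look roughly like the following pom.xml fragment. This is a hedged sketch, not the actual diff: the property name avro.version is illustrative, the xz version shown is an assumption, and Spark's real build splits these declarations across the parent and module POMs.

```xml
<!-- Illustrative sketch of the dependency changes described above. -->
<properties>
  <avro.version>1.9.2</avro.version>
</properties>

<dependencies>
  <!-- avro-mapred no longer uses the hadoop2 classifier in Avro 1.9.x,
       and avro-ipc arrives transitively, so neither is declared explicitly. -->
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-mapred</artifactId>
    <version>${avro.version}</version>
  </dependency>
  <!-- xz is no longer a transitive dependency of Avro 1.9.x, so spark-avro
       declares it explicitly to keep XZ-compressed Avro files working.
       The version here is a placeholder. -->
  <dependency>
    <groupId>org.tukaani</groupId>
    <artifactId>xz</artifactId>
    <version>1.8</version>
  </dependency>
</dependencies>
```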

@wangyum
Member

wangyum commented Feb 18, 2020

ok to test.

@SparkQA

SparkQA commented Feb 18, 2020

Test build #118608 has finished for PR 27609 at commit d4b1bda.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @iemejia. Could you fix the PR description? This PR is about Avro instead of Parquet. :)

What changes were proposed in this pull request?

This PR upgrade parquet to 1.11.0.

Member

@dongjoon-hyun dongjoon-hyun left a comment


BTW, for the build failure, please do the following in your branch to update the dependency manifest.

dev/test-dependencies.sh --replace-manifest

@iemejia iemejia changed the title [WIP][SPARK-27733][CORE] Upgrade Avro to version 1.9.2 [WIP][SPARK-27733][CORE] Upgrade Avro to 1.9.2 Feb 18, 2020
@iemejia iemejia force-pushed the SPARK-27733-avro-upgrade branch from d4b1bda to f667d62 Compare February 18, 2020 10:19
@iemejia
Member Author

iemejia commented Feb 18, 2020

Thanks @dongjoon-hyun, I am new to the Spark dev process, so any hints/feedback are greatly appreciated.

@SparkQA

SparkQA commented Feb 18, 2020

Test build #118628 has finished for PR 27609 at commit f667d62.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 18, 2020

Test build #118636 has finished for PR 27609 at commit d1c7894.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Contributor

Fokko commented Feb 19, 2020

Hive still depends on an older version of Avro:

sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.NoSuchMethodError: org.apache.avro.Schema.getJsonProp(Ljava/lang/String;)Lorg/codehaus/jackson/JsonNode;
	at org.apache.hadoop.hive.serde2.avro.SchemaToTypeInfo.generateTypeInfo(SchemaToTypeInfo.java:139)
	at org.apache.hadoop.hive.serde2.avro.SchemaToTypeInfo.generateTypeInfoWorker(SchemaToTypeInfo.java:194)
	at org.apache.hadoop.hive.serde2.avro.SchemaToTypeInfo.access$000(SchemaToTypeInfo.java:46)
	at org.apache.hadoop.hive.serde2.avro.SchemaToTypeInfo$1.makeInstance(SchemaToTypeInfo.java:119)
	at org.apache.hadoop.hive.serde2.avro.SchemaToTypeInfo$1.makeInstance(SchemaToTypeInfo.java:114)
	at org.apache.hadoop.hive.serde2.avro.InstanceCache.retrieve(InstanceCache.java:65)
	at org.apache.hadoop.hive.serde2.avro.SchemaToTypeInfo.generateTypeInfo(SchemaToTypeInfo.java:186)
	at org.apache.hadoop.hive.serde2.avro.SchemaToTypeInfo.generateColumnTypes(SchemaToTypeInfo.java:108)
	at org.apache.hadoop.hive.serde2.avro.SchemaToTypeInfo.generateColumnTypes(SchemaToTypeInfo.java:87)
	at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.<init>(AvroObjectInspectorGenerator.java:53)
	at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:131)
	at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
	at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)
	at org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641)
	at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624)
	at org.apache.spark.sql.hive.HiveUtils$.inferSchema(HiveUtils.scala:497)
	at org.apache.spark.sql.hive.ResolveHiveSerdeTable$$anonfun$apply$1.applyOrElse(HiveStrategies.scala:100)
	at org.apache.spark.sql.hive.ResolveHiveSerdeTable$$anonfun$apply$1.applyOrElse(HiveStrategies.scala:88)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
	at org.apache.spark.sql.hive.ResolveHiveSerdeTable.apply(HiveStrategies.scala:88)
	at org.apache.spark.sql.hive.ResolveHiveSerdeTable.apply(HiveStrategies.scala:42)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:143)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:140)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:132)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:132)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:175)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:169)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:129)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:111)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:111)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:153)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:152)
	at org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:606)
	at org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:589)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:88)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:762)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
	at org.apache.spark.sql.hive.test.TestHiveSparkSession.sql(TestHive.scala:238)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:550)
	at org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$111(VersionsSuite.scala:949)
	at org.apache.spark.sql.hive.client.VersionsSuite.withTable(VersionsSuite.scala:65)
	at org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$110(VersionsSuite.scala:938)
	at org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$110$adapted(VersionsSuite.scala:937)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.hive.client.VersionsSuite.$anonfun$new$109(VersionsSuite.scala:937)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
	at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
	at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
	at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
	at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:58)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:58)
	at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
	at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
	at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
	at org.scalatest.Suite.run(Suite.scala:1124)
	at org.scalatest.Suite.run$(Suite.scala:1106)
	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
	at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
	at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
	at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:58)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
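
The NoSuchMethodError above is a binary-compatibility break: Avro 1.9.x removed the org.codehaus.jackson types from its public API, including Schema#getJsonProp, while Hive's serde was compiled against it. A minimal sketch of the old call site and the replacement available since Avro 1.8, Schema#getObjectProp (illustrative helper name; assumes Avro on the classpath):

```scala
import org.apache.avro.Schema

// Pre-1.9 Hive code was compiled against this method, which returned an
// org.codehaus.jackson.JsonNode and was removed in Avro 1.9.x:
//   val node = schema.getJsonProp("logicalType")  // NoSuchMethodError on 1.9.x

// The 1.8+ replacement returns a plain Object (String, Number, Map, ...),
// with no Jackson types in the signature:
def logicalTypeName(schema: Schema): Option[String] =
  Option(schema.getObjectProp("logicalType")).map(_.toString)
```

Because the break is in the method signature itself, no classpath juggling on Spark's side can fix it; the Hive serde code has to be recompiled against the new Avro API, which is why the discussion below turns to upgrading Avro inside Hive first.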

@iemejia
Member Author

iemejia commented Feb 19, 2020

Yes @Fokko, and that's something I don't know if we can work around, because this is a transitive dependency. That's the whole point of my argument for upgrading the dependency in Hive and catching up with the upgraded version here in Spark: HIVE-21737. But then we will still need the same kind of fix for the fork of Hive 1.x that Spark depends on (I don't even know where the code for that one lives). Or am I missing a better fix?

@dongjoon-hyun
Member

Thank you for making a PR, @iemejia .
However, it seems that this PR is still under development, because it didn't pass the unit tests.
I'll close this PR for now. Could you reopen it after you make all UTs pass locally?

@iemejia
Member Author

iemejia commented Feb 19, 2020

Thanks @dongjoon-hyun, my goal with this PR was to share the work and show you the issues, since this is definitely out of my hands, as you can see. I hope you or someone else on the Spark PMC has contacts among the Hive people to see if we can untangle this mess together. Don't hesitate to ping me if you need me to reopen this or join that discussion.

@dongjoon-hyun
Member

Thanks, @iemejia .

@Karl-WangSK
Contributor

Any updates on the Avro version in Spark?

@iemejia
Member Author

iemejia commented Oct 30, 2020

There has been recent progress, but we still need a release of the Hive dependencies. If you want to follow the 'action', more details are here: https://issues.apache.org/jira/browse/HIVE-21737

@wangyum
Member

wangyum commented Jan 18, 2021

@iemejia We can now upgrade Avro since the built-in Hive has been upgraded to 2.3.8.

@dongjoon-hyun
Member

+1 for @wangyum's comment.

@iemejia
Member Author

iemejia commented Jan 18, 2021

Thanks for the heads-up @wangyum. I am going to reopen and rebase this; let's see...

@iemejia
Member Author

iemejia commented Jan 18, 2021

I opened #31232, which takes some of your changes from #30517 @wangyum, plus some missing ones of mine. Let's see if that one goes green.
