[DO-NOT-MERGE] Test compatibility for Parquet 1.11.1, Avro 1.10.1 and Hive 2.3.8 #30517

wangyum · 2020-11-26T13:37:14Z

What changes were proposed in this pull request?

Test compatibility for Parquet 1.11.1, Avro 1.10.1 and Hive 2.3.8.
Benchmark for the Parquet column index: see #30517 (comment).

Building a runnable distribution to test:

git clone https://github.com/apache/spark.git && cd spark
git fetch origin pull/30517/head:parquet-1.11.1
git checkout parquet-1.11.1

./dev/make-distribution.sh --name parquet-1.11.1 --tgz -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes -Phadoop-2.7
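
Once the build finishes, it can be worth confirming that the distribution really bundles the intended library versions. The following is a minimal sketch and not part of this PR: `list_dep_jars` is a hypothetical helper, and it assumes the build leaves its jars in a `dist/jars` directory under the Spark checkout, with the usual `<artifact>-<version>.jar` file names.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): list the Parquet, Avro and Hive
# jars found in a directory, one file name per line, sorted.
list_dep_jars() {
  jars_dir="$1"
  # Only file names are inspected; the jars themselves are not opened.
  ls "$jars_dir" | grep -E '^(parquet|avro|hive)-.*\.jar$' | sort
}

# Assumed location of the jars after make-distribution.sh has run.
if [ -d dist/jars ]; then
  list_dep_jars dist/jars
fi
```

If the upgrade took effect, the listed file names should carry the versions named in the PR title; any older version showing up would suggest a dependency override was missed.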

wangyum · 2020-11-26T14:02:00Z

retest this please.

SparkQA · 2020-11-26T14:12:32Z

Test build #131846 has finished for PR 30517 at commit 129b468.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-26T14:37:04Z

Test build #131847 has finished for PR 30517 at commit 23b0bba.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-26T15:28:42Z

Test build #131849 has finished for PR 30517 at commit d0573a3.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-11-26T15:38:37Z

Failed to download Hive 2.3.8 from my GitHub repository:

Downloading from github: https://maven.pkg.github.com/wangyum/hive/org/apache/hive/hive-common/2.3.8-SNAPSHOT/maven-metadata.xml
Downloading from apache.snapshots: https://repository.apache.org/snapshots/org/apache/hive/hive-common/2.3.8-SNAPSHOT/maven-metadata.xml
[WARNING] Could not transfer metadata org.apache.hive:hive-common:2.3.8-SNAPSHOT/maven-metadata.xml from/to github (https://maven.pkg.github.com/wangyum/hive): Authentication failed for https://maven.pkg.github.com/wangyum/hive/org/apache/hive/hive-common/2.3.8-SNAPSHOT/maven-metadata.xml 401 Unauthorized

wangyum · 2020-11-26T23:34:15Z

- create hive serde table with Catalog
*** RUN ABORTED ***
  java.lang.NoSuchMethodError: 'void org.apache.avro.Schema$Field.<init>(java.lang.String, org.apache.avro.Schema, java.lang.String, org.codehaus.jackson.JsonNode)'
  at org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.createAvroField(TypeInfoToSchema.java:76)
  at org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.convert(TypeInfoToSchema.java:61)
  at org.apache.hadoop.hive.serde2.avro.AvroSerDe.getSchemaFromCols(AvroSerDe.java:170)
  at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:114)
  at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
  at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
  at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
  at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)

dongjoon-hyun · 2020-11-27T01:23:34Z

Can we try with Apache Parquet 1.11.1?

wangyum · 2020-11-27T01:54:33Z

> Can we try with Apache Parquet 1.11.1?

Yes, it's Parquet 1.11.1. I forgot to change the title.

wangyum · 2020-11-27T03:16:06Z

- alter hive serde table add columns -- partitioned - AVRO *** FAILED ***
  org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.avro.AvroRuntimeException: Unknown datum class: class org.codehaus.jackson.node.NullNode;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
  at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:346)
  at org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:166)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3680)

SparkQA · 2020-11-27T07:50:35Z

Test build #131867 has finished for PR 30517 at commit 5937a41.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-27T08:36:11Z

Test build #131869 has finished for PR 30517 at commit 8815b0c.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

sunchao · 2020-11-28T19:01:36Z

Thanks @wangyum for working on this! Did you encounter any other issues besides the NullNode one?

wangyum · 2020-11-29T09:43:52Z

sql/hive, sql/thriftserver and external/avro should be fine. sql/core has some issues, e.g.:

mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite test

- Spark vectorized reader - with partition data column - select nullable complex field and having is not null predicate *** FAILED ***
  Results do not match for query:
  Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
  Timezone Env:

  == Parsed Logical Plan ==
  'Project ['employer.company]
  +- 'Filter (isnotnull('employer) AND ('p = 1))
     +- 'UnresolvedRelation [contacts], [], false

  == Analyzed Logical Plan ==
  company: struct<name:string,address:string>
  Project [employer#7739.company AS company#7772]
  +- Filter (isnotnull(employer#7739) AND (p#7741 = 1))
     +- SubqueryAlias contacts
        +- RelationV2[id#7733, name#7734, address#7735, pets#7736, friends#7737, relatives#7738, employer#7739, relations#7740, p#7741] parquet file:/root/opensource/spark/sql/core/target/tmp/spark-bdb1b34b-cf6a-462d-8caa-fcd923df3fe3/contacts

  == Optimized Logical Plan ==
  Project [employer#7739.company AS company#7772]
  +- Filter isnotnull(employer#7739)
     +- RelationV2[employer#7739, p#7741] parquet file:/root/opensource/spark/sql/core/target/tmp/spark-bdb1b34b-cf6a-462d-8caa-fcd923df3fe3/contacts

  == Physical Plan ==
  *(1) Project [employer#7739.company AS company#7772]
  +- *(1) Filter isnotnull(employer#7739)
     +- BatchScan[employer#7739, p#7741] ParquetScan DataFilters: [isnotnull(employer#7739)], Format: parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/sql/core/target/tmp/spark-bdb1b34b-cf6a-462d-8caa-f..., PartitionFilters: [isnotnull(p#7741), (p#7741 = 1)], PushedFilers: [IsNotNull(p), EqualTo(p,1)], ReadSchema: struct<employer:struct<company:struct<name:string,address:string>>>, PushedFilters: [IsNotNull(p), EqualTo(p,1)]

  == Results ==

  == Results ==
  !== Correct Answer - 2 ==      == Spark Answer - 0 ==
   struct<>                      struct<>
  ![[abc,123 Business Street]]
  ![null] (QueryTest.scala:243)

wangyum · 2020-12-02T01:28:24Z

retest this please.

SparkQA · 2020-12-02T05:50:31Z

Test build #132010 has finished for PR 30517 at commit c617757.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-02T14:28:41Z

Test build #132027 has finished for PR 30517 at commit 26badc4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-12-02T23:40:07Z

- support user provided non-nullable avro schema for nullable catalyst schema without any null record *** FAILED ***
  "Job aborted due to stage failure: Task 1 in stage 131.0 failed 1 times, most recent failure: Lost task 1.0 in stage 131.0 (TID 238) (192.168.10.30 executor driver): org.apache.spark.SparkException: Task failed while writing rows.
  	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296)
  	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
  	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  	at org.apache.spark.scheduler.Task.run(Task.scala:131)
  	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
  	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
  	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  	at java.lang.Thread.run(Thread.java:748)

SparkQA · 2020-12-03T03:16:39Z

Test build #132084 has started for PR 30517 at commit 795c276.

SparkQA · 2020-12-03T04:14:49Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36683/

SparkQA · 2020-12-03T05:40:15Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36683/

SparkQA · 2020-12-03T08:57:04Z

Test build #132075 has finished for PR 30517 at commit c4068ce.

This patch fails from timeout after a configured wait of 500m.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2020-12-03T16:25:11Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36744/

heuermh · 2021-01-06T19:24:17Z

With the changes in bigdatagenomics/adam#2289 to remove various workarounds, this pull request works for us.

Non-binding +1

wangyum · 2021-01-08T09:20:40Z

retest this please

SparkQA · 2021-01-08T10:11:27Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38422/

SparkQA · 2021-01-08T10:35:57Z

Test build #133833 has finished for PR 30517 at commit 7ffbd9d.

This patch fails to build.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2021-01-08T10:38:53Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38422/

SparkQA · 2021-01-08T13:03:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38430/

SparkQA · 2021-01-08T13:30:42Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38430/

SparkQA · 2021-01-08T13:57:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38431/

SparkQA · 2021-01-08T13:57:36Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38431/

SparkQA · 2021-01-08T15:19:46Z

Test build #133841 has finished for PR 30517 at commit 3a2a6ad.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2021-01-08T15:39:43Z

Test build #133842 has finished for PR 30517 at commit 9425b1b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-09T05:00:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38451/

SparkQA · 2021-01-09T05:29:13Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38451/

SparkQA · 2021-01-09T07:21:10Z

Test build #133862 has finished for PR 30517 at commit c1e98af.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

### What changes were proposed in this pull request?

Hive 2.3.8 changes:

- HIVE-19662: Upgrade Avro to 1.8.2
- HIVE-24324: Remove deprecated API usage from Avro
- HIVE-23980: Shade Guava from hive-exec in Hive 2.3
- HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue
- HIVE-24512: Exclude calcite in packaging.
- HIVE-22708: Fix for HttpTransport to replace String.equals
- HIVE-24551: Hive should include transitive dependencies from calcite after shading it
- HIVE-24553: Exclude calcite from test-jar dependency of hive-exec

### Why are the changes needed?

Upgrade Avro and Parquet to the latest version.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests, plus a trial upgrade of Parquet to 1.11.1 and Avro to 1.10.1 in #30517.

Closes #30657 from wangyum/SPARK-33696.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>


github-actions bot added BUILD CORE SQL STRUCTURED STREAMING labels Nov 26, 2020

wangyum changed the title ~~[DO-NOT-MERGE] Test compatibility for Parquet 1.11.0, Avro 1.10.0 and Hive 2.3.8~~ [DO-NOT-MERGE][test-maven] Test compatibility for Parquet 1.11.0, Avro 1.10.0 and Hive 2.3.8 Nov 26, 2020

wangyum changed the title ~~[DO-NOT-MERGE][test-maven] Test compatibility for Parquet 1.11.0, Avro 1.10.0 and Hive 2.3.8~~ [DO-NOT-MERGE][test-maven] Test compatibility for Parquet 1.11.1, Avro 1.10.0 and Hive 2.3.8 Nov 27, 2020

wangyum marked this pull request as draft November 27, 2020 05:51

github-actions bot added the AVRO label Dec 3, 2020

heuermh mentioned this pull request Jan 6, 2021

HIVE-21737: Upgrade Avro to version 1.10.1 apache/hive#1806

Closed

wangyum added 2 commits January 8, 2021 20:10

exclusion

3a2a6ad

Merge remote-tracking branch 'upstream/master' into parquet-avro-hive

9425b1b

# Conflicts:
#	dev/deps/spark-deps-hadoop-2.7-hive-2.3
#	dev/deps/spark-deps-hadoop-3.2-hive-2.3
#	pom.xml
#	sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala

Trigger GithubAction

c1e98af

wangyum mentioned this pull request Jan 18, 2021

[SPARK-33696][BUILD][SQL] Upgrade built-in Hive to 2.3.8 #30657

Closed

iemejia mentioned this pull request Jan 18, 2021

[WIP][SPARK-27733][CORE] Upgrade Avro to 1.9.2 #27609

Closed

heuermh mentioned this pull request Jan 20, 2021

[SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 #26804

Closed

wangyum closed this Jan 20, 2021

wangyum deleted the parquet-avro-hive branch February 27, 2021 05:16

LorenzoMartini mentioned this pull request Apr 19, 2021

[SPARK-33696][BUILD][SQL] Upgrade built-in Hive to 2.3.8 palantir/spark#756

Merged
