[DO-NOT-MERGE] Test compatibility for Parquet 1.11.1, Avro 1.10.1 and Hive 2.3.8 #30517

Closed
wangyum wants to merge 11 commits from the parquet-avro-hive branch

Conversation

wangyum
Member

@wangyum wangyum commented Nov 26, 2020

What changes were proposed in this pull request?

  1. Test compatibility for Parquet 1.11.1, Avro 1.10.1 and Hive 2.3.8.
  2. Benchmark Parquet column index: #30517 (comment).

Building a runnable distribution to test:

git clone https://github.com/apache/spark.git && cd spark
git fetch origin pull/30517/head:parquet-1.11.1
git checkout parquet-1.11.1

./dev/make-distribution.sh --name parquet-1.11.1 --tgz -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes -Phadoop-2.7
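
Once the build finishes, it can help to confirm which Parquet, Avro, and Hive jars actually ended up in the distribution. A minimal sketch, assuming the build leaves its jars under `dist/jars`; the function name `list_bundled_jars` is hypothetical and not part of the Spark build scripts:

```shell
# List the Parquet, Avro, and Hive jars found in a jars directory.
# Usage: list_bundled_jars <jars-dir>
# The default directory "dist/jars" is an assumption about the build layout.
list_bundled_jars() {
  # Keep only file names that start with parquet-, avro-, or hive-.
  ls "${1:-dist/jars}" | grep -E '^(parquet|avro|hive)-.*\.jar$' | LC_ALL=C sort
}
```

Running `list_bundled_jars dist/jars` from the Spark checkout should then print one jar name per line, with the version embedded in each file name.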

@wangyum wangyum changed the title from "[DO-NOT-MERGE] Test compatibility for Parquet 1.11.0, Avro 1.10.0 and Hive 2.3.8" to "[DO-NOT-MERGE][test-maven] Test compatibility for Parquet 1.11.0, Avro 1.10.0 and Hive 2.3.8" on Nov 26, 2020
@wangyum
Member Author

wangyum commented Nov 26, 2020

retest this please.

@SparkQA

SparkQA commented Nov 26, 2020

Test build #131846 has finished for PR 30517 at commit 129b468.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2020

Test build #131847 has finished for PR 30517 at commit 23b0bba.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2020

Test build #131849 has finished for PR 30517 at commit d0573a3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Nov 26, 2020

Failed to download Hive 2.3.8 from my GitHub repository:

Downloading from github: https://maven.pkg.github.com/wangyum/hive/org/apache/hive/hive-common/2.3.8-SNAPSHOT/maven-metadata.xml
Downloading from apache.snapshots: https://repository.apache.org/snapshots/org/apache/hive/hive-common/2.3.8-SNAPSHOT/maven-metadata.xml
[WARNING] Could not transfer metadata org.apache.hive:hive-common:2.3.8-SNAPSHOT/maven-metadata.xml from/to github (https://maven.pkg.github.com/wangyum/hive): Authentication failed for https://maven.pkg.github.com/wangyum/hive/org/apache/hive/hive-common/2.3.8-SNAPSHOT/maven-metadata.xml 401 Unauthorized
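
The 401 response suggests the GitHub Packages Maven registry rejected the unauthenticated request. That registry generally expects credentials for downloads, typically supplied through a `<server>` entry in Maven's `settings.xml` whose `<id>` matches the repository id (`github` in the log above). A minimal sketch that prints such an entry; the function name, user name, and token are placeholders and not taken from this PR:

```shell
# Print a Maven settings.xml <server> entry for the repository id "github"
# (the id shown in the log above).
# Usage: github_server_entry <github-user> <personal-access-token>
# Both arguments are placeholders supplied by the caller.
github_server_entry() {
  cat <<EOF
<server>
  <id>github</id>
  <username>$1</username>
  <password>$2</password>
</server>
EOF
}
```

The printed block would go inside the `<servers>` element of `~/.m2/settings.xml`. A CI job without such credentials would likely keep failing the same way.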

@wangyum
Member Author

wangyum commented Nov 26, 2020

- create hive serde table with Catalog
*** RUN ABORTED ***
  java.lang.NoSuchMethodError: 'void org.apache.avro.Schema$Field.<init>(java.lang.String, org.apache.avro.Schema, java.lang.String, org.codehaus.jackson.JsonNode)'
  at org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.createAvroField(TypeInfoToSchema.java:76)
  at org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.convert(TypeInfoToSchema.java:61)
  at org.apache.hadoop.hive.serde2.avro.AvroSerDe.getSchemaFromCols(AvroSerDe.java:170)
  at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:114)
  at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
  at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
  at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
  at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)
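
The trace shows Hive's `TypeInfoToSchema` calling a `Schema.Field` constructor that takes an `org.codehaus.jackson.JsonNode`, which the Avro jar on the classpath does not provide. One way to check a given Avro jar for that constructor is to inspect the class with `javap`. A minimal sketch, assuming a JDK is installed; the function name is hypothetical:

```shell
# Succeed if org.apache.avro.Schema$Field in the given jar still declares a
# member that takes the Jackson 1.x type org.codehaus.jackson.JsonNode.
# Usage: has_jackson_field_ctor <path-to-avro-jar>
has_jackson_field_ctor() {
  # javap prints the public members of the class with fully qualified types.
  javap -cp "$1" 'org.apache.avro.Schema$Field' \
    | grep -q 'org\.codehaus\.jackson\.JsonNode'
}
```

For example, `has_jackson_field_ctor path/to/avro.jar && echo present || echo missing`, where the jar path is whatever Avro jar the build picked up.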

@dongjoon-hyun
Member

Can we try with Apache Parquet 1.11.1?

@wangyum wangyum changed the title from "[DO-NOT-MERGE][test-maven] Test compatibility for Parquet 1.11.0, Avro 1.10.0 and Hive 2.3.8" to "[DO-NOT-MERGE][test-maven] Test compatibility for Parquet 1.11.1, Avro 1.10.0 and Hive 2.3.8" on Nov 27, 2020
@wangyum
Member Author

wangyum commented Nov 27, 2020

> Can we try with Apache Parquet 1.11.1?

Yes, it's Parquet 1.11.1. I forgot to change the title.

@wangyum
Member Author

wangyum commented Nov 27, 2020

- alter hive serde table add columns -- partitioned - AVRO *** FAILED ***
  org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.avro.AvroRuntimeException: Unknown datum class: class org.codehaus.jackson.node.NullNode;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
  at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:346)
  at org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:166)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3680)

@wangyum wangyum marked this pull request as draft November 27, 2020 05:51
@SparkQA

SparkQA commented Nov 27, 2020

Test build #131867 has finished for PR 30517 at commit 5937a41.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2020

Test build #131869 has finished for PR 30517 at commit 8815b0c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member

sunchao commented Nov 28, 2020

Thanks @wangyum for working on this! Did you encounter any other issues besides the NullNode one?

@wangyum
Member Author

wangyum commented Nov 29, 2020

sql/hive, sql/thriftserver and external/avro should be fine. sql/core has some issues, e.g.:

mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite test
- Spark vectorized reader - with partition data column - select nullable complex field and having is not null predicate *** FAILED ***
  Results do not match for query:
  Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
  Timezone Env:

  == Parsed Logical Plan ==
  'Project ['employer.company]
  +- 'Filter (isnotnull('employer) AND ('p = 1))
     +- 'UnresolvedRelation [contacts], [], false

  == Analyzed Logical Plan ==
  company: struct<name:string,address:string>
  Project [employer#7739.company AS company#7772]
  +- Filter (isnotnull(employer#7739) AND (p#7741 = 1))
     +- SubqueryAlias contacts
        +- RelationV2[id#7733, name#7734, address#7735, pets#7736, friends#7737, relatives#7738, employer#7739, relations#7740, p#7741] parquet file:/root/opensource/spark/sql/core/target/tmp/spark-bdb1b34b-cf6a-462d-8caa-fcd923df3fe3/contacts

  == Optimized Logical Plan ==
  Project [employer#7739.company AS company#7772]
  +- Filter isnotnull(employer#7739)
     +- RelationV2[employer#7739, p#7741] parquet file:/root/opensource/spark/sql/core/target/tmp/spark-bdb1b34b-cf6a-462d-8caa-fcd923df3fe3/contacts

  == Physical Plan ==
  *(1) Project [employer#7739.company AS company#7772]
  +- *(1) Filter isnotnull(employer#7739)
     +- BatchScan[employer#7739, p#7741] ParquetScan DataFilters: [isnotnull(employer#7739)], Format: parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/sql/core/target/tmp/spark-bdb1b34b-cf6a-462d-8caa-f..., PartitionFilters: [isnotnull(p#7741), (p#7741 = 1)], PushedFilers: [IsNotNull(p), EqualTo(p,1)], ReadSchema: struct<employer:struct<company:struct<name:string,address:string>>>, PushedFilters: [IsNotNull(p), EqualTo(p,1)]

  == Results ==

  == Results ==
  !== Correct Answer - 2 ==      == Spark Answer - 0 ==
   struct<>                      struct<>
  ![[abc,123 Business Street]]
  ![null] (QueryTest.scala:243)

@wangyum
Member Author

wangyum commented Dec 2, 2020

retest this please.

@SparkQA

SparkQA commented Dec 2, 2020

Test build #132010 has finished for PR 30517 at commit c617757.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2020

Test build #132027 has finished for PR 30517 at commit 26badc4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Dec 2, 2020

- support user provided non-nullable avro schema for nullable catalyst schema without any null record *** FAILED ***
  "Job aborted due to stage failure: Task 1 in stage 131.0 failed 1 times, most recent failure: Lost task 1.0 in stage 131.0 (TID 238) (192.168.10.30 executor driver): org.apache.spark.SparkException: Task failed while writing rows.
  	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296)
  	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
  	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  	at org.apache.spark.scheduler.Task.run(Task.scala:131)
  	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
  	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
  	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  	at java.lang.Thread.run(Thread.java:748)

@github-actions github-actions bot added the AVRO label Dec 3, 2020
@SparkQA

SparkQA commented Dec 3, 2020

Test build #132084 has started for PR 30517 at commit 795c276.

@SparkQA

SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36683/

@SparkQA

SparkQA commented Dec 3, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36683/

@SparkQA

SparkQA commented Dec 3, 2020

Test build #132075 has finished for PR 30517 at commit c4068ce.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36744/

@heuermh
Contributor

heuermh commented Jan 6, 2021

With the changes in bigdatagenomics/adam#2289 to remove various workarounds, this pull request works for us.

Non-binding +1

@wangyum
Member Author

wangyum commented Jan 8, 2021

retest this please

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38422/

@SparkQA

SparkQA commented Jan 8, 2021

Test build #133833 has finished for PR 30517 at commit 7ffbd9d.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38422/

# Conflicts:
#	dev/deps/spark-deps-hadoop-2.7-hive-2.3
#	dev/deps/spark-deps-hadoop-3.2-hive-2.3
#	pom.xml
#	sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala
@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38430/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38430/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38431/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38431/

@SparkQA

SparkQA commented Jan 8, 2021

Test build #133841 has finished for PR 30517 at commit 3a2a6ad.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2021

Test build #133842 has finished for PR 30517 at commit 9425b1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38451/

@SparkQA

SparkQA commented Jan 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38451/

@SparkQA

SparkQA commented Jan 9, 2021

Test build #133862 has finished for PR 30517 at commit c1e98af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Jan 18, 2021
### What changes were proposed in this pull request?

Hive 2.3.8 changes:
HIVE-19662: Upgrade Avro to 1.8.2
HIVE-24324: Remove deprecated API usage from Avro
HIVE-23980: Shade Guava from hive-exec in Hive 2.3
HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue
HIVE-24512: Exclude calcite in packaging.
HIVE-22708: Fix for HttpTransport to replace String.equals
HIVE-24551: Hive should include transitive dependencies from calcite after shading it
HIVE-24553: Exclude calcite from test-jar dependency of hive-exec

### Why are the changes needed?

Upgrade Avro and Parquet to the latest versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests, plus a trial upgrade of Parquet to 1.11.1 and Avro to 1.10.1 in #30517.

Closes #30657 from wangyum/SPARK-33696.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@wangyum wangyum closed this Jan 20, 2021
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
@wangyum wangyum deleted the parquet-avro-hive branch February 27, 2021 05:16
LorenzoMartini pushed a commit to palantir/spark that referenced this pull request Apr 19, 2021
LorenzoMartini pushed a commit to palantir/spark that referenced this pull request Apr 19, 2021
16pierre pushed a commit to 16pierre/spark that referenced this pull request May 24, 2021