
Reading records inserted using Athena throws UOE exception when read using Spark (AWS) #5593

Closed
@a-agmon


Apache Iceberg version

0.14.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

Description

  1. An Iceberg table has been created and populated by a Spark job (using GlueCatalog).
  2. Additional data has been inserted into the table using the Athena console or API.
  3. When I attempt to read the new data from the table using Spark, the following exception is thrown:
22/08/20 22:16:05 ERROR BaseDataReader: Error reading file: s3://****/temp-testing-aa/default.db/iceberg_demo3/data/category=CAT5/d53b94be-8d2d-4892-b020-b60610a30318.gz.parquet
java.lang.UnsupportedOperationException: Cannot support vectorized reads for column [email] optional binary email (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:96)

This does not happen when the same data is inserted using Spark.
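The difference is presumably in the Parquet encodings each writer produces; the exception above points at DELTA_BYTE_ARRAY in the Athena-written file. One way to confirm this is to dump the column encodings from each file's footer with parquet-mr's ParquetFileReader, as in the sketch below (the path is a placeholder; parquet-hadoop and the S3A connector must be on the classpath):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.parquet.hadoop.util.HadoopInputFile

    // Print the encodings recorded per column chunk, e.g. to compare a
    // Spark-written file (category=CAT1) with the Athena-written one (category=CAT5).
    val input = HadoopInputFile.fromPath(
      new Path("s3a://<bucket>/temp-testing-aa/default.db/iceberg_demo/data/category=CAT5/<file>.gz.parquet"),
      new Configuration())
    val reader = ParquetFileReader.open(input)
    try {
      reader.getFooter.getBlocks.forEach { block =>
        block.getColumns.forEach { col =>
          println(s"${col.getPath} -> ${col.getEncodings}")
        }
      }
    } finally reader.close()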

Steps to reproduce:

1. Run this Spark code to create the table and insert some data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("DataExtractor")
      .config("spark.master", "local[*]")
      .config("spark.driver.host", "127.0.0.1")
      .config("spark.driver.bindAddress", "127.0.0.1")
      .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.my_catalog.warehouse", "s3://*****/temp-testing-aa")
      .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
      .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
      .getOrCreate()

    val icebergDDL =
      """
        |CREATE TABLE IF NOT EXISTS my_catalog.default.iceberg_demo (
        |    email string,
        |    data string,
        |    category string)
        |USING iceberg
        |PARTITIONED BY (category)
        |""".stripMargin

    spark.sql(icebergDDL)

    val icebergInsert =
      """
        |INSERT INTO my_catalog.default.iceberg_demo (email, data, category)
        |VALUES
        |('AABB1', 'data 1', 'CAT1'),
        |('AABB2', 'data 2', 'CAT1'),
        |('AABB3', 'data 3', 'CAT2'),
        |('AABB4', 'data 4', 'CAT3')
        |""".stripMargin

    spark.sql(icebergInsert)
    val df = spark.table("my_catalog.default.iceberg_demo")
    df.show()

The table has been created and is viewable from both Athena and Spark.

2. Use Athena to insert a record:

INSERT INTO iceberg_demo (email, data, category)  VALUES ('AABB5', 'data 5' , 'CAT5')

3. Run a Spark job again with the following lines:

    val df = spark.table("my_catalog.default.iceberg_demo")
    df.show()

4. The following exception is thrown:

22/08/20 22:48:26 INFO CatalogUtil: Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
22/08/20 22:48:27 INFO BaseMetastoreTableOperations: Refreshing table metadata from new version: s3://***/temp-testing-aa/default.db/iceberg_demo/metadata/00002-59d72a1f-8b73-4386-bb86-3b6dfc29e858.metadata.json
22/08/20 22:48:28 INFO BaseMetastoreCatalog: Table loaded by catalog: my_catalog.default.iceberg_demo
22/08/20 22:48:28 INFO V2ScanRelationPushDown: 
Output: email#0, data#1, category#2
......
.....
22/08/20 22:48:32 ERROR BaseDataReader: Error reading file: s3://****/temp-testing-aa/default.db/iceberg_demo/data/category=CAT5/ff78aec0-344e-489a-9206-ab977cd9897b.gz.parquet
java.lang.UnsupportedOperationException: Cannot support vectorized reads for column [email] optional binary email (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:96)
	at org.apache.iceberg.parquet.BasePageIterator.initFromPage(BasePageIterator.java:136)
	at org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:103)
	at org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:94)

This also happens when I try to set:

    config("spark.sql.parquet.enableVectorizedReader", "false")
