
Support reading DECIMAL(18,2) columns from Parquet #89

Closed

Description

@ash211

We're seeing the stack trace below when reading Parquet files with the schema given further down:

! java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
! at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52) ~[parquet-column-1.7.0.jar:1.7.0]
! at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:274) ~[spark-sql_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.sql.execution.vectorized.ColumnVector.getDecimal(ColumnVector.java:588) ~[spark-sql_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) ~[na:na]
! at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) ~[spark-sql_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) ~[spark-sql_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) ~[spark-sql_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) ~[spark-sql_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) ~[spark-core_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) ~[spark-core_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) ~[spark-core_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) ~[spark-core_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) ~[spark-core_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) ~[spark-core_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.scheduler.Task.run(Task.scala:86) ~[spark-core_2.11-2.0.1.jar:2.0.1]
! at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) ~[spark-core_2.11-2.0.1.jar:2.0.1]
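
From the trace, the vectorized reader appears to map DECIMAL(18,2) to a long-backed column (18 digits fit in a long) and then calls Dictionary.decodeToLong on a dictionary-encoded page; PlainBinaryDictionary only supports binary decoding, hence the UnsupportedOperationException. Until that path handles binary-backed decimals, a possible workaround (a sketch, not verified against this file) is to disable the vectorized Parquet reader:

import org.apache.spark.sql.SparkSession

// Fall back to the row-based Parquet reader, which avoids the
// vectorized dictionary-decode path that throws above.
val spark = SparkSession.builder()
  .config("spark.sql.parquet.enableVectorizedReader", "false")
  .getOrCreate()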

Anonymized schema:

message parquetSchema {
  optional int32 COL_1 (INT_32);
  optional int32 COL_2 (INT_32);
  optional binary COL_3 (DECIMAL(18,2));
  optional binary COL_4 (UTF8);
  optional binary COL_5 (UTF8);
  optional int32 COL_6 (INT_32);
  optional int64 COL_7 (TIMESTAMP_MILLIS);
  optional int64 COL_8 (TIMESTAMP_MILLIS);
  optional int32 COL_9 (INT_32);
  optional binary COL_10 (UTF8);
  optional int64 COL_11 (TIMESTAMP_MILLIS);
  optional int64 COL_12 (TIMESTAMP_MILLIS);
}

COL_3 is the one causing problems.
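
For reference, a minimal reproduction sketch against the anonymized schema (the path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Any action that materializes COL_3 (the binary-backed DECIMAL(18,2))
// hits the failing decode path during the scan.
val df = spark.read.parquet("/path/to/anonymized.parquet")
df.select("COL_3").show()

The failure only seems to surface once COL_3 is actually read, since the dictionary decode happens during the scan.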
