Skip to content

Incorrect behavior when reading unsigned integers from Parquet #1067

@andygrove

Description

@andygrove

Describe the bug

In makeParquetFileAllTypes we create some columns with unsigned types:

  optional int32                    _9(UINT_8);
  optional int32                    _10(UINT_16);
  optional int32                    _11(UINT_32);
  optional int64                    _12(UINT_64);

In existing tests, we have TODO comments and skip testing these columns. For example, in CometExpressionSuite we have:

  test("round") {
    ...
        withParquetTable(path.toString, "tbl") {
          for (s <- Seq(-5, -1, 0, 1, 5, -1000, 1000, -323, -308, 308, -15, 15, -16, 16, null)) {
            // array tests
            // TODO: enable test for unsigned ints (_9, _10, _11, _12)
            // TODO: enable test for floats (_6, _7, _8, _13)
            for (c <- Seq(2, 3, 4, 5, 9, 15, 16, 17)) {
              checkSparkAnswerAndOperator(s"select _${c}, round(_${c}, ${s}) FROM tbl")
            }

I tried enabling these tests and see that it results in incorrect results.

== Results ==
!== Correct Answer - 356 ==                   == Spark Answer - 356 ==
 struct<_9:smallint,round(_9, -5):smallint>   struct<_9:smallint,round(_9, -5):smallint>
![-1,0]                                       [0,0]
![-100,0]                                     [0,0]
![-103,0]                                     [0,0]
![-105,0]                                     [0,0]
...

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions