[C++] data corruption when using group_by and aggregate on large data sets #36295

@adams-brian

Description

Describe the bug, including details regarding any error messages, version, and platform.

We recently found that using group_by followed by aggregate on large data sets can silently corrupt the resulting data.

I was able to create a minimal reproducible example:

import pyarrow as pa
import pyarrow.compute as pc

COLUMN_COUNT = 5  # <- 4 works fine, 5 causes data corruption
LENGTH = 100_000_000
data = {}
# 'index' = [0, 1, 2, ... , LENGTH-2, LENGTH-1]
data['index'] = pc.indices_nonzero(pc.if_else(True, True, pa.nulls(LENGTH, pa.bool_())))
for i in range(COLUMN_COUNT):
    # fill 'i' with i (ex: '3' = [3, 3, 3, ... 3, 3])
    data[f'{i}'] = pa.nulls(LENGTH, pa.uint64()).fill_null(i)
t = pa.table(data)  # <- create table from data
print('-------------------- ORIGINAL --------------------')
print(t)
a = t.group_by(t.column_names).aggregate([])  # <- should behave like a no-op
a = a.combine_chunks()  # <- not necessary, just improves the print formatting
print('-------------- GROUP_BY / AGGREGATE --------------')
print(a)

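In this example the table is grouped by every column, including the unique 'index' column, with no aggregations, so each row forms its own group and the result should contain the same rows as the input. In other words, group_by and aggregate are set up to behave like a no-op. Everything works fine with COLUMN_COUNT <= 4: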

-------------------- ORIGINAL --------------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
-------------- GROUP_BY / AGGREGATE --------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]

...but the same code produces corrupted data if COLUMN_COUNT >= 5 (compare the trailing values of each column before and after):

-------------------- ORIGINAL --------------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
4: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
4: [[4,4,4,4,4,...,4,4,4,4,4]]
-------------- GROUP_BY / AGGREGATE --------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
4: uint64
----
index: [[0,1,2,3,4,...,3,3,3,3,3]]
0: [[0,0,0,0,0,...,4,4,4,4,4]]
1: [[1,1,1,1,1,...,10521510,10521511,10521512,10521513,10521514]]
2: [[2,2,2,2,2,...,0,0,0,0,0]]
3: [[3,3,3,3,3,...,1,1,1,1,1]]
4: [[4,4,4,4,4,...,2,2,2,2,2]]
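The corruption can also be flagged programmatically rather than by reading the printed output. The following is only a minimal sketch: the helper name `constant_column_violations` is hypothetical, and it relies on the convention from the example above that every column named with a digit string should contain only that number. It works on plain Python lists, which I assume could be obtained from a small slice of the result (for example via `Table.slice(...).to_pydict()`); here the trailing values are copied from the corrupted output above.

```python
def constant_column_violations(columns):
    """Return {column_name: sorted unexpected values} for constant columns.

    A column whose name is a digit string (e.g. '3') is expected to hold
    only that number; any other value is reported as a violation.
    """
    violations = {}
    for name, values in columns.items():
        if not name.isdigit():
            continue  # skip 'index', which is not a constant column
        expected = int(name)
        unexpected = sorted({v for v in values if v != expected})
        if unexpected:
            violations[name] = unexpected
    return violations


# Trailing values copied from the corrupted GROUP_BY / AGGREGATE output above.
corrupted_tail = {
    'index': [3, 3, 3, 3, 3],
    '0': [4, 4, 4, 4, 4],
    '1': [10521510, 10521511, 10521512, 10521513, 10521514],
    '2': [0, 0, 0, 0, 0],
    '3': [1, 1, 1, 1, 1],
    '4': [2, 2, 2, 2, 2],
}

# Leading values from the same output, which are still intact.
intact_head = {
    'index': [0, 1, 2, 3, 4],
    '0': [0] * 5,
    '1': [1] * 5,
    '2': [2] * 5,
    '3': [3] * 5,
    '4': [4] * 5,
}

print(constant_column_violations(corrupted_tail))
print(constant_column_violations(intact_head))
```

For the values shown above, every constant column in the tail is reported as containing values that belong to a different column, while the head of the table reports no violations.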

Component(s)

Python

Version

pyarrow==12.0.1. I downgraded to pyarrow==11, pyarrow==10, pyarrow==9, and pyarrow==8 and observed the above behavior in those versions as well.

Platform

Linux (Ubuntu 22.04 LTS)
