Closed
Labels
Component: C++ · Component: Python · Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) · Type: bug
Description
We recently found data corruption issues when using group_by and aggregate on large data sets.
I was able to create a minimal reproducible example:
import pyarrow as pa
import pyarrow.compute as pc
COLUMN_COUNT = 5 # <- 4 works fine, 5 causes data corruption
LENGTH = 100_000_000
data = {}
# 'index' = [0, 1, 2, ... , LENGTH-2, LENGTH-1]
data['index'] = pc.indices_nonzero(pc.if_else(True, True, pa.nulls(LENGTH, pa.bool_())))
for i in range(COLUMN_COUNT):
    # fill column 'i' with the constant i (e.g. '3' = [3, 3, 3, ..., 3, 3])
    data[f'{i}'] = pa.nulls(LENGTH, pa.uint64()).fill_null(i)
t = pa.table(data) # <- create table from data
print('-------------------- ORIGINAL --------------------')
print(t)
a = t.group_by(t.column_names).aggregate([]) # <- should behave like a no-op
a = a.combine_chunks() # <- not necessary, just improves the print formatting
print('-------------- GROUP_BY / AGGREGATE --------------')
print(a)

In this example the group_by and aggregate are set up to behave like a no-op, and everything works fine with COLUMN_COUNT <= 4:
-------------------- ORIGINAL --------------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
-------------- GROUP_BY / AGGREGATE --------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
...but it results in data corruption with COLUMN_COUNT >= 5:
-------------------- ORIGINAL --------------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
4: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
4: [[4,4,4,4,4,...,4,4,4,4,4]]
-------------- GROUP_BY / AGGREGATE --------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
4: uint64
----
index: [[0,1,2,3,4,...,3,3,3,3,3]]
0: [[0,0,0,0,0,...,4,4,4,4,4]]
1: [[1,1,1,1,1,...,10521510,10521511,10521512,10521513,10521514]]
2: [[2,2,2,2,2,...,0,0,0,0,0]]
3: [[3,3,3,3,3,...,1,1,1,1,1]]
4: [[4,4,4,4,4,...,2,2,2,2,2]]
Component(s)
Python
Version
pyarrow==12.0.1. I also downgraded to pyarrow==11, pyarrow==10, pyarrow==9, and pyarrow==8 and observed the same behavior in each of those versions.
Platform
Linux (Ubuntu 22.04 LTS)
idailylife