[C++] data corruption when using `group_by` and `aggregate` on large data sets

### Describe the bug, including details regarding any error messages, version, and platform.

We recently hit data corruption when using `group_by` and `aggregate` on large data sets.

I was able to create a minimal reproducible example:

```python
import pyarrow as pa
import pyarrow.compute as pc

COLUMN_COUNT = 5  # <- 4 works fine, 5 causes data corruption
LENGTH = 100_000_000
data = {}
# 'index' = [0, 1, 2, ... , LENGTH-2, LENGTH-1]
data['index'] = pc.indices_nonzero(pc.if_else(True, True, pa.nulls(LENGTH, pa.bool_())))
for i in range(COLUMN_COUNT):
    # fill 'i' with i (ex: '3' = [3, 3, 3, ... 3, 3])
    data[f'{i}'] = pa.nulls(LENGTH, pa.uint64()).fill_null(i)
t = pa.table(data)  # <- create table from data
print('-------------------- ORIGINAL --------------------')
print(t)
a = t.group_by(t.column_names).aggregate([])  # <- should behave like a no-op
a = a.combine_chunks()  # <- not necessary, just improves the print formatting
print('-------------- GROUP_BY / AGGREGATE --------------')
print(a)
```

In this example, `group_by` over every column with an empty `aggregate` should behave like a no-op (the `index` column makes every row unique), and the output matches the input when `COLUMN_COUNT <= 4`:

```
-------------------- ORIGINAL --------------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
-------------- GROUP_BY / AGGREGATE --------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
```

...but with `COLUMN_COUNT >= 5` the output is corrupted (note the trailing values of every column):

```
-------------------- ORIGINAL --------------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
4: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
4: [[4,4,4,4,4,...,4,4,4,4,4]]
-------------- GROUP_BY / AGGREGATE --------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
4: uint64
----
index: [[0,1,2,3,4,...,3,3,3,3,3]]
0: [[0,0,0,0,0,...,4,4,4,4,4]]
1: [[1,1,1,1,1,...,10521510,10521511,10521512,10521513,10521514]]
2: [[2,2,2,2,2,...,0,0,0,0,0]]
3: [[3,3,3,3,3,...,1,1,1,1,1]]
4: [[4,4,4,4,4,...,2,2,2,2,2]]
```

### Component(s)

Python

### Version

`pyarrow==12.0.1`. I also downgraded to `pyarrow==11`, `pyarrow==10`, `pyarrow==9`, and `pyarrow==8` and observed the same behavior in each of those versions.

### Platform

Linux (Ubuntu 22.04 LTS)

[C++] data corruption when using `group_by` and `aggregate` on large data sets #36295