
Conversation

@alamb alamb commented Nov 6, 2025

It basically restores the original code path when all views fit in a single buffer, and only computes the GcCopyGroup when needed. I hope this will restore performance.

}
let (views_buf, data_blocks) = if total_large < i32::MAX as usize {
// fast path, the entire data fits in a single buffer
// 3) Allocate exactly capacity for all non-inline data
alamb (Author)

This restores the original code and only drops into the slow path when needed.
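The restored branch can be sketched as follows (illustrative names and a reduced signature, not the actual arrow-rs internals): the decision hinges only on whether the total non-inline byte count fits in a single i32-indexed buffer.

```rust
// Hypothetical sketch of the branch described above (illustrative
// names, not the actual arrow-rs internals): the original single-buffer
// path is kept whenever the total non-inline byte count fits in an
// i32-indexed buffer, and copy groups are computed only otherwise.
#[derive(Debug, PartialEq)]
enum GcPath {
    /// All non-inline data fits in one buffer (original fast path).
    SingleBuffer,
    /// Data must be split into multiple copy groups (new slow path).
    CopyGroups,
}

fn choose_path(total_large: usize) -> GcPath {
    if total_large < i32::MAX as usize {
        GcPath::SingleBuffer
    } else {
        GcPath::CopyGroups
    }
}
```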

}
},
);
views_buf.extend(new_views);
alamb (Author)

I also tried to make this faster using extend rather than push, which might help.
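The push-vs-extend change can be illustrated with a minimal sketch (hypothetical helper functions, not the PR's code): `Vec::extend` can reserve capacity once using the iterator's size hint, while a `push` loop may reallocate repeatedly as the vector grows.

```rust
// Hypothetical illustration (not the PR's actual code) of building a
// views buffer with `push` versus `extend`.
fn build_with_push(views: &[u128]) -> Vec<u128> {
    let mut out = Vec::new();
    for &v in views {
        out.push(v); // may trigger several reallocations as `out` grows
    }
    out
}

fn build_with_extend(views: &[u128]) -> Vec<u128> {
    let mut out = Vec::new();
    out.extend(views.iter().copied()); // one reservation via the size hint
    out
}
```

Both produce the same result; the difference is only in how capacity is grown.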

alamb commented Nov 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/fix_gc (a962382) to 2eabb59 diff
BENCH_NAME=view_types
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench view_types
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_fix_gc
Results will be posted here when complete

alamb commented Nov 6, 2025

🤖: Benchmark completed

Details

group                                             alamb_fix_gc                           main
-----                                             ------------                           ----
gc view types all without nulls[100000]           1.01  1572.0±51.03µs        ? ?/sec    1.00  1551.2±45.37µs        ? ?/sec
gc view types all without nulls[8000]             1.00     67.7±2.87µs        ? ?/sec    1.01     68.4±3.82µs        ? ?/sec
gc view types all[100000]                         1.02    272.6±8.47µs        ? ?/sec    1.00   268.4±11.06µs        ? ?/sec
gc view types all[8000]                           1.00     18.7±0.13µs        ? ?/sec    1.01     18.9±0.06µs        ? ?/sec
gc view types slice half without nulls[100000]    1.04   562.0±10.43µs        ? ?/sec    1.00   539.7±21.37µs        ? ?/sec
gc view types slice half without nulls[8000]      1.00     27.8±0.29µs        ? ?/sec    1.00     27.9±0.26µs        ? ?/sec
gc view types slice half[100000]                  1.00    127.1±2.59µs        ? ?/sec    1.01    128.9±2.84µs        ? ?/sec
gc view types slice half[8000]                    1.01      9.6±0.03µs        ? ?/sec    1.00      9.5±0.03µs        ? ?/sec
view types slice                                  1.00    706.6±1.89ns        ? ?/sec    1.00    706.1±5.87ns        ? ?/sec

}];
let gc_copy_groups = if total_large > i32::MAX as usize {
// Slow-path: need to split into multiple copy groups
let mut groups = Vec::with_capacity(total_large / (i32::MAX as usize) + 1);
mapleFU (Owner)

@alamb Assume a bad case in which every string view is 1.5 GiB; this capacity would be smaller than expected. (This is why I didn't use an estimate previously. I think re-allocation here would not be the bottleneck, and the actual size might be more than total_large / (i32::MAX as usize) + 1.)

alamb (Author)

@alamb Assume a bad case in which every string view is 1.5 GiB; this would be smaller than expected.

I don't quite follow this logic. For example:

  • if we had 3 string views, each of 1.5 GB
  • total_large would be 4.5 GB
  • then the capacity would be (4.5 GB / 2 GB + 1) = 3

Which is how many buffers are needed 🤔

(This is why I didn't use an estimate previously. I think re-allocation here would not be the bottleneck, and the actual size might be more than total_large / (i32::MAX as usize) + 1.)

I agree reallocation is likely not the bottleneck.

mapleFU (Owner)

I mean, assume 100 bytes per BinaryView versus 1.5 GiB per BinaryView; the two cases are different. Assume the dest size is 15 GiB:

  1. 100 bytes each: the estimate gives (15 GiB / 2 GiB + 1) = 8
  2. 1.5 GiB each: it would require 10 groups, since a 2 GiB buffer can hold at most one such binary view

mapleFU (Owner)

(A better example: every BinaryView.len() is 1 GiB plus 1 byte. Or strings that interleave 1 GiB, 1 byte, 1 GiB, 1 byte, ...)
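The under-estimation point can be checked with a small sketch (hypothetical code, not the arrow-rs implementation): greedily packing ten 1.5 GiB views into groups capped at i32::MAX bytes needs 10 groups, while total_large / (i32::MAX as usize) + 1 estimates only 8, so Vec::with_capacity may under-reserve and reallocate. A performance question, not a correctness one.

```rust
// Hypothetical sketch, not the arrow-rs code: count copy groups by
// greedy packing, where each group's byte total must stay at or below
// i32::MAX, and compare with the `total / i32::MAX + 1` estimate.
fn greedy_group_count(lens: &[usize]) -> usize {
    let max = i32::MAX as usize;
    let mut groups = 1;
    let mut current = 0usize;
    for &len in lens {
        // start a new group when this view would overflow the current one
        if current > 0 && current + len > max {
            groups += 1;
            current = 0;
        }
        current += len;
    }
    groups
}

fn estimated_group_count(total: usize) -> usize {
    total / (i32::MAX as usize) + 1
}
```

With ten views of 1.5 GiB each (15 GiB total), `greedy_group_count` returns 10 while `estimated_group_count` returns 8, matching the example above.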

mapleFU commented Nov 8, 2025

Would it be better if I extract a new function for slow path?

@mapleFU mapleFU merged commit 7bab606 into mapleFU:gc-fixing-huge-batch-bug Nov 8, 2025
1 check passed
alamb commented Nov 9, 2025

Would it be better if I extract a new function for slow path?

Yes, I think that would be good. I didn't do it in this PR to keep the diff smaller.

Would you like me to do so? Or would you like to?

@alamb alamb deleted the alamb/fix_gc branch November 9, 2025 16:57
mapleFU commented Nov 10, 2025

I would like to do this when I'm back home, maybe at 7 PM UTC-8; I may not have time for this when I'm in the office :-(
