-
Notifications
You must be signed in to change notification settings - Fork 178
Enhancement: Optimize Pax table insertion performance #1364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Enhancement: Optimize Pax table insertion performance #1364
Conversation
7f0085c
to
a342620
Compare
a342620
to
72d88a1
Compare
688b7cf
to
50ef57c
Compare
50ef57c
to
254eb7b
Compare
bc3ca05
to
d479902
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with single comment
if (strategy_->ShouldSplit(writer_->PhysicalSize(), num_tuples_)) { | ||
// Sampled split check to reduce PhysicalSize() overhead | ||
// We first perform a sampled pre-write check to avoid empty files. | ||
if ((num_tuples_ % PAX_SPLIT_CHECK_INTERVAL) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be?
if ((num_tuples_ % PAX_SPLIT_CHECK_INTERVAL) == 0) {
cur_physical_size_ = writer_->PhysicalSize();
if (strategy_->ShouldSplit(cur_physical_size_, num_tuples_)) {
...
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the CTID constraint, we have to strictly enforce the accuracy of the tuple count and make sure it doesn't exceed PAX_MAX_NUM_TUPLES_PER_FILE. That's why we kept this precise check here.
On the other hand, the biggest performance hit here is the PhysicalSize() function. So to reduce the overhead of calling it so often, we only check the file size every PAX_SPLIT_CHECK_INTERVAL tuples.
d479902
to
eee39f5
Compare
1. Explicit inline functions. 2. Instead of checking the physical size of the file every time a tuple is inserted, it is checked every 16 tuples. performance result ``` create table t1(a int, b int, c int, d int, e text, f text,g text, h text) using pax with(compresstype =zstd,compresslevel=5); gpadmin=# insert into t1 select i, i+1,i+2,i+3, i::text, i::text, i::text, i::text from generate_series(1,5000000) i; INSERT 0 5000000 Time: 6124.535 ms (00:06.125) gpadmin=# insert into t1 select i, i+1,i+2,i+3, i::text, i::text, i::text, i::text from generate_series(1,5000000) i; INSERT 0 5000000 Time: 5993.682 ms (00:05.994) -- optimized with this commit create table t1(a int, b int, c int, d int, e text, f text,g text, h text) using pax with(compresstype =zstd,compresslevel=5); gpadmin=# insert into t1 select i, i+1,i+2,i+3, i::text, i::text, i::text, i::text from generate_series(1,5000000) i; INSERT 0 5000000 Time: 5713.184 ms (00:05.713) gpadmin=# insert into t1 select i, i+1,i+2,i+3, i::text, i::text, i::text, i::text from generate_series(1,5000000) i; INSERT 0 5000000 Time: 5430.221 ms (00:05.430) ```
eee39f5
to
4cc222b
Compare
performance result
Fixes #ISSUE_Number
What does this PR do?
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions