Increase max possible value for write.batch-size #17737

Merged

Conversation

kokosing
Member

@kokosing kokosing commented Jun 2, 2023

Increase max possible value for write.batch-size

In some cases it is desired to have bigger batches.

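For context, write.batch-size is the JDBC connector configuration property whose upper bound this change raises. A minimal illustration of raising it in a catalog properties file (the value below is only an example, not a recommendation):

write.batch-size=1000000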
@cla-bot cla-bot bot added the cla-signed label Jun 2, 2023
@kokosing kokosing requested review from wendigo and hashhar June 2, 2023 11:50
Member

@hashhar hashhar left a comment

This is no worse than having no limit. There's no memory accounting anyway, so I guess it doesn't matter too much. People setting such large values probably know what they are doing.

Maybe we could remove the limit entirely instead; it's effectively only bounded by Integer.MAX_VALUE.

Member

@hashhar hashhar left a comment

LGTM; I don't feel strongly about this.

@kokosing kokosing merged commit 26ef0fd into trinodb:master Jun 4, 2023
@kokosing kokosing deleted the origin/master/180_write-batch-limit branch June 4, 2023 19:53
@github-actions github-actions bot added this to the 419 milestone Jun 4, 2023
@@ -21,7 +21,7 @@

 public class JdbcWriteConfig
 {
-    static final int MAX_ALLOWED_WRITE_BATCH_SIZE = 1_000_000;
+    public static final int MAX_ALLOWED_WRITE_BATCH_SIZE = 10_000_000;
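For context, the constant above caps the write.batch-size property. A minimal sketch of how such a cap is commonly enforced with bean-validation annotations on a config getter; this is illustrative only, and the default value, method names, and annotation package are assumptions rather than the exact Trino code:

import jakarta.validation.constraints.Max;
import jakarta.validation.constraints.Min;

public class JdbcWriteConfig
{
    public static final int MAX_ALLOWED_WRITE_BATCH_SIZE = 10_000_000;

    // Default chosen here purely for illustration.
    private int writeBatchSize = 1000;

    @Min(1)
    @Max(MAX_ALLOWED_WRITE_BATCH_SIZE)
    public int getWriteBatchSize()
    {
        return writeBatchSize;
    }

    public JdbcWriteConfig setWriteBatchSize(int writeBatchSize)
    {
        this.writeBatchSize = writeBatchSize;
        return this;
    }
}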
Member

After some experience over time, I've noticed this isn't as useful as you'd expect.

Note that the batch size only defines a ceiling. For the table writer task to create such large batches, the input splits must contain that many rows, which is very unlikely in practice because of the limited size of the splits/pages sent to table writer tasks.
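To illustrate the "ceiling" point: in a JDBC-style page sink, rows are accumulated into a batch as pages arrive, and the batch is flushed either when it reaches the configured size or when the input runs out, so the configured maximum is only hit if pages actually deliver that many rows. A simplified sketch (class and method names here are made up for illustration, not Trino's actual JdbcPageSink):

// Simplified, illustrative batching writer; not Trino code.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class BatchingWriter
{
    private final PreparedStatement statement;
    private final int maxBatchSize; // e.g. the configured write.batch-size
    private int pendingRows;

    BatchingWriter(Connection connection, String insertSql, int maxBatchSize)
            throws SQLException
    {
        this.statement = connection.prepareStatement(insertSql);
        this.maxBatchSize = maxBatchSize;
    }

    // Called once per incoming page. A single page rarely carries millions of rows,
    // so in practice the flush below is usually triggered by the end of input,
    // not by reaching the configured ceiling.
    void appendPage(Object[][] page)
            throws SQLException
    {
        for (Object[] row : page) {
            for (int i = 0; i < row.length; i++) {
                statement.setObject(i + 1, row[i]);
            }
            statement.addBatch();
            pendingRows++;
            if (pendingRows >= maxBatchSize) {
                flush();
            }
        }
    }

    void flush()
            throws SQLException
    {
        if (pendingRows > 0) {
            statement.executeBatch();
            pendingRows = 0;
        }
    }
}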

Member

In an ideal system, once the batch size is increased, we should also somehow signal it to the engine so that the table writer receives sufficiently large pages. E.g., even with batch-size = 100k we can see only 10k rows per INSERT being sent.

Member Author

But we don't need to synchronize pages with batches. If there is not much data (not enough pages), the latency of the inserted batches could perhaps even be lower. If there is a huge amount of data, though, this change was about increasing the throughput of the insert.

Member

I agree with that; however, I think we can improve this further, since in practice it's not always guaranteed that you'll receive enough pages in time to fill the larger batches consistently.

Let's continue offline and decide the next steps if any.
