Increase max possible value for write.batch-size #17737
Conversation
In some cases it is desirable to have bigger batches.
This is no worse than having no limit. There's no memory accounting anyway, so I guess it doesn't matter too much. People setting such large values probably know what they are doing.
Maybe we can remove the limit entirely instead; it's just limited by MAX_INT anyway.
LGTM / I don't feel strongly about this
```diff
@@ -21,7 +21,7 @@
 public class JdbcWriteConfig
 {
-    static final int MAX_ALLOWED_WRITE_BATCH_SIZE = 1_000_000;
+    public static final int MAX_ALLOWED_WRITE_BATCH_SIZE = 10_000_000;
```
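For context, here is a minimal standalone sketch of how such a bounded config property can be enforced. The class and method names below are hypothetical illustrations (this is not the actual Trino `JdbcWriteConfig`, which is wired through the configuration framework); only the constant name and the new ceiling come from the PR.

```java
// Hypothetical sketch, not the real JdbcWriteConfig: illustrates enforcing
// the raised write.batch-size ceiling when the property is set.
public class JdbcWriteConfigSketch
{
    // ceiling raised by this PR from 1_000_000 to 10_000_000
    public static final int MAX_ALLOWED_WRITE_BATCH_SIZE = 10_000_000;

    // default value chosen here for illustration only
    private int writeBatchSize = 1000;

    public int getWriteBatchSize()
    {
        return writeBatchSize;
    }

    public JdbcWriteConfigSketch setWriteBatchSize(int writeBatchSize)
    {
        if (writeBatchSize < 1 || writeBatchSize > MAX_ALLOWED_WRITE_BATCH_SIZE) {
            throw new IllegalArgumentException(
                    "write.batch-size must be between 1 and " + MAX_ALLOWED_WRITE_BATCH_SIZE);
        }
        this.writeBatchSize = writeBatchSize;
        return this;
    }

    public static void main(String[] args)
    {
        JdbcWriteConfigSketch config = new JdbcWriteConfigSketch();
        // 5M would have been rejected under the old 1M ceiling
        config.setWriteBatchSize(5_000_000);
        System.out.println(config.getWriteBatchSize()); // prints 5000000
    }
}
```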
After some experience over time, I've noticed this isn't as useful as you'd expect.
Note that the batch size only defines a ceiling: for the table writer task to create such large batches, the input splits must include that many rows, which is very unlikely in practice due to the limited size of splits/pages sent to table writer tasks.
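The ceiling behavior described above can be sketched as follows. All names here are hypothetical (not Trino's actual writer code): rows accumulate across incoming pages and a batch is executed only when it fills up, or at end of input, so with small pages the configured maximum is rarely reached.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: the configured batch size is only a ceiling on
// how many rows are sent per batch execution.
public class BatchingSketch
{
    private final int maxBatchSize;
    private final List<String> batch = new ArrayList<>();
    private final List<Integer> flushedBatchSizes = new ArrayList<>();

    public BatchingSketch(int maxBatchSize)
    {
        this.maxBatchSize = maxBatchSize;
    }

    public void appendPage(List<String> pageRows)
    {
        for (String row : pageRows) {
            batch.add(row);
            if (batch.size() >= maxBatchSize) {
                flush();
            }
        }
    }

    public void finish()
    {
        if (!batch.isEmpty()) {
            flush();
        }
    }

    public List<Integer> getFlushedBatchSizes()
    {
        return flushedBatchSizes;
    }

    private void flush()
    {
        // stand-in for PreparedStatement.executeBatch()
        flushedBatchSizes.add(batch.size());
        batch.clear();
    }

    public static void main(String[] args)
    {
        // batch ceiling of 100k, but the source only delivers 10k-row pages
        BatchingSketch sink = new BatchingSketch(100_000);
        for (int i = 0; i < 3; i++) {
            sink.appendPage(Collections.nCopies(10_000, "row"));
        }
        sink.finish();
        // only 30k rows arrived in total, so one 30k batch is executed
        System.out.println(sink.getFlushedBatchSizes()); // prints [30000]
    }
}
```

The point: raising the ceiling only helps if upstream actually delivers enough rows between flushes.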
In an ideal system, once the batch size is increased we would also signal it to the engine, so that the table writer gets sufficiently large pages. E.g., even with batch-size = 100k we can see just 10k rows being sent per INSERT.
But we don't need to synchronize pages with batches. If there is not much data (not enough pages), the latency of inserted batches could perhaps even be lower. When there is a huge amount of data, though, this change was about increasing insert throughput.
I agree with that; however, I think we can improve this further, since in practice it's not always guaranteed that you'll get enough pages in time to fill larger batches.
Let's continue offline and decide the next steps, if any.