DRILL-4411: hash join should limit batch based on size and number of records #381

minji-kim wants to merge 1 commit into apache:master
Conversation
```diff
  private HashJoinBatch outgoingJoinBatch = null;
  ...
- private static final int TARGET_RECORDS_PER_BATCH = 4000;
+ private int targetRecordsPerBatch = 4000;
```
I would add a comment here about when this value will be mutated as we are moving it away from immutability, especially since most of the other operators currently have this as an immutable value.
I changed it so that the batch size is not adjusted only once, and the minimum batch size (in number of records) is kept at least 1.
```java
  container.clear();
  outputAllocator.close();
```
outputAllocator can be null, right?
Yes, it could be null if we close before calling next()/buildSchema(). I will add a check for null.
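The guard discussed above can be sketched as follows. This is a minimal illustration, not Drill's actual code: the field and allocator types are hypothetical stand-ins for the real `outputAllocator`, which stays null if `close()` runs before `next()`/`buildSchema()` ever allocated it.

```java
// Hypothetical sketch of a null-guarded close(); names are stand-ins,
// not Drill's actual classes.
class OutputHolder implements AutoCloseable {
    static class StubAllocator implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Remains null when close() is called before next()/buildSchema()
    // ever created the allocator.
    StubAllocator outputAllocator = null;

    @Override
    public void close() {
        if (outputAllocator != null) {  // guard: may never have been allocated
            outputAllocator.close();
        }
    }
}
```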
…ber of records

- Make batch size (in bytes and records) adjustments more than once
- Minor change: check for null outputAllocator before closing
@minji-kim Thanks for putting the effort into making this change. Unfortunately, a more sophisticated version of this was recently completed in PR #1227 and will be merged soon. Since the objectives of the two changes are the same, I will close this PR. Sorry that your change did not receive any attention earlier.
Right now, hash joins can run out of memory when records are large, since a batch is limited only by record count (4000 records), not by size in bytes. This patch implements a simple heuristic: if the output allocator grows beyond 10 MB before 4000 records have been output (say, after 2000 records), then the record limit is set to 2000 for future batches.
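The heuristic above can be sketched as a small sizing helper. This is an illustrative sketch only, not the patch's actual code: the class, method names, and the exact adjustment policy (shrink the record cap to the count reached when the 10 MB limit was exceeded, never below 1, re-checked after every batch) are assumptions drawn from the description and the review discussion.

```java
// Hypothetical sketch of the batch-sizing heuristic described above.
public class BatchSizer {
    static final int INITIAL_RECORDS_PER_BATCH = 4000;     // hard record cap
    static final long MAX_BATCH_BYTES = 10L * 1024 * 1024; // 10 MB size cap

    private int targetRecordsPerBatch = INITIAL_RECORDS_PER_BATCH;

    // Called after each batch with the bytes held by the output allocator
    // and the number of records emitted in that batch. May fire more than
    // once over the life of the operator, per the review discussion.
    void adjust(long allocatedBytes, int recordsEmitted) {
        if (allocatedBytes > MAX_BATCH_BYTES) {
            // Cap future batches at the record count reached when the size
            // limit was exceeded; never let the cap drop below 1 record.
            targetRecordsPerBatch =
                Math.max(1, Math.min(targetRecordsPerBatch, recordsEmitted));
        }
    }

    int targetRecordsPerBatch() {
        return targetRecordsPerBatch;
    }
}
```

For example, a batch that reaches 11 MB after only 2000 records lowers the cap to 2000; later batches that stay under 10 MB leave the cap unchanged.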