Description
Problem
Chunk scanning is a great feature to isolate faulty items in fault-tolerant chunk-oriented steps. However, the current implementation of this feature is quite "naive" and performs poorly. In fact, if an item is erroneous, the chunk is scanned item by item where each item is reprocessed in its own transaction. This means for a chunk size of 1000, if an item is faulty, there will be 1001 transaction (the first transaction that failed for the entire chunk + 1000 transactions, one for each item).
Suggested solution
Chunk scanning is similar to searching an item in a list. The idea of "binary chunk scanning" is to apply the same principle of binary search to look for the faulty item, with one difference: instead of applying a transaction for each item, we would apply a transaction for each half of the chunk on each iteration.
Here is an example: Let's say we have a chunk of 8 items [A, B, C, D, E, F, G, H]
and that C
is the faulty item we are looking for. The binary chunk scanning implementation would apply the following transactions:
So for a chunk size of 1000, there will be only 21 transactions compared to 1001 with the current approach. This is two orders of magnitude faster and should greatly improve the performance of fault-tolerant chunk-oriented steps.
Thoughts / feedback / ideas are welcome!