multiprocessing.pool ThreadPool.imap does not respect memory scarcity #101586
Description
Bug report
ThreadPool.imap (and imap_unordered) consumes large amounts of memory unnecessarily, degrading the performance of systems that use it without any benefit to the throughput of imap itself. When used to process images, it has consumed many GB of RAM and caused out-of-memory failures.
The problem is that ThreadPool.imap internally iterates over the input iterator and loads inputs into a waiting task queue without limit: it waits neither for tasks to be executed nor for their outputs to be read by whatever consumes the imap iterator.
A small example can illustrate the problem:
import time
from multiprocessing.pool import ThreadPool


def slow_function(x: int):
    time.sleep(0.05)
    print("processed inside imap: ", x)


def report_source(x: int):
    print(f"generated input {x}")
    return x


fast_source = map(report_source, range(16))
with ThreadPool(processes=2) as pool:
    list(pool.imap(slow_function, fast_source, chunksize=1))
When run, we see:
$ python reproduction.py
generated input 0
generated input 1
generated input 2
generated input 3
generated input 4
generated input 5
generated input 6
generated input 7
generated input 8
generated input 9
generated input 10
generated input 11
generated input 12
generated input 13
generated input 14
generated input 15
processed inside imap: 0
processed inside imap: 1
processed inside imap: 2
processed inside imap: 3
processed inside imap: 4
processed inside imap: 5
processed inside imap: 6
processed inside imap: 7
processed inside imap: 8
processed inside imap: 9
processed inside imap: 10
processed inside imap: 11
processed inside imap: 12
processed inside imap: 13
processed inside imap: 14
processed inside imap: 15
If the function run inside imap is even a little slower than the source supplying inputs (and if we're bothering to make execution concurrent, this will often be the case!), the imap task queue grows rapidly until it consumes all available system memory, unless the input runs out first.
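To put a number on that growth, here is a minimal sketch that tracks how far the input iterator runs ahead of the results actually consumed. The names `measure_peak_backlog`, `counting_source`, `n_items`, `workers` and `delay` are made up for this illustration and are not part of `multiprocessing`. The exact figure depends on thread timing, so treat it as indicative only.

```python
import threading
import time
from multiprocessing.pool import ThreadPool


def measure_peak_backlog(n_items=16, workers=2, delay=0.05):
    """Return the largest gap seen between inputs pulled and results consumed."""
    pulled = 0
    lock = threading.Lock()

    def counting_source():
        # Counts every item the pool pulls from the input iterator.
        nonlocal pulled
        for x in range(n_items):
            with lock:
                pulled += 1
            yield x

    def slow_function(x):
        time.sleep(delay)  # stands in for real work that is slower than the source
        return x

    peak = 0
    with ThreadPool(processes=workers) as pool:
        results = pool.imap(slow_function, counting_source(), chunksize=1)
        for consumed, _ in enumerate(results, start=1):
            with lock:
                # Items pulled from the source but not yet handed back to us.
                peak = max(peak, pulled - consumed)
    return peak


if __name__ == "__main__":
    print("peak backlog:", measure_peak_backlog())
```

With a pool of 2 workers, a backlog well above 2 means the source was drained far ahead of what the workers and the consumer could keep up with.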
Within imap (and imap_unordered) there are two SimpleQueue structures that can grow to arbitrary length:
Pool._taskqueue will grow to arbitrary length if the input iterator is able to yield input items faster than they are processed by the imap function.
Pool._items will grow to arbitrary length if the process consuming the pool.imap iterator is slower than the imap function processing inputs.
It could be argued that an imap function which respected system memory scarcity would be a "feature". Imap has only 2 advantages over map (that I'm aware of): it can begin mapping inputs to outputs before all of the input is available, and it can work when there is not enough memory to hold all the inputs at once. For users who care about the second (more common?) objective when using imap, respecting memory scarcity is not a feature; failure to respect scarcity is a bug. That's why I've made this a "Bug report" issue.
Your environment
This is environment-independent.