multiprocessing.pool ThreadPool.imap does not respect memory scarcity #101586

Open
@shaundaley39

Description

Bug report

ThreadPool.imap (and imap_unordered) consumes far more memory than it needs to, which degrades the systems that use it without making imap itself any faster. When used to process images, it has consumed many GB of RAM and caused out-of-memory failures.

The problem is that ThreadPool.imap eagerly iterates over the input iterator and loads every input into an internal queue of waiting tasks. It does so without any limit, and without waiting for tasks to be executed or for their outputs to be read by whatever is consuming the imap iterator.

A small example can illustrate the problem:

import time
from multiprocessing.pool import ThreadPool


def slow_function(x: int):
    time.sleep(0.05)
    print("processed inside imap: ", x)

def report_source(x: int):
    print(f"generated input {x}")
    return x

fast_source = map(report_source, range(16))

with ThreadPool(processes=2) as pool:
    list(pool.imap(slow_function, fast_source, chunksize=1))

When run, we see:

$ python reproduction.py 
generated input 0
generated input 1
generated input 2
generated input 3
generated input 4
generated input 5
generated input 6
generated input 7
generated input 8
generated input 9
generated input 10
generated input 11
generated input 12
generated input 13
generated input 14
generated input 15
processed inside imap:  0
processed inside imap:  1
processed inside imap:  2
processed inside imap:  3
processed inside imap:  4
processed inside imap:  5
processed inside imap:  6
processed inside imap:  7
processed inside imap:  8
processed inside imap:  9
processed inside imap:  10
processed inside imap:  11
processed inside imap:  12
processed inside imap:  13
processed inside imap:  14
processed inside imap:  15

If the function run inside imap is even a little slower than the source supplying inputs (which will often be the case if we are bothering to make execution concurrent), the imap task queue grows until either the input is exhausted or all available system memory is consumed.
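
To put a number on this, here is a minimal sketch that counts how far input consumption runs ahead of result consumption. The helper name `measure_backlog` and its parameters are made up for illustration; only `ThreadPool.imap` is the real API. On a typical run the peak gap should be close to the total number of inputs, but the exact value depends on thread scheduling.

```python
import time
from multiprocessing.pool import ThreadPool


def measure_backlog(n_inputs=50, workers=2, delay=0.01):
    """Return (peak, pulled, consumed), where peak is the largest observed
    gap between inputs pulled from the source and results consumed.

    Hypothetical helper for illustration, not part of multiprocessing.
    """
    pulled = 0

    def source():
        # Counts every item the pool pulls from the input iterator.
        nonlocal pulled
        for i in range(n_inputs):
            pulled += 1
            yield i

    def slow(x):
        time.sleep(delay)
        return x

    peak = 0
    consumed = 0
    with ThreadPool(processes=workers) as pool:
        for _ in pool.imap(slow, source(), chunksize=1):
            consumed += 1
            peak = max(peak, pulled - consumed)
    return peak, pulled, consumed


if __name__ == "__main__":
    peak, pulled, consumed = measure_backlog()
    print(f"peak backlog: {peak} of {pulled} inputs")
```

A peak near `n_inputs` means nearly the whole input was held in memory at once, whatever the number of workers.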

Within imap (and imap_unordered) there are two internal queues that can grow to arbitrary length:

  • Pool._taskqueue will grow to arbitrary length if the input iterator yields items faster than the imap function processes them.
  • Pool._items will grow to arbitrary length if the code consuming the pool.imap iterator is slower than the imap function processing inputs.
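
As a possible user-level workaround for the first of these, the input iterator can be wrapped so that it blocks once too many items are outstanding. The sketch below is not a proposed patch: `bounded_imap` and `max_pending` are hypothetical names, and it assumes (as in current CPython) that ThreadPool iterates the input from its own task-feeding thread, so blocking inside the wrapper does not stall the workers.

```python
import threading
from multiprocessing.pool import ThreadPool


def bounded_imap(pool, func, iterable, max_pending):
    """Like pool.imap(func, iterable), but pull at most `max_pending`
    inputs ahead of the results the caller has finished with.

    Hypothetical helper for illustration, not part of multiprocessing.
    """
    if max_pending < 1:
        raise ValueError("max_pending must be at least 1")
    slots = threading.Semaphore(max_pending)
    stop = threading.Event()

    def throttled():
        # Assumption: this generator is iterated by the pool's
        # task-feeding thread, so waiting here only pauses input.
        it = iter(iterable)
        while True:
            slots.acquire()
            if stop.is_set():
                return
            try:
                item = next(it)
            except StopIteration:
                return
            yield item

    try:
        for result in pool.imap(func, throttled(), chunksize=1):
            yield result
            slots.release()  # caller asked for more: free one slot
    finally:
        # Unblock the feeding thread if the caller stops early.
        stop.set()
        slots.release()


if __name__ == "__main__":
    with ThreadPool(processes=2) as pool:
        for y in bounded_imap(pool, lambda x: x * x, range(8), max_pending=3):
            print(y)
```

Because a slot is released only when the caller asks for the next result, this also bounds the backlog of unread results. The cost is that a slow consumer now throttles the workers, which is the intended trade-off here.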

It could be argued that an imap which respected system memory scarcity would be a "feature". But imap has only two advantages over map that I'm aware of: it can begin mapping inputs to outputs before all of the input is available, and it can work where there is not enough memory to hold all the inputs simultaneously. For users who care about the second (more common?) objective, respecting memory scarcity is not a feature; failing to respect it is a bug. That is why I have filed this as a "Bug report" issue.

Your environment

This is environment-independent.

Labels: performance (Performance or resource usage), stdlib (Python modules in the Lib dir), topic-multiprocessing, type-bug (An unexpected behavior, bug, or error)
