We started developing a way to switch the parallel backend of
I've been using Hugging Face's `datasets` for some computer vision tasks, and `dataset.map(..., num_proc > 1)` is really handy for running things in parallel. But most of my processing happens in C libraries that don't actually hold the GIL (OpenCV, NumPy, Pillow, etc.), and spinning up new subprocesses seems a bit overkill, especially because `dataset.map` uses the fork method, which may carry over some of the parent process's lifecycle callbacks, for example.

I wonder if it would be interesting for `dataset.map()` to accept an optional `pool: concurrent.futures.Executor` parameter, which would be either a `ProcessPoolExecutor` or a `ThreadPoolExecutor`, so the caller could choose which type of parallelization better suits their use case.

There is a gotcha with this proposal: `ProcessPoolExecutor` seems to use `spawn` instead of the `fork` approach currently used by `dataset.map`, so it wouldn't work with inner functions and lambdas. Because of this, we probably shouldn't replace the current `num_proc` way of doing it, but I wonder if a new `pool` parameter could be useful for more people.

If this proposal seems reasonable, I can prepare a PR to further discuss the implementation.
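
To make the idea concrete, here is a rough sketch of the calling convention I have in mind. It is not the `datasets` API: `map_with_pool` is a hypothetical stand-in that works on a plain iterable, and the only thing it assumes is that `pool` follows the standard `concurrent.futures.Executor` interface.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_with_pool(
    fn: Callable[[T], R],
    items: Iterable[T],
    pool: Optional[Executor] = None,
) -> List[R]:
    """Hypothetical helper: apply fn to every item, optionally through an executor."""
    if pool is None:
        # No executor given: run sequentially in the calling thread.
        return [fn(item) for item in items]
    # Executor.map yields results in input order, whichever executor is used.
    return list(pool.map(fn, items))


if __name__ == "__main__":
    # Threads share the parent's memory, so lambdas and closures are fine here.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(map_with_pool(lambda x: x * 2, range(5), pool=pool))
```

Since the helper only relies on `Executor.map`, the caller owns the pool's lifecycle and can pass in a `ThreadPoolExecutor` or a `ProcessPoolExecutor` without the mapping code knowing which one it got.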
EDIT: related bug report: #5976
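
The lambda and inner-function caveat above can be checked without starting any worker processes. `ProcessPoolExecutor` ships callables to its workers with the standard `pickle` module, so a callable that `pickle` rejects cannot be submitted to it. The sketch below only tests picklability; `is_picklable` and `make_inner` are hypothetical names used for illustration.

```python
import math
import pickle


def is_picklable(obj) -> bool:
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_inner():
    # Inner function defined inside another function (a "local object").
    def inner(x):
        return x + 1
    return inner


if __name__ == "__main__":
    print(is_picklable(math.sqrt))        # importable by name
    print(is_picklable(lambda x: x))      # lambda: no importable name
    print(is_picklable(make_inner()))     # inner function: local object
```

Functions that can be imported by name pass this check, which is why a `pool` parameter would work best with module-level processing functions when a process-based executor is passed in.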