Option to run Dataset.map with parallel threads instead of parallel processes #5977

pappacena · 2023-06-21T21:32:14Z


I've been using Hugging Face's datasets for some computer vision tasks, and dataset.map(..., num_proc > 1) is really handy for running things in parallel. But most of my processing happens in C libraries that don't actually hold the GIL (OpenCV, numpy, Pillow, etc.), and spinning up new subprocesses seems a bit overkill, especially because dataset.map uses the fork method, which may carry over some of the parent process's lifecycle callbacks, for example.

I wonder if it would be interesting for dataset.map() to accept an optional pool: concurrent.futures.Executor parameter, which would be either a ProcessPoolExecutor or a ThreadPoolExecutor, so the caller could choose which type of parallelization best suits their use case.
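A rough sketch of the idea, written against plain lists of dicts rather than the real datasets internals. `map_with_executor` and its `pool` argument are hypothetical names used only for illustration; nothing here is existing Dataset.map API.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def map_with_executor(
    examples: List[Dict],
    function: Callable[[Dict], Dict],
    pool: Optional[Executor] = None,
) -> List[Dict]:
    """Hypothetical helper: apply `function` to each example, optionally in parallel."""
    if pool is None:
        # No executor given: run sequentially in the calling thread.
        return [function(example) for example in examples]
    # Executor.map preserves input order, whichever executor type is passed.
    return list(pool.map(function, examples))


def add_area(example: Dict) -> Dict:
    # Stand-in for GIL-releasing work such as an OpenCV or Pillow call.
    return {**example, "area": example["width"] * example["height"]}


examples = [{"width": w, "height": h} for w, h in [(2, 3), (4, 5), (6, 7)]]

# The caller picks the executor type; here, threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = map_with_executor(examples, add_area, pool=pool)

sequential = map_with_executor(examples, add_area)
```

Because the helper only relies on the `Executor.map` interface, the caller could pass a ProcessPoolExecutor instead without changing the helper, subject to the pickling constraints of process pools.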

There is a gotcha with this proposal: ProcessPoolExecutor seems to use spawn instead of the fork approach currently used by dataset.map, so it wouldn't work with inner functions and lambdas. Because of this, we probably shouldn't replace the current num_proc way of doing it, but I wonder if a new pool parameter could be useful for more people.
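One way to see the limitation with only the standard library: the pickle module cannot serialize lambdas or inner functions, and a process pool has to pickle the callables it sends to its workers, whereas a thread pool calls them directly. The helper names below (`is_picklable`, `make_scaler`) are made up for this illustration.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def is_picklable(obj) -> bool:
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_scaler(factor):
    # Inner function: it closes over `factor` and cannot be looked up by name.
    def scale(x):
        return x * factor
    return scale


double = lambda x: x * 2
triple = make_scaler(3)

lambda_ok = is_picklable(double)   # lambdas are not picklable
inner_ok = is_picklable(triple)    # neither are inner functions
builtin_ok = is_picklable(len)     # importable functions are

# Threads share the parent's memory, so no pickling is involved.
with ThreadPoolExecutor(max_workers=2) as pool:
    doubled = list(pool.map(double, [1, 2, 3]))
    tripled = list(pool.map(triple, [1, 2, 3]))
```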

If this proposal seems reasonable, I can prepare a PR to further discuss the implementation.

EDIT: related bug report: #5976

lhoestq · 2023-06-22T16:00:57Z

Maintainer

We started developing a way to switch the parallel backend of datasets based on joblib in the datasets.parallel submodule.

joblib has a "threading" backend, which could help here.
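For reference, this is roughly what joblib's threading backend looks like when used directly, outside of datasets. It only illustrates the joblib API itself and says nothing about how the datasets.parallel submodule wires it in; `square` is a placeholder for the real per-example work.

```python
from joblib import Parallel, delayed


def square(x):
    # Placeholder for work that releases the GIL (e.g. a numpy or OpenCV call).
    return x * x


# backend="threading" runs the tasks in a thread pool instead of worker processes.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(square)(x) for x in range(5)
)
```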

2 replies

pappacena Jun 25, 2023
Author

@lhoestq that's great! Let me know if I can help with something.

lhoestq Jun 26, 2023
Maintainer

I opened #5991 with some details of what's missing in case you're interested :)
