🚀 The feature, motivation and pitch
It seems suboptimal to me that we have to create separate optimized ops just to get basic stuff like parallelization (and vectorization, but let's start with parallelization). Here's what I'd like to do: (The timeline here is "ASAP", but I'm opening an issue because this got too long for chat and so that I can point to this issue on the PRs.)
- Set up a proper CMake build for `extension/parallel`; right now it's free-riding on buck and getting automatically duplicated into 3 different targets per the generated `executorch_srcs.cmake`. (done; Add proper CMake build for extension_parallel #8938)
- Make `extension_threadpool` itself export the `-DET_USE_THREADPOOL` macro we already use and define somewhat ad hoc. (done; Properly export ET_USE_THREADPOOL from the threadpool extension #8947)
- Move `extension/parallel/thread_parallel.h` to core. (@larryliu0820 suggests `runtime/kernel/thread_parallel.h`.) (Yes, I will leave a stub header behind for backward compatibility.) Move `thread_parallel.cpp` to threadpool, since there will be no reason not to provide it when threads are available. Provide a default implementation of `parallel_for` for when threadpool is not built (gated behind `ET_USE_THREADPOOL`) that is just an inlinable `for` loop. (Split & remove extension_parallel #8983)
- Use `parallel_for` in at least one portable op, either directly or via the workhorse "util" functions. (Add basic parallel_for support to reduce_util #8986)
- Verify that, because the optimized library is built with threadpool, it gets parallelization. Adjust the build configuration for the optimized ops lib if necessary. (Build optimized_portable_kernels if threadpool is enabled #8987)
- Roll out `parallel_for` across portable ops and workhorse "util" functions.
Thoughts? Blockers?
Alternatives
Status quo: slow portable ops.
Additional context
No response
RFC (Optional)
No response