Add basic parallel_for support to reduce_util #8986
Conversation
c10::irange is how we should be writing loops.
Sorry, where is the outer loop?
bool parallel_for(const int64_t begin, const int64_t end, const Func& func) {
  return parallel_for(begin, end, internal::GRAIN_SIZE, func);
}
I feel we should let this be explicit instead of providing this convenience. Each operator, in my experience, has its own sweet spot, and with this convenience function you might think you just need to slap parallel_for on and it's OK. So I feel it's fine to not have the convenience function, just to make authors aware of grain size. Just my opinion.
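For concreteness, a minimal sketch of the explicit form, assuming the parallel_for(begin, end, grain_size, func) overload implied by the diff above and an ATen-style (begin, end) range lambda; the function name and the grain size value here are purely illustrative:

// Sketch: the caller picks a grain size explicitly instead of relying on a
// defaulted internal::GRAIN_SIZE. The value 32 is illustrative only, not a
// recommendation; the point is that the author had to think about it.
void add_scalar_inplace(float* data, int64_t numel, float value) {
  constexpr int64_t kGrainSize = 32;  // hypothetical, chosen per operator
  parallel_for(0, numel, kGrainSize, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] += value;
    }
  });
}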
Each operator, in my experience, has its own sweet spot
Do you have any good references on tuning this? I'm having trouble envisioning the sweet spot being dependent on the operator and also independent of the specific CPU, model, input dimensions, etc. My working assumption was that we just have to pick some number.
I'm having trouble envisioning the sweet spot being dependent on the operator and also independent of the specific CPU, model, input dimensions, etc.
I was meaning to say that it is dependent on all of that: operator, input sizes, CPU, etc. For empirical reasons the former, like operator and input sizes, are the easiest to target, and more advanced things can be done with the CPU/SoC. I remember tuning bmm/concat, pytorch/pytorch#59596
tuning bmm/concat, pytorch/pytorch#59596
I don't think this type of tuning is generally appropriate -- the CPU operators support all CPUs; attempting to optimize for specific characteristics of one CPU is presumably going to hurt the other ones. If you want a CPU-tuned operator it should be built for that CPU.
For the specific case of grain size tuning, we could hypothetically do something similar to dtype selective build where we let the user compile in a model-specific set of carefully selected grain sizes. I suspect there would be a simpler way to get that job done, though.
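Purely as a sketch of that hypothetical direction (none of these macro or constant names exist today; a build system would have to define them per model based on offline profiling):

#include <cstdint>

// Hypothetical: let the build define per-operator grain sizes, analogous to
// dtype selective build. ET_GRAIN_SIZE_REDUCE is an invented name.
#ifndef ET_GRAIN_SIZE_REDUCE
#define ET_GRAIN_SIZE_REDUCE 32768  // fallback default when nothing is selected
#endif

namespace internal {
constexpr int64_t kReduceGrainSize = ET_GRAIN_SIZE_REDUCE;
}  // namespace internal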
I'll just drop the default grain size version for now; we can always find/replace it back.
To be fair, the Microsoft Parallel Patterns Library's parallel_for agrees with @mergennachin. I just think when we make things with the same name as a PyTorch core primitive, they should work the same; ExecuTorch is PyTorch, to the extent that IMO it is just an unfortunate detail of the present situation that ExecuTorch doesn't just live in pytorch/pytorch.
I don't quite follow why you find the existing API ugly. The other, separate reason to either keep the same API or wholesale upgrade everything is convenient code sharing, where the same abstractions are used for parallelization. This is used in the AO low-bit kernels, for example.
The existing code is today converted to something that looks like it has two for loops, whereas this looks like it has one for loop.
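For readers following the thread, the shape under discussion is roughly the following (a simplified sketch; out_numel, out_data, reduce_one_output, and grain_size are stand-ins, not the actual reduce_util code):

// Before: a single, obvious loop over output elements.
for (int64_t out_ix = 0; out_ix < out_numel; ++out_ix) {
  out_data[out_ix] = reduce_one_output(out_ix);
}

// After: ATen/ExecuTorch-style parallel_for. The lambda receives a
// sub-range [begin, end) and loops over it itself, so two loops are now
// visible at the call site even though each element is still visited once.
parallel_for(0, out_numel, grain_size, [&](int64_t begin, int64_t end) {
  for (int64_t out_ix = begin; out_ix < end; ++out_ix) {
    out_data[out_ix] = reduce_one_output(out_ix);
  }
});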
@swolchok -- well, in theory portable ops was started first and readability was one of the original goals, so it looks like we're regressing with this (though today, with all the macros, it's hard to understand what's going on anyway, lol), but it is unfortunate that pytorch/pytorch's parallel_for works this way. I agree with the argument that consistency between PT and ET is important here, so not blocking the work here.
Also, one idea is that we can have multiple versions of parallel_for.
Isn't that just a matter of how the lambda is constructed? I could also construct the lambda the other way around.
But I do see your point about parallel_for itself constructing the output loop around the lambda.
This is exactly what I said to Scott. But then, because of this and other nested lambdas and all the convoluted stuff, I didn't try to hold my ground on readability anymore.
I feel we should just pick parallel_for_each or some such new API if we go down that route.
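A sketch of what such a per-element API could look like, built on top of the range-based parallel_for; parallel_for_each here is only the name suggested above, not an existing ExecuTorch or ATen function:

// Hypothetical parallel_for_each: the helper owns the element loop and the
// caller supplies a per-element function, so only one loop is visible at
// the call site.
template <typename Func>
void parallel_for_each(int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
  parallel_for(begin, end, grain_size, [&](int64_t range_begin, int64_t range_end) {
    for (int64_t i = range_begin; i < range_end; ++i) {
      func(i);  // one call per element
    }
  });
}

The call site would then read like a single loop, e.g. parallel_for_each(0, out_numel, grain_size, [&](int64_t out_ix) { out_data[out_ix] = reduce_one_output(out_ix); });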
Noting that CI is 100% green, in case of a rebase.
This is incorrect -- the per-element function for reductions cares about the order in which it is applied. We need to port parallel_reduce, and we need to figure out some way to get tests to fail with this as written.
Solved by making parallel_for iterate backward in debug builds. (Will send the PR after the bottom ones get reviewed -- I can only stack up 8 PRs before ghstack tells me my stack is too big and GitHub will rate limit me.)
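A single-threaded sketch of that debug-build trick, to make the idea concrete (not the actual implementation; the name, chunking, and release-mode behavior are simplified):

#include <algorithm>
#include <cstdint>

// Sketch: in debug builds, hand out chunks back to front, so a per-element
// function that silently depends on front-to-back application order gives
// wrong results under test. Release builds here just run one chunk; the
// real version would dispatch chunks to worker threads.
template <typename Func>
bool parallel_for_sketch(int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
#ifndef NDEBUG
  for (int64_t chunk_end = end; chunk_end > begin; chunk_end -= grain_size) {
    const int64_t chunk_begin = std::max(begin, chunk_end - grain_size);
    func(chunk_begin, chunk_end);
  }
#else
  func(begin, end);
#endif
  return true;
}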
Initial parallel_for integration in a portable op. Needed for #8932. Feel free to hold review until the rest of the stack is ready and we observe successful parallelization.