-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfcs: proposal for a reusable kernel knob #1743
base: rfcs
Are you sure you want to change the base?
rfcs: proposal for a reusable kernel knob #1743
Conversation
|
||
Primitive creation is an expensive operation, which is why oneDNN has a | ||
primitive cache, a persistent cache, and a kernel cache (for the GPU). For GPU | ||
implementations, the kernel cache is optional and must be opted into. Although |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the kernel cache is enabled by default because we agreed to make it an internal optimization without exposing it to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kernel cache is enabled by default externally, but the implementations themselves have to opt into the kernel cache API internally. This is a middle-ground solution decided on when developing the kernel cache internal API, but it also helps avoid filling the kernel cache with single-use kernels that are retrieved from the primitive cache already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, maybe it makes sense to emphasize that this is about enabling GPU implementations to use the kernel cache i.e. making the implementations kernel cache friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This RFC is not about making the implementations kernel cache friendly - there are several implementations that already use this internal API to make use of the kernel cache. This RFC is about allowing oneDNN users to hint that it's preferable to use these kernels over any others, if possible. Since these kernels may have lower performance, this enables the user to specify that lower performance is alright, since primitive creation time might be much faster.
Compared to the build flag it's more flexible, and compared to the environment | ||
variable it's less opaque. | ||
|
||
### Option 1: Using DNNL_RUNTIME_DIM_VAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using DNNL_RUNTIME_DIM_VAL
would require a primitive to contain the entire set of the kernels, which means primitive creation would be very slow as we would have to compile all of the kernels at once. This sounds like the biggest disadvantage of the approach to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an option to do kernel compilation in a lazy way relying on the kernel cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a lazy kernel compilation will mean that we can no longer guarantee that the primitive creation and execution are truly separate because the lazy compilation would have to be performed at the execution time. Also, that would mean that we change the state of the primitive after its creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we would have to compile all of the kernels at once
I don't understand this point. If the user supplies one or more DNNL_RUNTIME_DIM_VAL
values, only an implementation that supports runtime dimensions will get dispatched. The intention is that any runtime dimensions will be passed at runtime instead of being compiled into the kernel, thus making a single kernel work for the entire class of problems with changing runtime sizes.
To be clear, I don't think this solution is optimal, because it limits optimization potential (e.g. we can't rely on divisibility conditions) in addition to limitations on its practical use in frameworks. I just don't see why it would require compilation of all possible kernels at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Simonsays095, the issue here is that, depending on which dimensions are set to DNNL_RUNTIME_DIM_VAL
, there may be multiple reusable specializations that apply. Take for example the simple gen9_eltwise implementation. There is currently a compile time parameter with_overflow that depends on the total buffer size modulo a constant. As such, we would need to compile all (or some implementation specific subset of) applicable specializations to achieve the targeted performance before dispatching problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjoursler how would that be different with the hint vs DNNL_RUNTIME_DIM_VAL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgouicem, there is a significant difference in the amount of information. At a high level, we have a set of specializations and a dispatching heuristic mapping a problem to a specialization. With DNNL_RUNTIME_VAL
, we need to cover multiple problems, and, as a consequence, the dispatcher must map the set of problems to a set of specializations. For the hint, we are just reducing the set of specializations considered by the dispatching heuristic
- The setting is not apparent in the code, leads to opaque behaviour | ||
- Requires internal changes to detect implementation reusability | ||
|
||
### Option 3: Environment variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main question here is, do we want to provide a per-primitive control or a single knob for all primitives? The per-primitive option looks more flexible to me and the primitive attribute (option №4) would be the way to go. User could introduce their own environment variable that would control the primitive attribute if needed.
performance impacts associated with it) or the typical performance-optimized | ||
kernel. This proposal is to add a knob the user can control to specify whether | ||
they would like reusable kernels to be dispatched preferentially over the | ||
standard performance-optimized kernel ordering for GPU implementations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expound on the necessity of this attribute. From my perspective, this is largely a temporary feature enabling users to call reusable implementations that have performance limitations. Once we completely switch to reusable kernels, this switch becomes unnecessary. Can you provide examples that fundamentally require non-reusable implementations to get high performance? @echeresh, are there any such operations you have noted in convolution? To my understanding, the main issue is that multiple layout combinations when combined with the 3d nd_range
can make offset calculation expensive due to the use of division/modulus operations to unpack a dimension (it may be possible to mitigate this by adding padding threads
so division is performed by powers of 2 or by blocking the problem into larger chunks.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide examples that fundamentally require non-reusable implementations to get high performance? @echeresh, are there any such operations you have noted in convolution?
With enough specialization we can get high enough performance. But the level of specialization is important: e.g. due to opportunities for unrolling and reduced arithmetic we may need to have a specialization for 3x3 kernel. In practice it may not be feasible to specialize for all such cases so we have to trade performance for reusability. I think it's fine to assume reusable kernels can give us high enough performance but full JITting can add a little bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echeresh, You are not quite answering what I am asking. While there is definitely a specialization vs performance tradeoff, an attribute to prefer reusable kernels does not really address this. While there is a large spectrum of options here, I am specifically wondering about useful specializations we currently use which are fundamentally opposed to kernel reuse. For an example, consider the following:
int div(int m, int n) {
return m/n;
}
Using our current JIT method, we can implement this operation significantly faster by relying on division by a constant to transform the division into a combination of shifts, adds, and multiplications. I consider this specialization as fundamentally opposed to kernel reuse though, as it requires specialization for every possible input n
. We actually hit this issue in the reusable bathnorm implementation. The process of transforming global_ids -> dim ids -> buffer offset
required division/modulus, so a naive translation had poor performance. This issue was avoided by a more direct calculation of global_ids -> buffer_offset
. This calculation is not always efficient due to packing multiple buffers with n-dimensions into the 3d range supported by the runtime, but in practice, I suspect this is not an issue. I am wondering if you have seen any other "fundamentally opposed" specializations like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using our current JIT method, we can implement this operation significantly faster by relying on division by a constant to transform the division into a combination of shifts, adds, and multiplications.
For this particular operation it's not that bad:
- Division is performed before the main loop hence it adds only a constant overhead per thread
- Division by a non-constant can be done by pre-computing special "magic" values on the host so we end up with similar instructions for shifting/multiplication. It's not for free as this results in extra overhead but this is not something significant
So I don't think there is something fundamentally opposed to kernel reuse. At the same time full JITting provides opportunities for optimization as we can avoid passing/reading kernel arguments and reduce arithmetic/branching in some cases (like division by power of two, unrolling of loops, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @echeresh. That aligns with my expectation, but I think it is important to confirm it. My personal opinion is that we should not maintain the infrastructure for full specialization unless we find "fundamentally opposed" specializations that provide significant benefit (maybe defined as >5% performance increase on multiple KPI's). Instead, I think it will be more productive to focus our attention on improving methods for finding/dispatching/implementing
reusable specializations. While there may be a need for a trade-off controlling attribute, with convolution/gemm the only implementations requiring it, I think nGEN based generation may be fast enough to not justify the extra finding/dispatching
complexity.
Pulling this back around to the details in this RFC @Simonsays095 , if our eventual goal is to have every implementation be reusable, then this attribute will become obsolete. As such, I wonder if it is justified to include it in the library API. I think we need a more clarity on what the long term goal is, whether an attribute is necessary, or if we should investigate a trade-off
or dispatcher method
control attribute more directly, especially as that has appeared in discussions here multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on updating the RFC to address some confusion about the goal, but the short answer is that we need a near-term solution for frameworks to be able to select reusable kernels preferentially, even if they are not optimized or near performance parity with our JIT kernels. In the short term, feedback from frameworks is that such control over the creation time/execution time tradeoff is needed to optimize performance. Additionally, this will allow us to more rapidly develop reusable kernels without requiring performance parity. In the long term, it's possible that such a control becomes obsolete (after many implementations have been rewritten and updated - likely several years from now), however I would argue it's likely that there will always be cases (whether due to limitations of existing implementations or due to fundamental restrictions on reusable kernels) where this tradeoff can be leveraged to improve performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My personal opinion is that we should not maintain the infrastructure for full specialization unless we find "fundamentally opposed" specializations that provide significant benefit
I agree. There are two aspects in the context of moving from one paradigm (JIT) to another (reusable kernels):
- Development resource allocation. From some point going forward we should prefer investing more resources in improving "reusable" kernels rather than JIT kernels
- For some primitives it may be more straightforward to reach the same level and drop the non-reusable kernels completely (think eltwise), for other primitives - we have to maintain two paths for some time
- Performance gap between JIT vs reusable kernels. It will likely take quite a bit of time to close/reduce the gap so we can address this gradually:
- Step 1. Expose a control to the user to opt in for the new reusable kernels (this RFC). This way we can continue development of reusable kernels bringing more performance/functional improvements
- Step 2. Once performance is at a good level and coverage is acceptable we make it default starting from some future architecture
they would like reusable kernels to be dispatched preferentially over the | ||
standard performance-optimized kernel ordering for GPU implementations. | ||
|
||
> **Note:** Reusable kernels are defined as GPU kernels with a finite set of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Finite is not necessarily a requirement. For example, how does finite apply with post-ops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With post-ops we can have reusability only when the number of used post-op chains is reasonably limited. Otherwise in general kernels are not reusable.
Do you have anything more specific in mind for the formal definition of "reusable kernels"? As a criteria, I thought we can say that some functionality is covered with reusable kernels only when they can be potentially converted to AOT kernels hence we need a fairly limited set of unique kernels. With post-ops it's not the case so we can't cover them with reusable kernels given the current implementation approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, a reusable kernel is a specialization intended for use on a "large/notable" set of inputs. For example, a kernel with post-ops may be reused when the same post-ops are applied. This kind of support is still beneficial for dynamic workloads where some of the input data dimensions are changing between iterations, but the core operation is the same, such as can occur with matmul+eltwise
fusion. While AOT compiled kernels must be reusable (and additionally meet memory size constraints), this is a strict subset of what I think we need to support.
helpful to the user to be able to specify if they prefer to use a kernel that | ||
has been designed to be reusable via the kernel cache (accepting any possible | ||
performance impacts associated with it) or the typical performance-optimized | ||
kernel. This proposal is to add a knob the user can control to specify whether |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please elaborate on how/when you expect the end user to use this feature? Do frameworks plan to expose it? Have you discussed it with the framework teams?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have been in discussions with frameworks regarding this feature, which is why I mentioned that option 4 (primitive attribute) is preferred by frameworks. To my knowledge, they don't plan on exposing this option to the user. The idea is to use this knob for models which contain a large number of unique (dynamic) shapes/primitives to give a hint to oneDNN to use flexible/reusable kernels and reduce primitive creation time as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Simonsays095, this solution sounds like something we could do in IPEX, but not in upstream. Did we discuss this with Tensorflow and Pytorch developers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to use this knob for models which contain a large number of unique (dynamic) shapes/primitives to give a hint to oneDNN to use flexible/reusable kernels and reduce primitive creation time as much as possible.
From what I recall of the email chain, it seems FWKs cannot know how dimensions will change between two executions, so FWK/IPEX cannot know when to set the attribute or not. My understanding is that they would always set the hint which would lead to performance regressions when shapes do not effectively change from one run to the next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that they would always set the hint which would lead to performance regressions when shapes do not effectively change from one run to the next.
I think this (enabling this hind by default) is not an option. We expect the users to opt in to use reusable kernels in *some* scenarios when it makes sense to do so (e.g. for dynamic-size models).
- The setting is not apparent in the code, leads to opaque behaviour | ||
- Requires internal changes to detect implementation reusability | ||
|
||
### Option 4: Primitive attribute (preferred) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Runtime dimensions was a mechanism we introduced to address the need for dynamic shapes and implemented for a subset of oneDNN functionality already. I would suggest we stick to it instead of introducing additional set of knobs serving similar purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Runtimes dimensions mechanism has two major issues (I think this may need to be better covered in the RFC):
- The mechanism is relatively hard for the frameworks to use (per their feedback) as it's not trivial to determine a subset of dimensions that are dynamic (for convolution: only spatial dimensions? filter sizes as well? what about strides?). It may be highly model-dependent which requires pre-analysis of the model on the framework side. It's not a simple solution.
- From the implementation side it's not trivial either. Even if we know some dimensions are dynamic - for decent performance we may need many kernels for good coverage. It's not clear when exactly to compile these different kernels. What if in the further runs only a subset of kernels will be used (then the user is paying extra overhead)? If we keep the implementation truly universal - performance is severely limited.
For the user it would be simpler to have mechanisms to control the reusability/performance trade-off at a high level depending on the scenario:
- Reusable kernels: fits dynamic size topologies to be still able to amortize primitive creation across multiple model iterations
- One step forward: for some subset of oneDNN functionality reusable kernels can be converted to AOT kernels to eliminate overhead on the first creation as well
- Full JIT: for the best performance when the model is fixed and there are many model iterations to amortize the creation time from a limited set of unique layers
- Auto-tuning/PGO: for the best performance when the user has enough time to pre-tune layers for a given model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the user it would be simpler to have mechanisms to control the reusability/performance trade-off at a high level depending on the scenario
Actually, the feedback we got was that FWKs don't want to use such a mechanism since they don't have the information necessary:
- in graph mode, only parameters have fixed shapes, but all other shapes can change from one graph execution to the next
- in eager mode, there is not even a notion of graph or "unit" that is executed multiple times.
My understanding is that they would want primitive creation overheads hidden under oneDNN transparently.
From the implementation side it's not trivial either. Even if we know some dimensions are dynamic - for decent performance we may need many kernels for good coverage. It's not clear when exactly to compile these different kernels. What if in the further runs only a subset of kernels will be used (then the user is paying extra overhead)? If we keep the implementation truly universal - performance is severely limited.
I think we are overthinking here. There are only a few actual CNN use-cases in practice:
- all shapes are static
- only minibatch changes (some inference scenarios)
- both minibatch and spatials change from one execution to the next (object detection)
I guess a good starting point would be to cover the first and last bullets (which covers the second bullet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that they would want primitive creation overheads hidden under oneDNN transparently.
Unfortunately this is not feasible in the short term hence we came up with this RFC. Ideally we want fast kernels with no overhead on creation - this would be the perfect scenario but it's not realistic - at least in the near future.
I think we are overthinking here. There are only a few actual CNN use-cases in practice:
all shapes are static
only minibatch changes (some inference scenarios)
both minibatch and spatials change from one execution to the next (object detection)
I guess a good starting point would be to cover the first and last bullets (which covers the second bullet).
Right but I still don't see how to address the issues I mentioned - dynamic batch/spatial may require several kernels (or dozens of kernels for JIT-level performance) - it's not practical to compile them all when say in practice only 1/10 of all kernels are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right but I still don't see how to address the issues I mentioned - dynamic batch/spatial may require several kernels (or dozens of kernels for JIT-level performance) - it's not practical to compile them all when say in practice only 1/10 of all kernels are used.
Correct, this is still part of the creation time vs execution time tradeoff. How does the hint helps though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hint helps to amortize creation time via reusable kernels. The end goal is the same as with DNNL_RUNTIME_DIM_VAL
feature while the hint doesn't provide any strict guarantees.
The main problem with DNNL_RUNTIME_DIM_VAL
is the API imposed flow when the user is asked to create a primitive once to be able to call it for any sizes. With such strict guarantees we can't provide a reasonable creation time and good enough performance - the set of sizes to cover is too large. With the hint it's relaxed - we generate kernels on demand without strict guarantees.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I think I see now thanks! So in a nutshell:
- using DNNL_RUNTIME_DIM_VAL: implementation X which handle dynamic shapes has to jit all its kernels upon creation
- using hint: implementation X which handle dynamic shapes jit only the subset of kernels for the passed shape
So in first case, all is jitted upon first iteration, and then no more jitting.
For the second case, some jitting happen upon first iteration, and extra jitting might or might not happen depending on the useful subset of kernels already jitted.
I guess two things here:
- do we have data on how big the gap is between the two options?
- what about AOT compilation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have data on how big the gap is between the two options?
This is in development, and it will take some time until we can say that our "reusable" kernels have all important optimizations for such a comparison to make sense. As a side point, we are quite sure that the JIT option can always provide an extra boost (say at least 5-10%) so it's probably a good idea for oneDNN to allow users to switch between JIT/reusable kernels depending on the scenario.
what about AOT compilation?
This is an option but it needs more exploring. At a first glance, I see two major arguments against AOT:
- It may not be feasible - consider all data type and post-op combinations. If covering everything the library size explodes significantly. It might be we can only do limited AOT support
- I don't think we have any known scenarios when AOT is practically needed. I expect for real model runs we have enough iterations to amortize JIT compilation time. However this is only possible with reusable kernels - when kernels are reused between iterations even for dynamic shapes.
88550b8
to
16152ff
Compare
I've updated the RFC to hopefully clarify some of the questions raised so far and to add some more context to the motivation and goal of the RFC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there are a few things that need to be clarified with the RFC:
- how will users rely on the new hint? From previously gathered information, it seems FWKs cannot know if relying on reusable kernel is beneficial or not, as this is highly application dependent.
- It is unclear how we would leverage this hint internally. Would it change dispatched implementation? Would it be used by existing implementation to generate code differently?
- it is unclear how it will support progressive support of dynamic shapes. For example, FWKs adopt this new hint before we add implementations relying on the hint. Now we add an implementation which relies on this hint, it gets dispatched, and they see regressions for applications with static shapes, and improvements for applications with dynamic shapes. What is the fix? We tell them to stop using the hint? we improve the impl to be on-par with its static shape counter part?
Are there any other options available that would not rely on new APIs or build knob?
One proposal could be (taking convolution as example):
- to have a generic kernel that support dynamic batch/spatials (either compiled AOT, or upon first convolution primitive creation call).
- upon convolution op creation, we jit the specialized kernel as we do today, but in the background (primitive creation becomes asynchronous)
- upon convolution op execution, if the jitted impl is ready, we use it, if not, we call the generic kernel
The above should allow:
- for static use-case: to cover jitting cost for the first graph execution. The subsequent calls would rely on jitted kernels, as they do today
- for dynamic use-cases: most common shapes will use jitted kernels (depending on cache hit rate) others will just rely on the generic kernel.
I believe @rjoursler had a few options laid out at some point maybe he can help here.
<u>Cons:</u> | ||
- Several primitives have no implementations that support runtime dims - | ||
cannot be adopted until such implementations have been developed | ||
- Support for all possible runtime dim configurations is difficult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is not necessary, we just need to support those that are actually used in practice right ?
cannot be adopted until such implementations have been developed | ||
- Support for all possible runtime dim configurations is difficult | ||
- Requires users to know which dimensions will have dynamic shapes ahead of | ||
time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things here:
- Opting to use an API to control kernel dispatching will always require user to know if shapes will be dynamic or not, and to have knowledge of the tradeoffs (creation-time gains vs execution time loss).
- From FWK perspective, it is never known which shapes will effectively be dynamic since this is an application property:
- in graph mode, only parameters shapes are fixed and data tensor shapes are all dynamic (except those tied to parameter shapes, like input/output channels for example)
- in eager mode, any shape can be dynamic at the op level, since there is no notion of graph or re-executable unit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From FWK perspective, it is never known which shapes will effectively be dynamic since this is an application property:
This is a good point. I would check with the frameworks folks whether the hint option can really work for them (e.g. hypothetically then can delegate setting this hint to the end user). If the hint doesn't look good from their side, I'm afraid we don't have any realistic alternatives at this point besides gradually closing the gap between JIT/reusable path and making the reusable path default starting from some architecture in the future.
- The setting is not apparent in the code, leads to opaque behaviour | ||
- Requires internal changes to detect implementation reusability | ||
|
||
### Option 4: Primitive attribute (preferred) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use this comment to discuss @mgouicem's summary comment. For the future, maybe it would better to leave a summary at a random place so that we can have a discussion by replying to it.
I second Mourad's point that the frameworks such as PyTorch and TensorFlow do not know which dimensions are dynamic and therefore they can either always enable the re-usable kernels or always disable them.
I think there are three use cases that we have seen so far:
- Static dimensions - primitive/kernel cache helps to reduce creation overhead
- Dynamic dimensions with a finite number of unique shapes - primitive/kernel cache with increased capacity helps to reduce creation overhead
- Dynamic dimensions with undefined number of unique shapes - no solution
If the frameworks decide to always enable the re-usable kernels then they will see performance degradations for the first 2 scenarios. Is it something that we have agreed on?
As for the @mgouicem's idea on asynchronous kernel creation, it's an interesting suggestion but it may create problems on its own. First of all in order for the idea to work we need a kernel cache that we already have, which is good. But with that, I think it may not be realistic to define the behavior for the users so that they know what to expect. For example, what if the specialized kernel gets evicted? Do we switch to the generic one or do we create the specialized kernel again? What if the generic kernel gets evicted? What if they both get evicted? How do we ensure deterministic behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, what if the specialized kernel gets evicted? Do we switch to the generic one or do we create the specialized kernel again?
The specialized kernel gets evicted only only if all primitive objects using it are destroyed. Which means for next primitive creation that would need it, the same flow applies:
- primitive creation triggers generic and specialized kernel jitting. Because all calls to convolution primitive creation trigger generic kernel compilation, it will very likely be resident in the kernel cache.
- during primitive execution, if the specialized kernel is ready, it gets used, if not it uses the generic kernel.
What if the generic kernel gets evicted?
It gets recreated upon next convolution primitive creation.
How do we ensure deterministic behavior?
Define deterministic :-). In any case, as soon as a cache is used, the user is at the mercy of cache hit/miss. When using the hint, kernel compilation can be triggered at any time depending on both availability of kernel in the cache, and if the kernel for that specialization was already jitted or not.
I still think the async creation inside oneDNN poses a few problems (e.g. availability of threading runtime at creation, it breaks current model if users want to reuse primitive objects and cleanly separate creation overheads from execution).
Another option though would be to implement this strategy on FWK side.
The generic primitive can be created using DNNL_RUNTIME_DIM_VAL
, which would return unimplemented if we don't have it. They can also decide to use persistent cache APIs to make generic primitive creation faster.
Upon op execution, they can create specialized primitive asynchronously (respecting the FWKs threading runtime), and do something like this:
auto specialized_future = std::async(std::launch::async, [] {
return create_specialized_primitive(args);
};
auto generic_future= std::async(std::launch::async, [] {
return create_generic_primitive(args);
};
auto primitive_to_run = wait_first_successful_complete(generic_future, specialize_future);
execute_primitive(primitive_to_run.get(), args);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specialized kernel gets evicted only only if all primitive objects using it are destroyed.
But this is not how the kernel cache works. The cache holds kernels even if all the primitives that use them are destroyed. That's the point of the cache.
Define deterministic :-).
For example, we define semantics of the runtime dimensions feature as follows. Pay the price of primitive creation once and then use the primitive for multiple shapes. With the primitive cache the users will have to pay the price of creation only once throughout the workload. This is guaranteed.
When it comes to the asynchronous kernels creation it doesn't seem that we can guarantee anything because the kernels can get evicted at any time and therefore we will have to create them again or use the generic kernels if they are still in the cache, which would lead to unexpected performance effects due to the difference in performance of the kernels (generic vs specialized) and the possible need to create some of the kernels from scratch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is not how the kernel cache works. The cache holds kernels even if all the primitives that use them are destroyed. That's the point of the cache.
I meant it the otherway around, a kernel is not destroyed as long as there is a primitive object referencing it.
With the primitive cache the users will have to pay the price of creation only once throughout the workload. This is guaranteed.
Is it? There is no such guarantee if the number of different primitives is higher than cache capacity. The behavior will both depend on distribution of primitive reuse, and eviction policy.
The only difference between generic + specialized
scheme vs only generic
or only specialized
schemes is the number of primitives/kernel to hold in cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant it the otherway around, a kernel is not destroyed as long as there is a primitive object referencing it.
Do you mean the existing kernel cache won't help with that?
Is it? There is no such guarantee if the number of different primitives is higher than cache capacity.
You are right, I assumed that the primitive is in primitive cache that my not always be true. Though my point was that we do provide certain guarantees at the primitive level for the primitives with runtime dimensions. But I I see your point now.
The only difference between generic + specialized scheme vs only generic or only specialized schemes is the number of primitives/kernel to hold in cache.
That's true to some extent. But what if we keep creating primitives for unique shapes? Does it mean that we will keep creating specialized kernels but using the generic ones? Won't it just keep polluting the kernel cache causing the generic kernels to be evicted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the existing kernel cache won't help with that?
Yes and no. It depends on how it gets implemented. If generic kernels are generated AOT, then indeed kernel cache won't help. If generic kernels are jitted, kernel cache is required to keep them alive.
But what if we keep creating primitives for unique shapes? Does it mean that we will keep creating specialized kernels but using the generic ones?
Yes. In some sense, the approach can be summed up as opportunistic: by default we use generic kernels and opportunistically jit the specialized kernel. If we are lucky, it gets reused, if not it goes to waste. It boils down to a question of how much CPU usage matter for workloads running on GPU.
Won't it just keep polluting the kernel cache causing the generic kernels to be evicted?
Yes it will pollute cache, but for each primitive created, we should bump the generic kernel as least recently used. This means there is a very low chance of evicting the generic kernel (unless cache capacity is smaller than number of generic kernels).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the idea of generic + specialized versions as this is a standard method for use in JIT compiled architectures. My initial reaction is that this sounds better than having an attribute for requesting reusable kernels. To some extent, the proposed attribute is intended to call reusable kernels before they are "feature/performance" complete, which is inherently problematic, while the generic + specialized solves the base issue. My main concern with this method is the work required to support a separate generic implementation, it does not sound trivial and it may be better to prioritize other implementations.
In regards to some of the implementation related discussions:
If generic kernels are generated AOT, then indeed kernel cache won't help.
This isn't quite true. It can still cost a (relatively) significant amount of time to transform a precompiled binary into a runtime object. Another point to consider is that, currently, accelerators performance is scaling much faster than general purpose CPU performance. As such, the CPU time required to dispatch work to accelerators is expected to become more critical. Our current amortized target for calling a primitives is on the order of tens of microseconds, while creating a runtime object from binary alone currently take a few milliseconds (I did not investigate if it could be optimized though). As such, we already need to reuse runtime objects an average of 100s of times to hit that target.
In regards to storing generic kernels in the kernel cache, I think this is a bad idea. Instead, my opinion is that we keep them in a separate fixed-size registry. As these are the generic implementations, there should not be very many actual kernels, and caching them just creates edges cases we can avoid at the cost of a small amount of (potentially) unused memory. I also think generic kernels need to be AOT compiled. The idea around launching two expensive kernel creations on first primitive creation seems problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another point to consider is that, currently, accelerators performance is scaling much faster than general purpose CPU performance. As such, the CPU time required to dispatch work to accelerators is expected to become more critical.
There is an activity to address the problem via introducing SYCL Graph (specification). We are collaborating with the team that is working on enabling SYCL Graph for IPEX.
Ultimately, we will have something like CUDA Graph.
Some facts from PyTorch side.
I saw the debates on the "Option 1: Using |
Another aspect is about weight packing. We'd prefer to keep a single copy of weight in the framework. What's the implication to that with this RFC? |
@jgong5, I just want to make sure that I understand what
But in both cases it is not guaranteed that the operations in the created graph will always be called for the same shapes. Do I get it right? |
That's not correct. With the graph mode, static sizes won't change during runtime. |
@jgong5, got it. So we have to assume that the non-static sizes may change at runtime even if in certain cases it may not be true? |
That's correct. |
@Simonsays095, looks like the proposed options (except for the |
@jgong5, at this point our main concern (and hence the RFC) is not how to improve performance of models with truly dynamic shapes (e.g. Hifagan on ATS-M), but how not to introduce big perf regression to the models where there's actually no dynamic shapes. Can you please give us an example, are there dynamic shapes in any well-optimized CNN models on ATS-M like RN-50 or Stable Diffusion? If some parameters in some Conv layers are dynamic according to the IPEX knowledge, we will be prioritizing reusable kernels for these layers, and these layers will see regressions. Based on that, I can see two scenarios:
|
@karturov In general, batch dim could be dynamic for conv and gemm (inner-product and batch matmul) and spatial dim could be dynamic for conv.
I won't say non-static shapes are rare. Also, we cannot always know if the shapes could be changing or not. As I explained previously, for eager mode, we have to assume shapes are changing and for graph mode, we have more info and know what specific dims could be changing for individual ops with sample input shapes. That's why I would prefer a flexible API for frameworks to pass all the known information, e.g., able to mark individual dim as dynamic and provide size hint for them. Then, oneDNN can decide how to serve the things best. A coarser-grained API like "dynamic hint" is also fine but that would leave oneDNN less room to specialize for graph mode in the future.
Do you mean to make the performance of these dynamic shape kernels on par with static shape kernels? |
@jgong5 It makes the kernel implementation more complex (and also results in more scenarios to cover in testing) so we would rather keep it simple at the kernel level: either JIT compile all sizes into the kernel or have an internal list of pre-defined kernels supporting dynamic sizes to dispatch between them.
Yeah, that's true but more fine-grained API is usually harder to develop, support and use so maybe better to stick to the simpler version. Another perspective - we don't have performance data to justify investing into more flexible API.
Yes, in the short term dynamic shape kernels can't reach the same level of performance as JIT kernels so they should not be used unless it's absolutely needed (like for models that are known to expose dynamic shapes). |
If these dynamic hints for individual dims are passed as hints, you have the flexibility to choose the algorithms you like. You can keep it simple to either JIT compile all sizes or use AOT-compiled kernels or something more complicated. Finer-grained semantics not necessary make your implementation harder. |
That adds complexity at our side, but is doable. At least I believe we can do better when, for example, only batch size is dynamic. If these parameters are hints, we still have enough flexibility, in particular to ignore them and still generate JIT once we don't have a good reusable kernel for a combination of dynamic shapes yet. @echeresh, what are your thoughts?
This is a problem, esp. for Eager mode. It means if we implemented the mechanism with per-dimension hints, we would create reusable kernels for all SD Convolutions, and performance would go down. And I guess there will be broad regressions. Hence, again, recommendations are: |
Based on this recommendation it looks like there is a misunderstanding here. As far as I understand, in the graph mode the framework knows what dimensions are static and what aren't. But it's not necessary that those dimensions that aren't static will actually change at execution time. So even such models as ResNet-50 will be affected. |
@densamoilov, I've read @jgong5's comments once again, and I agree with you. It looks like in Eager mode everything is non-static. It means IPEX can't provide any useful information in Eager mode, and the information provided in Graph mode won't not accurate. Hence, regressions are expected in both cases. |
Based on this comment from @jgong5 it's not quite true:
Once the framework can distinguish between dynamic vs static size activations then we can definitely address the following scenario:
By "address" I mean we can provide much lower primitive creation time at the cost of worse kernel performance. This is possible under the assumption the framework will propagate this information (that we are in the graph mode with dynamic size activations) to oneDNN. One important caveat here is that static vs dynamic size activations distinction must be clear. At this point oneDNN got requests only about a few models with dynamic sizes while most models work with static activation sizes. So the expectation is that the framework can reliably detect the first set of models (with dynamic size activations) without any changes in how we handle the other set (with static size activations). |
@jgong5, can you please address the confusion?
|
PyTorch graph mode allows users to specify if the shapes are dynamic or static. In both cases, example inputs with specific shapes are known. For dynamic shapes, the runtime dims might change. For static shapes, those dims won't change and can be specialized during compilation. Does that address your confusion? |
Link to the rendered document