[SYCL] Events caching for in-order queues in the L0 plugin #6643
Fix for SYCL/Plugin/level_zero_device_scope_events.cpp test is here: intel/llvm-test-suite#1179
Event->IsDiscarded) &&
!Event->hasExternalRefs()) {
CandidateForReuse =
new _pi_event(Event->ZeEvent, Event->ZeEventPool, Context,
Where is this one going to be deleted? Why do you need to allocate a whole new PI event?
where is this one going to be deleted?
Note: I will try to get rid of CandidateForReuse and use only LastCommandEvent, per the comment below.
pi_event objects are deleted in piEventRelease; we reuse only the native handles. pi_event objects are just wrappers that get created and then deleted in piEventRelease, while the ze_event handles are reused.
why do you need to allocate a whole new PI event?
We need to allocate a whole pi_event when caching at the queue level. A pi_event object encapsulates data associated with a particular command (kernel, mapped buffer, list of dependent events, etc.), and we need to act on this data when the event completes (release the kernel, free allocated memory, clean up the list of dependent events appropriately).
But, as you've summarized in this comment, #6643 (comment), we reuse native event handles before the pi_event is completed (for an in-order queue we have more information about the dependencies between commands). We can't reuse the whole pi_event object before its completion, so we need to re-create a pi_event using the same native handle.
Example:
- kernelA -> pi_event1 {ze_event1, ptr to kernelA, deplist1}
- kernelB -> pi_event2 {ze_event2, ptr to kernelB, deplist2}
- reset ze_event1 -> at this point pi_event1 is not completed yet, so we can't reuse the whole pi_event1, because we are still going to need the information in pi_event1, such as its event list, deplist1. So we create another object, pi_event3, with the same native handle ze_event1.
- kernelC -> pi_event3 {ze_event1, ptr to kernelC, deplist3}
As you can see, pi_event1 and pi_event3 have the same native handle, ze_event1, but different data, because they are associated with different commands. We need to keep all this information because neither pi_event1 nor pi_event3 is completed yet, and we are going to need it when they complete.
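The example above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `NativeHandle`, `EventWrapper`, and `buildExample` are hypothetical names, not types or functions from the plugin, and the dependency lists are assumed to follow the in-order chain (each command depends on the previous one).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the plugin's types.
using NativeHandle = int; // plays the role of a ze_event handle

struct EventWrapper { // plays the role of a pi_event object
  NativeHandle Native;                       // may be shared with other wrappers
  std::string Command;                       // per-command data, e.g. the kernel
  std::vector<const EventWrapper *> DepList; // per-command dependency list
};

// Builds the three wrappers from the example: kernelA, kernelB, then kernelC,
// which reuses the native handle of kernelA after its reset was submitted.
std::vector<std::unique_ptr<EventWrapper>> buildExample() {
  const NativeHandle ZeEvent1 = 1, ZeEvent2 = 2;
  std::vector<std::unique_ptr<EventWrapper>> Events;
  Events.push_back(std::make_unique<EventWrapper>(
      EventWrapper{ZeEvent1, "kernelA", {}}));
  Events.push_back(std::make_unique<EventWrapper>(
      EventWrapper{ZeEvent2, "kernelB", {Events[0].get()}}));
  // The first wrapper is still alive here: its per-command data is needed
  // until it completes, so a new wrapper is created around the same handle.
  Events.push_back(std::make_unique<EventWrapper>(
      EventWrapper{ZeEvent1, "kernelC", {Events[1].get()}}));
  return Events;
}
```

The first and third wrappers share a native handle but carry different command data, which is the reason a whole new wrapper object is allocated instead of recycling the old one.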
if (DisableEventsCaching)
  return false;

bool ProfilingEnabled = (Properties & PI_QUEUE_PROFILING_ENABLE) != 0;
Why can't we reuse events that have no external references even if the queue has profiling enabled?
I guess we can reuse them; I will fix this.
Fixed.
* Get rid of unnecessary storage of events.
* Make sure there are no leaks of created copies.
* Lock event when accessing the data members.
* Reset event before reusing.
* Bump external reference count of the proxy event for each event which is externally visible.
* Some clarification comments.
HIP backend AtomicRef failures are unrelated.
Hi @steffenlarsen, @sergey-semenov, could you please review these changes (as far as I know, you have contributed to the L0 plugin)? Sergey Maslov is on sick leave.
* Use stoi instead of atoi.
* Get rid of piEventReleaseExternal, add comments to other event release functions.
* Return queue lock which was removed accidentally.
sycl/doc/EnvironmentVariables.md
Outdated
@@ -187,6 +187,7 @@ variables in production code.</span> | |||
| `SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS` | Any(\*) | Enable support of device-scope events whose state is not visible to the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=1 the Level Zero plugin would create all events having device-scope only and create proxy host-visible events for them when their status is needed (wait/query) on the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=2 the Level Zero plugin would create all events having device-scope and add proxy host-visible event at the end of each command-list submission. The default is 2, meaning only the last event in a batch is host-visible. | | |||
| `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` | Integer | When set to a positive value enables use of Level Zero immediate commandlists, which means there is no batching and all commands are immediately submitted for execution. Default is 0. Note: When immediate commandlist usage is enabled it is necessary to also set SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS to either 0 or 1. | | |||
| `SYCL_PI_LEVEL_ZERO_USE_MULTIPLE_COMMANDLIST_BARRIERS` | Integer | When set to a positive value enables use of multiple Level Zero commandlists when submitting barriers. Default is 0. | | |||
| `SYCL_PI_LEVEL_ZERO_INORDER_QUEUE_REUSE_EVENTS` | Integer | This environment variable controls an optimization for in-order queues which allows to reuse uncompleted Level Zero events in scope of the same queue based on the dependency chain between commands. When set to 1 the plugin will not perform this optimization. When set to 2 the plugin will reuse only explicitely discarded device-scope events for in-order queues. When set to 3 the plugin will try to reuse all device-scope events based on reference counting. Default is 3. | |
When set to 1 the plugin will not perform this optimization.
I suggest making 0 mean no optimization.
Looks like the implementation already does that.
Yes, thanks for noticing, fixed.
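For reference, here is a minimal sketch of how the 0/1/2 scheme could be read from the environment with std::stoi. It is not the plugin's code: `ReuseLevel`, `parseReuseLevel`, and `readReuseLevelFromEnv` are hypothetical names, and falling back to the default for missing, malformed, or out-of-range values is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical names; the values mirror the documented 0/1/2 meaning.
enum class ReuseLevel { Disabled = 0, DiscardedOnly = 1, All = 2 };

// Maps the raw text of the variable to a level. Assumption of this sketch:
// unset, non-numeric, or out-of-range input falls back to the default (2).
ReuseLevel parseReuseLevel(const char *Raw) {
  const ReuseLevel Default = ReuseLevel::All;
  if (!Raw)
    return Default;
  try {
    const std::string Text(Raw);
    std::size_t Pos = 0;
    const int Value = std::stoi(Text, &Pos);
    if (Pos != Text.size())
      return Default; // trailing garbage such as "1x"
    switch (Value) {
    case 0:
      return ReuseLevel::Disabled;
    case 1:
      return ReuseLevel::DiscardedOnly;
    case 2:
      return ReuseLevel::All;
    default:
      return Default;
    }
  } catch (const std::exception &) {
    // std::stoi throws std::invalid_argument or std::out_of_range.
    return Default;
  }
}

ReuseLevel readReuseLevelFromEnv() {
  return parseReuseLevel(
      std::getenv("SYCL_PI_LEVEL_ZERO_INORDER_QUEUE_REUSE_EVENTS"));
}
```

Unlike atoi, std::stoi reports malformed input through exceptions, so a typo in the variable does not silently turn into 0 and disable the optimization.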
// Returns bool value indicating whether queue supports in-order optimization
// for provided type of event.
bool _pi_queue::supportsInOrderQueueOptimization(bool HostVisible,
bool IsDiscarded) {
There can be multiple different optimizations for in-order queues, so maybe name this one more specifically?
Renamed to be more specific.
// Copy of the last command event which is suitable for reuse.
// It will be put into the cache when new command is submitted to the in-order
// queue.
pi_event CandidateForReuse = nullptr;
pi_event CandidateForReuse = nullptr;
pi_event InOrderEventCandidateForReuse = nullptr;
Renamed.
// Controls the level of reusing events for in-order queues.
static const enum CachingLevel {
// Controls the level of reusing events for in-order queues.
static const enum CachingLevel {
// Controls the level of reusing events for in-order queues.
static const enum InOrderQueueReuseEventsLevel {
Fixed.
| `SYCL_PI_LEVEL_ZERO_INORDER_QUEUE_REUSE_EVENTS` | Integer | This environment variable controls an optimization for in-order queues which allows to reuse uncompleted Level Zero events in scope of the same queue based on the dependency chain between commands. When set to 0 the plugin will not perform this optimization. When set to 1 the plugin will reuse only explicitely discarded device-scope events for in-order queues. When set to 2 the plugin will try to reuse all device-scope events based on reference counting. Default is 2. | |
| `SYCL_PI_LEVEL_ZERO_INORDER_QUEUE_REUSE_EVENTS` | Integer | This environment variable controls an optimization for in-order queues which allows to reuse uncompleted Level Zero events in scope of the same queue based on the dependency chain between commands. When set to 0 the plugin will not perform this optimization. When set to 1 the plugin will reuse only explicitely discarded device-scope events for in-order queues. When set to 2 the plugin will try to reuse all device-scope events based on reference counting. Default is 2. | | |
| `SYCL_PI_LEVEL_ZERO_INORDER_QUEUE_REUSE_EVENTS` | Integer | This environment variable controls an optimization for in-order queues which allows the reuse of uncompleted Level Zero events in scope of the same queue based on the dependency chain between commands. When set to 0 the plugin will not perform this optimization. When set to 1 the plugin will reuse only explicitly discarded device-scope events for in-order queues. When set to 2 the plugin will try to reuse all device-scope events based on reference counting. Default is 2. | |
For in-order queues we can reuse events even before they are completed and released. A simplified scheme looks like this:
submit command1 [] -> pi_event1 (ze_event1)
submit command2 [dep pi_event1] -> pi_event2 (ze_event2)
submit reset ze_event1
submit command3 [dep pi_event2] -> pi_event3 (ze_event1)
So, in this example the same native handle ze_event1 is used for command1 and command3.
There are several levels of caching supported:
For 3, an external reference count is used to track the number of external references. An event can be reused only when this count drops to zero.
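The scheme and the reference-count rule can be modeled with a small, self-contained sketch. Everything here is hypothetical (`InOrderQueueModel`, `Event`, `submit`); it is a toy model, not the plugin's implementation. One simplification is assumed: the external reference count of an event is fixed at submission time, whereas in practice it would change as the user retains and releases events.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy model; not the plugin's data structures.
using NativeHandle = int; // plays the role of a ze_event handle

struct Event {
  NativeHandle Native;
  unsigned ExternalRefs; // references held outside the plugin
};

class InOrderQueueModel {
  std::vector<NativeHandle> Cache; // handles whose reset has been submitted
  NativeHandle NextFresh = 1;      // next never-used handle
  bool HasLast = false;
  Event Last{};                    // event of the most recent command

public:
  // Submits a command and returns the native handle it signals.
  NativeHandle submit(unsigned ExternalRefs) {
    // Take a handle for the new command: a cached one if available.
    NativeHandle Handle;
    if (!Cache.empty()) {
      Handle = Cache.back();
      Cache.pop_back();
    } else {
      Handle = NextFresh++;
    }
    // The new command depends on the previous one, so once it is submitted
    // the previous handle can be reset and cached, but only if nobody
    // outside holds a reference to that event.
    if (HasLast && Last.ExternalRefs == 0)
      Cache.push_back(Last.Native);
    Last = Event{Handle, ExternalRefs};
    HasLast = true;
    return Handle;
  }
};
```

With no external references, three submissions yield handles 1, 2, 1, matching the scheme above. If the first event is externally referenced, its handle never enters the cache and the third command gets a fresh handle instead.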
E2E tests: intel/llvm-test-suite#1263