-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL] Introduce the Level Zero plugin #1718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Plugin itself consists of the header and the source file plus cmake file to build the plugin. Also the following changes were made to suport the Level Zero plugin in SYCL RT: * New level0 value was added to backend enum * New PI_LEVEL0 value support was added to SYCL_BE config. * Docs were updated. Mentioned Level Zero backend and provided the link to the Level Zero runtime for Intel GPU. * Changes in sycl cmake file to build level0 plugin by default and to install it with sycl toolchain. LIT testing with PI_LEVEL0 backend will be enabled in the following commits. This commits introduces the plugin and makes it buildable. Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
L0 plugin doesn't support piProgramCompile/piProgramLink commands, program is built during piProgramCreate. Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
Please don't review the fist commit "[SYCL] Untie PI functions from OpenCL" which is reviewed separately #1717 |
* Fix clang-format issue * Fix using statement Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
sycl/doc/GetStartedGuide.md
Outdated
Please, refer to [the Release Notes](../ReleaseNotes.md) for recommended Intel | ||
runtime versions. | ||
|
||
The `GPU` runtime that is needed to run DPC++ application on Intel `GPU` devices | ||
can be downloaded from the following web pages: | ||
To run DPC++ application on Intel `GPU` devices the OpenCL `GPU` runtime or the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1699 have this wording changed a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, according to new wording I just need to specify the version of the L0 runtime in buildbot/dependency.conf. I am not sure if the purpose of this file is just to provide information about version or if it used in some scripts. I will ask people working on CI and then will update this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buildbot/dependency.conf is used in CI scripts. Added comment to this file explaining that there is a same driver for Level Zero and OpenCL,
Align with changes made for OpenCL plugin in intel#1638: * Fix build on Windows by defining __SYCL_BUILD_SYCL_DLL in cmake file * Export only pi* symbols in libpi_level0.so * Add test to check exported symbols Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
* Add link to the Level Zero specification * Revert changes in the section providing links to download low level runtimes. Links are supposed to be provided in buildbot/dependency.conf according to upcoming changes intel#1699 Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
@againull Could you please PR your patch even more? Some ideas: all changes not in Level0 dir go to a separate PRs; group PI API implementations(queue related, program related) and create separate PRs for each of such group. |
@romanovvlad did you want to write this comment to #1723 ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments, nice to see more non-opencl plugins added :-)
break; | ||
case ZE_RESULT_ERROR_UNKNOWN: | ||
ErrorString = "ZE_RESULT_ERROR_UNKNOWN"; | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be easier to maintain with a macro for each case, e.g.
switch(ZeError) {
#define ZE_ERRCASE(ERR)\
case ERR:\
ErrorString = ""#ERR;\
break;
ZE_ERRCASE(ZE_RESULT_ERROR_OVERLAPPING_REGIONS);
ZE_ERRCASE(ZE_RESULT_ERROR_UNKNOWN);
// ...
#undef ZE_ERRCASE
} // switch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the code, fixed.
void (*PFnNotify)(pi_event Event, | ||
pi_int32 EventCommandStatus, | ||
void *UserData), | ||
void *UserData) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Callbacks are not used by SYCL-RT anymore. Among other problems, they caused performance issues on the CUDA backend. This code is not used, could be removed - we are going to mark it as not implemented in the CUDA backend soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the body. I think this functions should be removed from PI in the separate commit as they are not used in SYCL RT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
return PI_SUCCESS; | ||
} | ||
|
||
pi_result piEventSetStatus(pi_event Event, pi_int32 ExecutionStatus) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used by SYCL RT anymore (thanks @s-kanaev !)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made same change as for piEventSetCallback.
&ZeSamplerDesc, // TODO: translate properties | ||
&ZeSampler)); | ||
|
||
*RetSampler = new _pi_sampler(ZeSampler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On error or bad_alloc this should return error code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
if (Res != PI_SUCCESS) | ||
return Res; | ||
|
||
piEventCreate(Queue->Context, Event); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is piEventCreate twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
assert(Flags == PI_MEM_FLAGS_ACCESS_RW); | ||
assert(BufferCreateType == PI_BUFFER_CREATE_TYPE_REGION); | ||
assert(!(static_cast<_pi_buffer *>(Buffer))->isSubBuffer() && | ||
"Sub-buffer cannot be partitioned"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a level0 restriction or just keeping the existing SYCL/OpenCL limitations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, just keeping the existing SYCL/OpenCL limitations.
|
||
// L0 doesn't do the reference counting, so we have to do. | ||
// Must be atomic to prevent data race when incrementing/decrementing. | ||
std::atomic<pi_uint32> RefCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also a problem/question we have on the CUDA plugin: Is reference counting required by SYCL-RT? seems from what I understand of the SYCL-RT code, Increment/Decrement of all PI objects is only used for interoperability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand we need to do reference counting to properly release the allocated resources.
@smaslov-intel Could you please give your input here.
Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
* Use macro in zeParseError function which is easier to maintain * Fix env variable naming * Remove implementation of piEventSetCallback and piEventSetStatus, make them deprecated. These functions should be removed from PI in the separate patch because they are not used in the SYCL RT. Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
usm_device_allocations = PI_USM_DEVICE_SUPPORT, | ||
usm_host_allocations = PI_USM_HOST_SUPPORT, | ||
usm_shared_allocations = PI_USM_SINGLE_SHARED_SUPPORT, | ||
usm_device_allocations = PI_USM_DEVICE_SUPPORT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting changes unrelated to this PR
Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
* Added license header to pi_level0.cpp/pi_level0.hpp * use unordered_map instead of map where possible * Refactor _pi_mem::addMapping according to suggestion * Add sanity checks to piProgramCompile and piPrgoramBuild * Narrowed extern "C" to exported functions. * Other minor fixes Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interesting.
A few questions and nitpicks.
class ZeCall { | ||
private: | ||
// The global mutex that is used for total serialization of L0 calls. | ||
static std::mutex GlobalLock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this efficient?
// Controls L0 calls tracing in zePrint. | ||
static bool ZeDebug = false; | ||
|
||
static void zePrint(const char *Format, ...) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need to provide old unsafe C varargs interface in a C++17 code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction. Library source code is still uses C++14.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, does it requires to go down to C99?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I do not remember the real reason for C++14 only if SYCL goes to C++17?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, a few SYCL features requiring C++17 are implemented in "header-only" style, so they are not requiring the runtime library to be built with C++17 enabled.
We keep using C++14 features in the library to keep the libstdc++ requirement low and make the implementation more portable.
// write (as it is OK for multiple threads to read the map without locking it). | ||
|
||
pi_result _pi_mem::addMapping(void *MappedTo, size_t Offset, size_t Size) { | ||
std::lock_guard<std::mutex> Lock(MappingsMutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::lock_guard<std::mutex> Lock(MappingsMutex); | |
std::lock_guard Lock { MappingsMutex }; |
C++17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keryell The library is compiled with C++14
} | ||
|
||
pi_result _pi_mem::removeMapping(void *MappedTo, Mapping &MapInfo) { | ||
std::lock_guard<std::mutex> Lock(MappingsMutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++17...
size_t Size; | ||
}; | ||
|
||
virtual ~_pi_mem() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this at the end of the class.
Because my first comment here was "Why do you need this?" :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved, thx.
|
||
// L0 doesn't do the reference counting, so we have to do. | ||
// Must be atomic to prevent data race when incrementing/decrementing. | ||
std::atomic<pi_uint32> RefCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you could move this into a (CRTP?) mixin class to avoid all these copy-pastes in every class or so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thx.
@@ -0,0 +1,81 @@ | |||
# PI Level0 plugin library | |||
|
|||
message(STATUS "Download Level Zero loader and headers from github.com") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have an option to provide local version of the L0 headers and libraries like it is done for OpenCl dependencies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do this. But I have a question about how it is done for OpenCL. In configure.py:
21 ocl_header_dir = os.path.join(abs_obj_dir, "OpenCL-Headers")
22 icd_loader_lib = os.path.join(abs_obj_dir, "OpenCL-ICD-Loader", "build")
These paths are used if we pass --system-ocl to configure. But they are not system paths.
So this option/behavior is a little confusing. Shouldn't we look system paths (like /usr/local/lib or smth) if such option is provided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added ability to provide local version of headers and loader.
DEPENDS ocl-headers | ||
BUILD_BYPRODUCTS ${L0_LIBRARY} | ||
) | ||
ExternalProject_Add_Step(l0-loader llvminstall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like l0-loader is not added to deploy-sycl-toolchain. So it will not be available after recommended build process. So if we build compiler and runtime on machine with GPU it might be non-functional due to missed dependency library (l0-loader)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vladimirlaz I have a question, do we deploy OpenCL.so? I tried to do ninja install and don't see ICD in the installation directory. How this works for OpenCL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@againull you need to download it on build time (rather than using pre-installed) and specify a special flag on cmake configuration. See fd64c8f#diff-ad9ca624b2a795a5f24145a17c806d1f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added l0-loader to deploy-sycl-toolchain. One question: how the scenario described by Vladimir will work for OpenCL if ICD is added to deploy-sycl-toolchain only under special option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
77e8a78
Signed-off-by: Artur Gainullin <artur.gainullin@intel.com>
Conflict with master branch (in dependency.conf) appeared right just after all approvals. Merging master branch with conflict resolving dismissed all of them:( |
I think you've got all approvals, so let's wait for clean testing and merge. |
There were no changes made after approvals expect merging intel/sycl. Problem on cuda target fixed. Can we merge this PR? |
Can I use the description as a commit message or do you want to update it? |
Yes, description is ok for commit message. |
Plugin itself consists of the header and the source file plus cmake
file to build the plugin. Also the following changes were made to
suport the Level Zero plugin in SYCL RT:
* New level0 value was added to the backend enum
* New PI_LEVEL0 value support was added to the SYCL_BE config.
* Docs were updated. Mentioned Level Zero backend and provided the
link to the Level Zero runtime for Intel GPU.
* Changes in sycl cmake file to build Level Zero plugin by default and to
install it with sycl toolchain.
LIT testing with PI_LEVEL0 backend will be enabled in the following
commits. This commits introduces the plugin and makes it buildable.
Signed-off-by: Artur Gainullin artur.gainullin@intel.com