Description
In the discussion here: #1246 (comment), it was described that urKernelSuggestMaxCooperativeGroupCountExp maps to cudaOccupancyMaxActiveBlocksPerMultiprocessor, which takes a kernel and other parameters and returns the maximum number of blocks that can execute simultaneously on a streaming multiprocessor (SM).
However, I found this in the Level Zero (L0) documentation:
"Use zeKernelSuggestMaxCooperativeGroupCount to recommend max group count for device for cooperative functions that device supports."
The word "device" implies that urKernelSuggestMaxCooperativeGroupCountExp should return the maximum number of blocks that can execute simultaneously on a whole device. A device consists of multiple streaming multiprocessors, so in that case the maximum number of blocks that can execute simultaneously on one SM must be multiplied by the number of SMs in the device.
The number of SMs can only be retrieved by querying the device the kernel is to be run on. This information (the device to be run on) is not passed to urKernelSuggestMaxCooperativeGroupCountExp, nor can it be inferred from any of the other parameters.
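To make the difference between the two readings concrete, here is a minimal sketch of the per-device calculation. The helper name `maxCooperativeBlocksPerDevice` is hypothetical and is not part of Unified Runtime or CUDA. It assumes the per-SM value comes from cudaOccupancyMaxActiveBlocksPerMultiprocessor and the SM count from a device query such as cudaDeviceGetAttribute with cudaDevAttrMultiProcessorCount.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration only (not a UR or CUDA API).
//
// blocksPerSM: per-SM result, assumed to come from
//              cudaOccupancyMaxActiveBlocksPerMultiprocessor.
// smCount:     number of SMs on the target device, assumed to come from
//              cudaDeviceGetAttribute(..., cudaDevAttrMultiProcessorCount, device).
//              Obtaining it requires knowing which device the kernel runs on.
inline std::uint64_t maxCooperativeBlocksPerDevice(int blocksPerSM, int smCount) {
    assert(blocksPerSM >= 0 && smCount >= 0);
    // Per-device limit = per-SM limit multiplied by the number of SMs.
    return static_cast<std::uint64_t>(blocksPerSM) *
           static_cast<std::uint64_t>(smCount);
}
```

The second argument is the point of the issue: without a device, there is nothing to pass for `smCount`, so only the per-SM value can be returned.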
Therefore, there are two possibilities:
- if the intended semantics are the max number of blocks per device, the interface needs to be changed so that the device is available to the call.
- if the intended semantics are the max number of blocks per SM, the documentation should be clarified IMO.