I noticed that for a 1D input tensor, we can define the index space in such a way that at most 2 TPC cores are utilized (as in the example). To use 4 TPCs, the tensor must be 2D. What I want to achieve is to take a 1D tensor and divide the load equally across all TPC cores. So for a 1D tensor of size 512, I want each TPC core to handle 64 elements. But all I can accomplish is 2 TPCs, each handling 256 elements. Why is that?
int elementsInVec = 64;
unsigned depthIndex = (outputSizes[0] + (elementsInVec - 1)) / elementsInVec;
kernel->indexSpaceGeometry.dims = 1;
kernel->indexSpaceGeometry.sizes[0] = depthIndex;
kernel->inputTensorAccessPattern[0].dim[0].dim = 0;
kernel->inputTensorAccessPattern[0].dim[0].start_a = elementsInVec;
kernel->inputTensorAccessPattern[0].dim[0].end_a = elementsInVec;
kernel->inputTensorAccessPattern[0].dim[0].start_b = 0;
kernel->inputTensorAccessPattern[0].dim[0].end_b = elementsInVec - 1;
I defined the mapping as:
- startF(x) = 64*x + 0
- endF(x) = 64*x + 63
but it seems to be ignored; instead, it behaves as if the mapping were:
- startF(x) = 256*x + 0
- endF(x) = 256*x + 255
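To make the two mappings concrete, here is a minimal standalone sketch in plain host-side C++ (not TPC or glue code). The helper names `elementRange`, `indexSpaceSize` and `printRanges` are hypothetical and exist only for this illustration. It also assumes that x runs over 0 .. sizes[0] - 1, which is exactly the part I am unsure about.

```cpp
#include <cassert>
#include <cstdio>

// Inclusive range of tensor elements covered by one index-space member.
struct Range {
    int start;
    int end;
};

// Affine mapping f(x) = a*x + b applied to both ends of the range,
// mirroring the start_a/start_b/end_a/end_b fields of the access pattern.
Range elementRange(int x, int startA, int startB, int endA, int endB) {
    assert(x >= 0);
    return Range{startA * x + startB, endA * x + endB};
}

// Number of index-space members needed to cover `size` elements
// (ceiling division, same formula as depthIndex in my glue code).
int indexSpaceSize(int size, int elementsPerMember) {
    assert(elementsPerMember > 0);
    return (size + elementsPerMember - 1) / elementsPerMember;
}

// Prints the element range of every member for a given step size.
// ASSUMPTION: x takes every value in 0 .. members - 1.
void printRanges(int size, int step) {
    int members = indexSpaceSize(size, step);
    for (int x = 0; x < members; ++x) {
        Range r = elementRange(x, step, 0, step, step - 1);
        std::printf("x=%d -> [%d, %d]\n", x, r.start, r.end);
    }
}
```

Under that assumption, `printRanges(512, 64)` lists 8 ranges, from [0, 63] up to [448, 511], which is what I intended. `printRanges(512, 256)` lists only 2 ranges, [0, 255] and [256, 511], which matches what I actually observe.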
What values will x actually take? Just [0, 1]? What is wrong with my code? Is it even possible to launch 8 TPCs for a data layout like this?