Hi, I noticed that some schedules in topi.cuda use thread block of size 512 in one dimension.
For example, here and here.
512 threads in one dimension is too big for AMD's OpenCL stack, because the driver restricts the maximum blockDim.x to 256. According to the discussion here, this restriction comes from the driver implementation, not from the hardware itself.
I understand that the schedules in topi/cuda are optimized for CUDA, but if I change the 512 threads to 256, all tests in https://github.com/dmlc/tvm/tree/master/topi/tests/python pass on my AMD GPU as well. I was also able to run the tutorials and the ResNet benchmark from the latest nnvm on an AMD GPU.
So, unless there is a strong reason to use a thread block of size 512, would it be possible to change 512 to 256?
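Another option, instead of hardcoding either value, would be for the schedule to clamp its preferred thread count to whatever limit the target device reports. A minimal sketch of that idea (pick_num_threads is a hypothetical helper, and the device limits are just illustrative):

```python
# Hypothetical helper: clamp a schedule's preferred thread count to the
# device's reported per-dimension work-group limit.
def pick_num_threads(preferred, device_max):
    """Return the largest thread count <= preferred that the device allows."""
    return min(preferred, device_max)

# AMD's OpenCL driver caps blockDim.x at 256, so 512 gets clamped down.
print(pick_num_threads(512, 256))   # 256 on AMD OpenCL
# CUDA devices typically allow 1024 threads per block, so 512 is kept.
print(pick_num_threads(512, 1024))  # 512 on CUDA
```

That way the CUDA schedules could keep 512 where it helps, while OpenCL targets automatically fall back to 256.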
Of course, if you are planning to implement dedicated schedules for OpenCL, that would be even better :)