Replies: 10 comments
-
Good question. There are a few ways to avoid this issue:

NFS (Network File System)
If you are on one node or have some NFS directory, you can set the cache directory (the OCCA_CACHE_DIR environment variable) to a shared location; a minimal sketch is below. The downside to this is that multiple MPI ranks could potentially be waiting for 1 rank to finish compiling the kernel.

Warmup
You could also compile all the kernels you need beforehand and ship them to the rank directories (NFS doesn't matter, as long as the binaries are available on the machine).

Other things we've thought about but haven't implemented yet:

CLI
We've thrown an idea around of being able to record kernels that get compiled and use the CLI to precompile them.

Async builds
Right now kernel builds are blocking.
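For example, a minimal sketch of the shared cache-directory approach (the path, device mode, and kernel file are only illustrative):

#include <cstdlib>
#include <occa.hpp>

int main() {
  // Point OCCA's kernel cache at a directory every rank can see
  // (NFS or a parallel file system). Set it before creating the device.
  setenv("OCCA_CACHE_DIR", "/shared/project/.occa", 1);

  occa::device device("mode: 'CUDA', device_id: 0");

  // Every rank issues the same build; whichever rank compiles first
  // populates the shared cache, and the rest load the cached binary.
  occa::kernel kernel = device.buildKernel("addVectors.okl", "addVectors");

  return 0;
}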
-
In my use case, I won't know what the kernel is until runtime, but it will likely be the same kernel over and over. I also don't care too much about ranks waiting for the compilation, since hopefully this happens only the first time it's created. I have crashed entire file systems by writing many very small files from all ranks (and for other reasons, of course). I will have a parallel file system, so I guess I would go with the first option. Is the right way to go about this something like pointing the cache directory at the parallel file system?
-
In an ideal world, rank 0 would be the only one compiling and reading the kernel from disk. It would then broadcast "it" to all the other ranks. That said, I don't know how the sausage is made and whether that would even work.
-
If you set OCCA_CACHE_DIR to a shared directory and build as usual, e.g.

occa::kernel kernel = device.buildKernelFromString(str, "map");

whichever rank gets to the kernel compilation step first will create a lock file inside the cache directory. The lock file acts as a barrier: other ranks will wait until the lock is gone and read the binary file once it shows up. If the lock file takes too long (~20 sec), a different process will take over and compile the file. If you do find some issues with this process, maybe something like:

if (rank == 0) {
  // Will build the kernel
  buildKernel();
  barrier();
} else {
  barrier();
  // Will pick up the cached kernel
  buildKernel();
}
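Spelled out with MPI, the rank-0-first pattern above might look roughly like this (a sketch only; it assumes the cache directory lives on a file system every rank can read, and the helper name is made up):

#include <mpi.h>
#include <string>
#include <occa.hpp>

// Only rank 0 compiles; the other ranks then pick up the warm cache entry.
occa::kernel buildOnRankZeroFirst(occa::device &device,
                                  const std::string &source,
                                  const std::string &name,
                                  MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);

  occa::kernel kernel;
  if (rank == 0) {
    kernel = device.buildKernelFromString(source, name);  // compiles and caches
  }
  MPI_Barrier(comm);
  if (rank != 0) {
    kernel = device.buildKernelFromString(source, name);  // reads the cached binary
  }
  return kernel;
}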
-
Oh interesting, we could add another method that lets you do this. Right now we have:

std::string binaryFilename = kernel.binaryFilename();
device.buildKernelFromBinary(binaryFilename, "map");

However, you probably want something more like:

void *binary = kernel.getBinaryData();
// MPI send the binary everywhere
device.buildKernelFromBinaryData(binary, "map");
-
That would work as long as I know the size.
-
Looks like we do this in the OpenCL backend:

size_t binaryBytes;
const char *binary = occa::io::c_read(kernel.binaryFilename(), &binaryBytes, true);
...
delete [] binary;

These helper methods aren't reaaaally meant to be used outside of OCCA, but maybe it'll help if you want to write some wrapper.
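One possible shape for such a wrapper, as a sketch only: rank 0 builds and reads the binary bytes with the internal helper above, broadcasts them, and the other ranks write the bytes to a local file and build from that. The helper name and local path are made up, and this assumes the binary is usable on every node (same backend and architecture).

#include <fstream>
#include <string>
#include <vector>
#include <mpi.h>
#include <occa.hpp>

occa::kernel buildAndBroadcast(occa::device &device,
                               const std::string &source,
                               const std::string &name,
                               MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);

  occa::kernel kernel;
  std::vector<char> bytes;
  unsigned long long count = 0;

  if (rank == 0) {
    kernel = device.buildKernelFromString(source, name);
    // occa::io::c_read is an internal OCCA helper, not a stable public API.
    size_t binaryBytes = 0;
    const char *binary = occa::io::c_read(kernel.binaryFilename(), &binaryBytes, true);
    bytes.assign(binary, binary + binaryBytes);
    delete [] binary;
    count = binaryBytes;
  }

  // Broadcast the size first, then the payload.
  MPI_Bcast(&count, 1, MPI_UNSIGNED_LONG_LONG, 0, comm);
  bytes.resize(count);
  MPI_Bcast(bytes.data(), (int) count, MPI_CHAR, 0, comm);

  if (rank != 0) {
    // Stash the binary in rank-local storage and build from it.
    const std::string localPath = "/tmp/" + name + ".bin";
    std::ofstream out(localPath, std::ios::binary);
    out.write(bytes.data(), (std::streamsize) count);
    out.close();
    kernel = device.buildKernelFromBinary(localPath, name);
  }
  return kernel;
}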
-
@dmed256 I assume that you are wary of introducing MPI as a dependency for OCCA. However, it would be useful to have an option to build with the dependency and a method that does something like the sketch below:
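Something along these lines, where the method name, signature, and communicator argument are purely hypothetical:

// Hypothetical addition to occa::device: every rank calls it collectively,
// only rank 0 compiles, the binary is broadcast over the given communicator,
// and each rank gets back a usable kernel.
occa::kernel kernel = device.buildKernelMPI("kernel.okl", "map", MPI_COMM_WORLD);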
If MPI is not available, this would default to the standard buildKernel. I think it is important to delineate this as a process involving MPI, since the behavior is distinctly different from the non-MPI case. This is a super important feature needed for building on leadership-class facility systems.
-
I think it would be fine without MPI if all the functions are there to build it on one rank and broadcast it to the others.
-
I agree with @mclarsen, I don't think you need an explicit MPI dependency. If you can provide a mechanism to fetch the binary blob that represents the code from the existing kernel (or from the device build step), and then a way to reconstruct a kernel from that blob -- that's enough to support this use case. The MPI plumbing won't be hard and can be left up to the calling code base. How about adding a method to fetch the blob and another to reconstruct a kernel from it, used roughly like the sketch below?
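A possible shape for that pair, building on the getBinaryData idea earlier in the thread; the names, signatures, and the size out-parameter are only illustrative:

// Hypothetical: pull the compiled binary out of an existing kernel.
size_t binaryBytes = 0;
const void *binary = kernel.getBinary(&binaryBytes);

// The calling code handles the MPI plumbing: broadcast binaryBytes, then binary.

// Hypothetical: reconstruct a kernel on the receiving ranks from the raw bytes.
occa::kernel remoteKernel =
    device.buildKernelFromBinaryData(binary, binaryBytes, "map");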
I think I know a path to support this for CUDA + OpenCL. Other backends can throw an exception if they can't support it (following a pattern like in the Metal backend)? Happy to take a pass at this if that sounds like it's the right approach.
-
I have an mpi program that needs to JIT compile kernels at runtime, and I am wondering about some of the logistics. Do I need to worry about all ranks compiling the same kernel? I assume that I would only want a single rank to compile the kernel, and have all the other ranks use it.