Replies: 10 comments
-
Good question. There are a few ways to avoid this issue:

NFS (Network File System)
If you are on one node or have some NFS directory, you can set the cache directory (the OCCA_CACHE_DIR environment variable) to a shared location; a minimal sketch is below. The downside to this is that multiple MPI ranks could potentially be waiting for 1 rank to finish compiling the kernel.

Warmup
You could also compile all the kernels you need beforehand and ship them to the rank directories (NFS doesn't matter, as long as the binaries are available on the machine).

Other things we've thought about but haven't implemented yet:

CLI
We've thrown an idea around of being able to record kernels that get compiled and use the CLI to precompile them.

Async builds
Right now kernel builds are blocking.
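For example, a minimal sketch of the shared cache-directory approach (the path, device mode, and kernel file are only illustrative):

#include <cstdlib>
#include <occa.hpp>

int main() {
  // Point OCCA's kernel cache at a directory every rank can see
  // (NFS or a parallel file system). Set it before creating the device.
  setenv("OCCA_CACHE_DIR", "/shared/project/.occa", 1);

  occa::device device("mode: 'CUDA', device_id: 0");

  // Every rank issues the same build; whichever rank compiles first
  // populates the shared cache, and the rest load the cached binary.
  occa::kernel kernel = device.buildKernel("addVectors.okl", "addVectors");

  return 0;
}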
-
In my use case, I won't know what the kernel is until runtime, but it will likely be the same kernel over and over. I also don't care too much about ranks waiting for the compilation, since hopefully this happens only the first time it's created. I have crashed entire file systems by writing many very small files from all ranks (and for other reasons, of course). I will have a parallel file system, so I guess I would go with the first option. Is the right way to go about this something like pointing the cache directory at the parallel file system?
-
In an ideal world, rank 0 would be the only one compiling and reading the kernel from disk. It would then broadcast "it" to all the other ranks. That said, I don't know how the sausage is made and whether that would even work.
-
If you set OCCA_CACHE_DIR to a shared directory and build as usual, e.g.

occa::kernel kernel = device.buildKernelFromString(str, "map");

whichever rank gets to the kernel compilation step first will create a lock file inside the cache directory. The lock file acts as a barrier: other ranks will wait until the lock is gone and read the binary file once it shows up. If the lock file takes too long (~20 sec), a different process will take over and compile the file. If you do find some issues with this process, maybe something like:

if (rank == 0) {
  // Will build the kernel
  buildKernel();
  barrier();
} else {
  barrier();
  // Will pick up the cached kernel
  buildKernel();
}
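Spelled out with MPI, the rank-0-first pattern above might look roughly like this (a sketch only; it assumes the cache directory lives on a file system every rank can read, and the helper name is made up):

#include <mpi.h>
#include <string>
#include <occa.hpp>

// Only rank 0 compiles; the other ranks then pick up the warm cache entry.
occa::kernel buildOnRankZeroFirst(occa::device &device,
                                  const std::string &source,
                                  const std::string &name,
                                  MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);

  occa::kernel kernel;
  if (rank == 0) {
    kernel = device.buildKernelFromString(source, name);  // compiles and caches
  }
  MPI_Barrier(comm);
  if (rank != 0) {
    kernel = device.buildKernelFromString(source, name);  // reads the cached binary
  }
  return kernel;
}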
-
Oh interesting, we could add another method that lets you do this. Right now we have:

std::string binaryFilename = kernel.binaryFilename();
device.buildKernelFromBinary(binaryFilename, "map");

However, you probably want something more like:

void *binary = kernel.getBinaryData();
// MPI send the binary everywhere
device.buildKernelFromBinaryData(binary, "map");
-
That would work as long as I know the size.
-
Looks like we do this in the OpenCL backend:

size_t binaryBytes;
const char *binary = occa::io::c_read(kernel.binaryFilename(), &binaryBytes, true);
...
delete [] binary;

These helper methods aren't reaaaally meant to be used outside of OCCA, but maybe it'll help if you want to write some wrapper.
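One possible shape for such a wrapper, as a sketch only: rank 0 builds and reads the binary bytes with the internal helper above, broadcasts them, and the other ranks write the bytes to a local file and build from that. The helper name and local path are made up, and this assumes the binary is usable on every node (same backend and architecture).

#include <fstream>
#include <string>
#include <vector>
#include <mpi.h>
#include <occa.hpp>

occa::kernel buildAndBroadcast(occa::device &device,
                               const std::string &source,
                               const std::string &name,
                               MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);

  occa::kernel kernel;
  std::vector<char> bytes;
  unsigned long long count = 0;

  if (rank == 0) {
    kernel = device.buildKernelFromString(source, name);
    // occa::io::c_read is an internal OCCA helper, not a stable public API.
    size_t binaryBytes = 0;
    const char *binary = occa::io::c_read(kernel.binaryFilename(), &binaryBytes, true);
    bytes.assign(binary, binary + binaryBytes);
    delete [] binary;
    count = binaryBytes;
  }

  // Broadcast the size first, then the payload.
  MPI_Bcast(&count, 1, MPI_UNSIGNED_LONG_LONG, 0, comm);
  bytes.resize(count);
  MPI_Bcast(bytes.data(), (int) count, MPI_CHAR, 0, comm);

  if (rank != 0) {
    // Stash the binary in rank-local storage and build from it.
    const std::string localPath = "/tmp/" + name + ".bin";
    std::ofstream out(localPath, std::ios::binary);
    out.write(bytes.data(), (std::streamsize) count);
    out.close();
    kernel = device.buildKernelFromBinary(localPath, name);
  }
  return kernel;
}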
-
@dmed256 I assume that you are wary of introducing MPI as a dependency for OCCA. However, it would be useful to have an option to build with the dependency and a method that does something like the sketch below:
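Something along these lines, where the method name, signature, and communicator argument are purely hypothetical:

// Hypothetical addition to occa::device: every rank calls it collectively,
// only rank 0 compiles, the binary is broadcast over the given communicator,
// and each rank gets back a usable kernel.
occa::kernel kernel = device.buildKernelMPI("kernel.okl", "map", MPI_COMM_WORLD);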
If MPI is not available, this would default to the standard buildKernel. I think it is important to delineate this as a process involving MPI, since the behavior is distinctly different from the non-MPI case. This is a super important feature needed for building on leadership-class facility systems.
-
I think it would be fine without MPI if all the functions are there to build it on one rank and broadcast it to the others.
-
I agree with @mclarsen, I don't think you need an explicit MPI dependency. If you can provide a mechanism to fetch the binary blob that represents the code from the existing kernel (or from the device build step), and then a way to reconstruct a kernel from that blob -- that's enough to support this use case. The MPI plumbing won't be hard and can be left up to the calling code base. How about adding a method to fetch the blob and another to reconstruct a kernel from it, used roughly like the sketch below?
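A possible shape for that pair, building on the getBinaryData idea earlier in the thread; the names, signatures, and the size out-parameter are only illustrative:

// Hypothetical: pull the compiled binary out of an existing kernel.
size_t binaryBytes = 0;
const void *binary = kernel.getBinary(&binaryBytes);

// The calling code handles the MPI plumbing: broadcast binaryBytes, then binary.

// Hypothetical: reconstruct a kernel on the receiving ranks from the raw bytes.
occa::kernel remoteKernel =
    device.buildKernelFromBinaryData(binary, binaryBytes, "map");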
I think I know a path to support this for CUDA + OpenCL. Other backends can throw an exception if they can't support it (following a pattern like in the Metal backend)? Happy to take a pass at this if that sounds like it's the right approach.
-
I have an mpi program that needs to JIT compile kernels at runtime, and I am wondering about some of the logistics. Do I need to worry about all ranks compiling the same kernel? I assume that I would only want a single rank to compile the kernel, and have all the other ranks use it.