This little project provides an implementation of a storage API for SYCL. Its particularity is that calls can be emitted straight from a running SYCL kernel, without interrupting it. The API communicates with the host using remote procedure calls. It is designed to leverage hardware capabilities such as NVIDIA's GPUDirect Storage, which bypasses the host entirely and lets the GPU store and read data on a compatible file system. Without DMA support, the API works in an emulated mode and uses the host's OS to store data. Unlike the DMA mode, this emulated mode works on any filesystem.
```c++
fs<T> fs(q, 1, tmp_space);
q.submit([&](handler &cgh) { /* Works on a GPU */
    fs_accessor<T> storage_accessor = fs.get_access();
    cgh.single_task([=]() {
        /* Loading a picture using a C++ decoder on the host */
        storage_accessor.load_image(0, "Cat.jpg", tmp_accessor); /* Returns the picture size on success */
        /* Do your thing and then store the data in a simple file */
        if (auto fh = storage_accessor.open<fs_mode::write_only>(0, "Neural_Network_Result.dat")) {
            fh->write(data_buffer, data_length);
            fh->close();
        }
    });
}).wait();
```
- Concurrent filesystem I/O from all work-groups and work-items.
- Leveraging the device's hardware I/O features to bypass the host OS (DMA).
- Possibility to open a file across a whole work-group to increase throughput and avoid control-flow divergence.
- Possibility to use host libraries to encode/decode files: images, audio, video, streams, etc.
- USM and accessor interfaces.
- Dynamic threaded access to the underlying filesystem (enabled by default).
- I/O latency control to reduce pressure on the CPU.
- POSIX functionality (opening modes and seeking).
- Type-safe & header-based API.
- Dataset-size-independent kernel example (see NVIDIA's cooperative-groups programming paradigm).
- Machine learning: with this API, one can train a network without ever having to stop the kernel to load new datasets. The SYCL kernel can load the datasets, say pictures, itself. One can also save the trained network at runtime (which would even allow inter-kernel communication).
- Real-time processing: one could open some character devices in parallel and process the data in real time, for example image recognition on a video feed.
- Batch processing: see the example. With a single kernel launch and one fixed-size memory allocation, we can process an unbounded number of files on the device, without having to re-allocate or re-launch kernels (a rough sketch follows this list).
- Processing files that do not fit in GPU memory: to process a petabyte-scale dataset on a regular GPU, one would have to launch many kernels and manage the data movement. What if the kernel has to perform random accesses on the file? Now it can all be done from the kernel. See random_walk, which couldn't easily be GPU-accelerated otherwise.
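A minimal sketch of that batch-processing pattern, where `file_names`, `file_count`, `buffer`, and `process` are hypothetical placeholders rather than parts of the API:

```c++
// Illustrative sketch only: a single kernel launch reuses one fixed-size
// buffer to handle an arbitrary number of files.
q.submit([&](sycl::handler &cgh) {
    auto acc = reader.get_access();
    cgh.single_task([=]() {
        for (size_t i = 0; i < file_count; ++i) {
            // fs_mode::read_only is an assumed opening mode.
            if (auto fh = acc.open<sycl::fs_mode::read_only>(0, file_names[i])) {
                size_t n = fh->read(buffer, max_read_len, 0); // reuse the same buffer
                fh->close();
                process(buffer, n); // hypothetical per-file computation
            }
        }
    });
}).wait();
```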
- Parallel API: all work-items can call functions at the same time.
- From a SYCL kernel, one can choose to perform a function call in a synchronous or asynchronous manner.
- The host can answer the function calls synchronously or asynchronously, and one can choose, from the kernel, whether the host will spawn a thread. This is useful if the functions are expensive or blocking.
- Easy to set up: the user defines the functions that will be remotely executed and provides a runner function that performs the call on the host, probably a big `switch` (see the sketch after this list).
- Ability to choose the frequency at which the host answers the function calls, to avoid starvation.
- Exceptions thrown by the remote function on the host are caught, and the function call on the GPU side appears as failed.
- The user can now safely call `abort()` from a SYCL kernel (all pending calls are completed before crashing the host) and access a global timer.
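To make the runner idea concrete, here is a minimal host-side sketch; the `rpc_func`/`rpc_call` types and the marshalling shown here are assumptions for illustration, not the library's actual types:

```c++
#include <cstdint>
#include <exception>

// Hypothetical sketch of a host-side RPC runner; the real registration
// mechanism and argument marshalling of this library may differ.
enum class rpc_func : std::uint32_t { open_file, read_chunk, decode_jpeg };

struct rpc_call {        // Assumed shape of a marshalled call
    rpc_func id;
    void *args;          // Points into the shared host/device argument buffer
    void *result;        // Where the runner stores the return value
    bool failed = false; // Reported back to the calling work-item
};

void runner(rpc_call &call) {
    try {
        switch (call.id) { // the "big switch" mentioned above
            case rpc_func::open_file:   /* unpack args, call ::open(...) */ break;
            case rpc_func::read_chunk:  /* unpack args, call ::read(...) */ break;
            case rpc_func::decode_jpeg: /* call a host image decoder */     break;
        }
    } catch (const std::exception &) {
        // Exceptions thrown on the host are caught here, so the
        // GPU-side function call simply appears as failed.
        call.failed = true;
    }
}
```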
The benchmark (in this repository) reaches a bandwidth of 17 GiB/s when reading files with an NVIDIA GPU (thanks to filesystem caching on the host). The same figures are observed on the CPU with OpenCL, which suggests that the implementation is bound by the host's storage controller/HDD.
The detailed API documentation is here.
As with buffers, we create an `fs` object on the host and then pass an accessor to the SYCL kernel:
#include "sycl_fs.hpp"
...
auto q = sycl::queue(sycl::gpu_selector{});
sycl::fs<T> reader(q, parallel_channel_count, max_read_len);
q.submit([&](sycl::handler &cgh) {
auto acc = reader.get_access();
cgh.single_task([=]() {
...
});
});
Right now, the implementation supports only a fixed number of parallel communication channels, and a channel cannot be accessed twice simultaneously. One could, however, use the work-group ID to ensure there is no conflict (and, for example, load a file into local memory with a single work-item, then run the computation on the whole work-group), as sketched below.
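For illustration, here is a minimal sketch of that pattern, assuming the `fs` object was created with one channel per work-group, that an `fs_mode::read_only` mode exists, and that `read` accepts a pointer to local memory (none of this is verbatim library API):

```c++
// Illustrative sketch: channel index == work-group ID, so no channel
// is ever accessed twice simultaneously.
q.submit([&](sycl::handler &cgh) {
    auto acc = reader.get_access();
    // Work-group local storage for the file contents.
    sycl::accessor<T, 1, sycl::access::mode::read_write, sycl::access::target::local>
        tile(sycl::range<1>(max_read_len), cgh);
    cgh.parallel_for<class group_io>(
        sycl::nd_range<1>(global_size, local_size), [=](sycl::nd_item<1> it) {
            size_t channel = it.get_group(0); // one channel per work-group
            if (it.get_local_id(0) == 0) {    // a single work-item performs the I/O
                if (auto fh = acc.open<sycl::fs_mode::read_only>(channel, "my_file.txt")) {
                    fh->read(&tile[0], max_read_len, 0);
                    fh->close();
                }
            }
            it.barrier(sycl::access::fence_space::local_space);
            // The whole work-group can now compute on tile[...] without divergence.
        });
});
```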
The host polling frequency (see later) is set to 100000 Hz (probably too high), which means that we cannot perform more than 100000 function calls per channel, per second (i.e., the host polls each channel every 10 µs). This is done to avoid starvation.
In a kernel, on the device, a file can be opened with:
```c++
auto fh = acc.open<sycl::fs_mode::read_write>(0, "my_file.txt");
```
It returns a `std::optional<fs_descriptor...>` containing the file descriptor on success:
```c++
if (fh) { // Checking whether the file was successfully opened
    size_t number_written = fh->write(a_message, message_size);
    size_t number_read = fh->read(a_buffer, message_size, 0); // specifying the offset
    fh->close();
}
```
All sizes correspond to the number of elements of type `T` that we want to process or that were processed.
The `fs_mode` values are mapped to the POSIX ones, but they are wrapped in this `enum class` to abstract the platform.
When creating the `fs` object, one can set the template parameter `parallel_host_file_io` to `true`. This results in the remote procedure calls being executed in parallel on the host, as sketched below.
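For example (assuming `parallel_host_file_io` is the next template parameter of `sycl::fs`; check `sycl_fs.hpp` for its exact position and default):

```c++
// Assumption: parallel_host_file_io follows the element type in the
// template parameter list; the actual header may order it differently.
sycl::fs<T, /*parallel_host_file_io=*/true> reader(q, parallel_channel_count, max_read_len);
```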
Tested and working on hipSYCL with the CUDA and OpenMP backends, as well as on DPC++ with the OpenCL and CUDA backends.
With CUDA, something interesting happens with the synchronous version of the RPC demo: with hipSYCL, every thread executes in parallel, while with DPC++, a barrier seems to be added between the synchronous function calls and the prints. See the comment in the code.
It has been tested with up to 30 000 simultaneous RPC calls to blocking CPU functions (calls in a `parallel_for`), issued from a `sycl::device` using the CUDA backend, with the parallel runners. With more parallel calls, the OS starts complaining about too many threads.
Change the device selector if needed. Then, in a build folder:

```bash
# hipSYCL
hipSYCL_dir=/path/to_install cmake .. -DHIPSYCL_TARGETS="omp;cuda:sm_75"
# DPC++
CXX=dpcpp_compiler cmake ..
```

Edit the CMake file to change the targets.
- Socket support
- Asynchronous file descriptors in the kernel: useful only in DMA mode; otherwise there should be no advantage.