On this - do you know of large-scale xbatcher examples in this space? E.g. will xarray overhead on top of other things for deep learning slow things down?
Large scale - I guess things with billions of pixels upwards, not simple examples that just look at a satellite scene or two. Cloud rate limiting/bandwidth can be 'slow network' in that sense.
Background
To keep GPU utilization high whilst training neural networks, the data loading pipeline needs to keep pace with the GPU. Assuming that data preprocessing mostly happens on the CPU, this usually involves some form of concurrency and/or parallelism.
However, Python 3.x (specifically CPython) has the Global Interpreter Lock (GIL), which keeps single-threaded execution simple and fast, but prevents CPU-bound work from running in parallel across threads, see https://realpython.com/python-gil and other sources for more information.
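As a concrete illustration (not from the original post), here is a minimal sketch of the usual workaround in a deep learning context, assuming a PyTorch training loop: the `DataLoader` moves preprocessing into separate worker processes, each with its own GIL, so batches can be prepared while the GPU is busy.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomChips(Dataset):
    """Stand-in dataset: pretend each item needs CPU-side preprocessing."""

    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        # CPU-bound work (decoding, normalisation, augmentation, ...) would live here
        x = torch.randn(3, 256, 256)
        y = torch.randint(0, 2, (1,))
        return x, y


if __name__ == "__main__":  # guard needed where worker processes are spawned
    loader = DataLoader(
        RandomChips(),
        batch_size=32,
        num_workers=4,           # separate worker processes, each with its own GIL
        pin_memory=True,         # page-locked host memory speeds up CPU -> GPU copies
        prefetch_factor=2,       # each worker keeps 2 batches ready ahead of the GPU
        persistent_workers=True,
    )

    for x, y in loader:
        pass  # the forward/backward pass on the GPU would go here
```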
CPU/RAM/IO-bound limitations
So to make parallelization possible, several methods are used, and which one to use depends on what your processing is bounded by (see the sketch after the Dask-ML note below):
- CPU-bound: use multiple processes (e.g. multiprocessing / concurrent.futures.ProcessPoolExecutor), since each process gets its own GIL
- RAM-bound: use chunked/out-of-core processing (e.g. dask), so only a piece of the data is in memory at any time
- I/O-bound: use multi-threading or async/await, since the GIL is released while waiting on disk or network
Or as this diagram from Dask-ML (https://ml.dask.org/#dimensions-of-scale) shows, scaling problems are usually framed along compute-bound (CPU) and memory-bound (RAM) dimensions.
Note that the I/O dimension isn't mentioned, though dask does have an advanced mode to support async/await operations (https://docs.dask.org/en/stable/deploying-python-advanced.html#start-many-in-one-event-loop).
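To make that mapping concrete, here is a minimal sketch using only the standard library (the fetch/decode helpers and URLs are placeholders, not from the original post): threads for the I/O-bound stage, processes for the CPU-bound stage. The RAM-bound case is what dask's chunking addresses.

```python
import concurrent.futures as cf
import urllib.request


def fetch(url: str) -> bytes:
    # I/O-bound: fine in threads, because CPython releases the GIL while
    # waiting on the network.
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def decode(blob: bytes) -> int:
    # CPU-bound stand-in: needs separate processes to run in parallel, since
    # the GIL lets only one thread execute Python bytecode at a time.
    return sum(blob)


if __name__ == "__main__":
    urls = ["https://example.com/"] * 8  # placeholder URLs

    # I/O-bound stage -> ThreadPoolExecutor
    with cf.ThreadPoolExecutor(max_workers=8) as pool:
        blobs = list(pool.map(fetch, urls))

    # CPU-bound stage -> ProcessPoolExecutor (one GIL per process)
    with cf.ProcessPoolExecutor(max_workers=4) as pool:
        checksums = list(pool.map(decode, blobs))

    print(checksums)
```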
Breaking speeds in cloud-native workflows
One could argue that 'cloud' processing virtually eliminates compute (CPU) and memory (RAM) limitations, given enough $$ resources. The main bottleneck that remains is thus in communication overhead with I/O operations. There are several aspects to this:
- network locality: compute should run in the same cloud region as the data it reads
- CPU-GPU transfer: data shuttled between host and device memory adds another hop
- file format: non cloud-optimized formats (e.g. plain HDF5) need many small, scattered reads, whereas cloud-optimized formats allow efficient ranged requests
Missing any one of the above (e.g. not working in the same cloud region, doing mixed CPU-GPU processing, reading from non cloud-optimized HDF5 or other files) can result in latency. Ideally, you would tackle these at the root, but not everyone has the privilege of being able to re-architect the cloud infrastructure or of working with the latest file formats.
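For the file-format aspect, here is a hedged sketch of what a cloud-optimized read pattern can look like with xarray and Zarr (the bucket path and variable name are placeholders; assumes xarray, zarr, dask and s3fs are installed):

```python
import xarray as xr

# Hypothetical cloud-optimised (Zarr) store; the bucket path is a placeholder.
# open_zarr is lazy: each chunk maps to a ranged read against object storage,
# instead of the many scattered reads a non cloud-optimised HDF5 file may need.
ds = xr.open_zarr(
    "s3://example-bucket/sentinel2.zarr",
    storage_options={"anon": True},  # passed through to fsspec/s3fs
)

# Slice lazily and only pull the subset you actually need over the network;
# running this in the same region as the bucket keeps that transfer fast.
subset = ds["reflectance"].isel(time=0).sel(x=slice(0, 10_240), y=slice(0, 10_240))
patch = subset.compute()  # data is transferred only at this point
```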
So what, then, is the solution to overcome latency?
Redesign for async?
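As one possible interpretation (a sketch, not the post's prescribed solution), 'async' at the data-fetching layer could look like the following, using asyncio with aiohttp and placeholder URLs: many requests are kept in flight on a single thread, so per-request latency is overlapped rather than paid serially.

```python
import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    # Many requests can be "in flight" at once on a single thread; the event
    # loop switches tasks whenever one is waiting on the network.
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()


async def fetch_all(urls: list[str]) -> list[bytes]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))


if __name__ == "__main__":
    # Placeholder URLs; in practice these might be chunk/tile URLs in object storage.
    urls = ["https://example.com/"] * 8
    blobs = asyncio.run(fetch_all(urls))
    print(sum(len(b) for b in blobs), "bytes downloaded")
```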
References: