Xarray GPU optimization #771
Merged
---
title: 'Accelerating AI/ML Workflows in Earth Sciences with GPU-Native Xarray and Zarr (and more!)'
date: '2025-05-01'
authors:
  - name: Negin Sobhani
    github: negin513
  - name: Wei Ji Leong
    github: weiji14
  - name: Max Jones
    github: maxjones
  - name: Akshay Subramaniam
    github: akshaysubr
  - name: Thomas Augspurger
    github: tomaugspurger
  - name: Katelyn Fitzgerald
    github: kafitzgerald
summary: 'How to accelerate AI/ML workflows in Earth Sciences with GPU-native Xarray and Zarr.'
---

> **Review:** Can we make this more direct? "X% speedup" or "X MBps throughput"?
> **Reply:** Added ~17x all-in-all improvement! 💪💪💪
# Accelerating AI/ML Workflows in Earth Sciences with GPU-Native Xarray and Zarr (and more!)
## TL;DR

All in all, the optimizations below delivered roughly a **17x** end-to-end speedup for our training pipeline.

## Introduction
In large-scale geospatial AI and machine learning workflows, data loading is often the main bottleneck. Traditional pipelines rely on CPUs to preprocess and transfer massive datasets from storage to GPU memory, which ties up CPU resources, limits scalability, and leaves GPUs underutilized.
To tackle this issue, a team from the [National Center for Atmospheric Research (NSF-NCAR)](https://ncar.ucar.edu) and [Development Seed](https://developmentseed.org), with mentors from [NVIDIA](https://www.nvidia.com), participated in a GPU hackathon to demonstrate how AI/ML workflows in the Earth system sciences can benefit from GPU-native workflows using tools such as [Zarr](https://zarr.readthedocs.io/), [KvikIO](https://docs.rapids.ai/api/kvikio/stable/), and [DALI](https://developer.nvidia.com/dali).
In this post, we share our hackathon experience, the integration strategies we explored, and the performance gains we achieved, to highlight how modern tools can transform data-intensive workflows.
## Problem
Machine learning pipelines typically involve:
- Reading data from disk or object storage.
- Transforming / preprocessing data (often CPU-bound).
- Feeding the data into GPUs for training or inference.
Although GPU compute is incredibly fast, the CPU can become a bottleneck when dealing with large datasets (see the profiles discussed below).

During the hackathon, we explored several ways of reducing this bottleneck.
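To make the pattern concrete, here is a minimal sketch of the classic loading loop where this problem arises. The dataset, shapes, and model here are illustrative stand-ins, not our actual benchmark code:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in for a dataset that reads and preprocesses samples on the CPU."""

    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        # In a real pipeline this would be disk/object-store I/O plus preprocessing
        return torch.randn(3, 64, 64)


loader = DataLoader(ToyDataset(), batch_size=32, num_workers=4)
model = torch.nn.Conv2d(3, 3, kernel_size=3).cuda()

for batch in loader:
    out = model(batch.cuda())  # the GPU idles here whenever CPU loading lags behind
```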
### Data & Code Overview
For this hackathon, we developed a benchmark that trains a U-Net model (with a ResNet backbone) on the ERA5 dataset to predict the next timestep. The U-Net model is implemented in PyTorch, and the training pipeline is built on the PyTorch `DataLoader`. The model can be trained on a single GPU or on multiple GPUs using Distributed Data Parallel (DDP) for parallelization.
-- TODO: Add an example image.
The basic data loader is implemented in `zarr_ML_optimization/train_unet.py` and the model is defined in `zarr_ML_optimization/model.py`. The training pipeline is designed to be flexible and can easily be adapted to different datasets and models.
More details on the model and training pipeline can be found in the [README](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/blob/main/zarr_ML_optimization/README.md) file in the `zarr_ML_optimization` folder.
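As a hedged illustration of the next-timestep setup described above, the class below sketches the sampling idea; the name and details are ours, not the exact code in `zarr_ML_optimization/train_unet.py`:

```python
import torch
from torch.utils.data import Dataset


class NextStepDataset(Dataset):
    """Each sample pairs timestep t (input) with timestep t+1 (target)."""

    def __init__(self, array):
        # array: anything indexable with shape (time, channel, height, width)
        self.array = array

    def __len__(self):
        return self.array.shape[0] - 1

    def __getitem__(self, t):
        x = torch.as_tensor(self.array[t])      # input state at time t
        y = torch.as_tensor(self.array[t + 1])  # target: state at time t+1
        return x, y
```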
### Performance Bottlenecks
First, we needed to identify the performance bottlenecks in our pipeline. We used NVIDIA's [Nsight Systems](https://developer.nvidia.com/nsight-systems) to profile our code and pinpoint the areas that needed optimization.
Here are some screenshots of the profiling results:
 | ||
 | ||
|
||
<div style={{ display: 'flex', gap: '1rem', justifyContent: 'center' }}>
  <img
    src='/posts/gpu-pipline/profiling_screenshot1.png'
    alt='Profiling timeline showing the GPU idle while the CPU loads data'
    style={{ width: '45%' }}
  />
  <img
    src='/posts/gpu-pipline/profiling_screenshot2.png'
    alt='Profiling timeline showing data loading dominating the pipeline'
    style={{ width: '45%' }}
  />
</div>
The profiling results clearly showed that the data loading step was the main bottleneck in our pipeline. We also noticed that the alternating CPU and GPU compute steps (i.e. data loading and model training) were not overlapping, which meant the GPU was often idle while waiting for the CPU to load data (first screenshot above).
This was confirmed by a few additional tests measuring the time spent on data loading versus model training. The results are shown below:
<div style={{ textAlign: 'center' }}>
  <img
    src='/posts/gpu-pipline/baseline.png'
    alt='Baseline throughput plot'
    style={{ width: '60%', maxWidth: '500px' }}
  />
</div>
In the plot above, we show the throughput of the data loading and training steps in our pipeline. The three bars represent:

- Real Data: baseline throughput of the end-to-end pipeline using real data.
- No Training (i.e. data loading throughput): throughput of data loading without any training, to isolate the time spent on data loading versus training.
- Synthetic Data (i.e. training throughput): throughput of training on synthetic data generated in memory, to remove the data loading bottleneck.
The results show that the data loading step is the main bottleneck in our pipeline, with **much** lower throughput than the training step.
## Hackathon: Putting it all together
Our initial profiling showed that data loading is a major bottleneck in this workflow.
During the hackathon, we tested the following strategies to improve data loading performance:
1. **Optimized chunking & compression**: We explored different chunking and compression strategies, and found that using Zarr v3 with chunking and compression aligned to our access pattern significantly improved data loading performance.
2. **GPU-native data loading** with Zarr v3 and KvikIO.
3. **GPU-based decompression** using `nvcomp`.
4. **NVIDIA DALI**: We explored integrating NVIDIA's Data Loading Library (DALI) into Xarray to facilitate efficient data loading and preprocessing directly on the GPU. DALI provides highly optimized building blocks and an execution engine for data processing, accelerating deep learning applications.
### Step 1: Optimized chunking :card_file_box:
The ERA5 dataset we were using had a sub-optimal chunking scheme of `{'time': 10, 'channel': C, 'height': H, 'width': W}`, which meant that a minimum of 10 timesteps of data was read even when we only needed 2 consecutive timesteps at a time (a 5x read amplification).
We decided to rechunk the data to align with our access pattern of one timestep at a time, while reformatting to Zarr v3.
The full script is available [here](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/blob/main/rechunk/era5_rechunking.ipynb), with the main code looking like so:
```python
import xarray as xr

ds: xr.Dataset = xr.open_dataset("ERA5.zarr", engine="zarr")
# Rechunk to match the access pattern of one timestep at a time
ds = ds.chunk({"time": 1, "level": 1, "latitude": 640, "longitude": 1280})
# Save to Zarr v3
ds.to_zarr("rechunked_ERA5.zarr", zarr_format=3)
```
For optimal performance, also consider:
1. Storing the data without compression (if not transferring over a network), as decompressing data can slow down read speeds. But see also GPU decompression with nvCOMP below. :wink:
2. Concatenating several data variables together **if** a single chunk size is too small (`<1MB`), at the expense of reducing the readability of the Zarr store.
   Having too many small chunks can be detrimental to read speeds; as a rule of thumb, a compressed chunk should be roughly between `1MB` and `100MB` for optimal reads.
   - Alternatively, wait for [sharding](https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding) to be supported for GPU buffers in zarr-python.
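As a quick sanity check on chunk sizes, a sketch like the following prints the uncompressed size of each variable's chunks; compressed chunks on disk will be smaller, so treat this as an upper bound. The store path is illustrative:

```python
import xarray as xr

# chunks={} makes dask mirror the on-disk Zarr chunking
ds = xr.open_dataset("rechunked_ERA5.zarr", engine="zarr", chunks={})

for name, var in ds.data_vars.items():
    chunk_shape = tuple(c[0] for c in var.chunks)  # shape of a typical chunk
    nbytes = var.dtype.itemsize
    for dim in chunk_shape:
        nbytes *= dim
    print(f"{name}: chunk shape {chunk_shape}, ~{nbytes / 1e6:.1f} MB uncompressed")
```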
The plot below shows the read performance of the original dataset vs. the rechunked dataset (with optimal chunk sizes) vs. an uncompressed Zarr v3 dataset.

TODO: ADD plot here.
### Step 2: Reading with zarr-python v3 + kvikIO :open_book:
The advent of [Zarr v3](https://zarr.dev/blog/zarr-python-3-release/) brought many improvements, including the ability to [read from Zarr stores into CuPy arrays (i.e. GPU memory)](https://github.com/zarr-developers/zarr-python/issues/2574).
Specifically, you can use the [`zarr-python`](https://github.com/zarr-developers/zarr-python) driver to read data from zarr->CPU->GPU, or the [`kvikio`](https://github.com/rapidsai/kvikio) driver to read data from zarr->GPU directly!
To benefit from these new features, we recommend installing:
- [`zarr>=3.0.3`](https://github.com/zarr-developers/zarr-python/releases/tag/v3.0.3)
- [`xarray>=2025.03.0`](https://github.com/pydata/xarray/releases/tag/v2025.03.0)
- [`kvikio>=25.04.00`](https://github.com/rapidsai/kvikio/releases/tag/v25.04.00)
Reading to GPU can be enabled using the [`zarr.config.enable_gpu()`](https://zarr.readthedocs.io/en/v3.0.6/user-guide/gpu.html) setting like so:
```python
import cupy as cp
import xarray as xr
import zarr

# Create a sample Zarr v3 store from the xarray tutorial dataset
airt = xr.tutorial.open_dataset("air_temperature", engine="netcdf4")
airt.to_zarr(store="/tmp/air-temp.zarr", mode="w", zarr_format=3, consolidated=False)

with zarr.config.enable_gpu():
    ds = xr.open_dataset("/tmp/air-temp.zarr", engine="zarr", consolidated=False)
    assert isinstance(ds.air.data, cp.ndarray)
```
Note that using `engine="zarr"` as above would still result in data being loaded into CPU memory before it goes to GPU memory.
If you prefer to bypass CPU memory, and have GPU Direct Storage (GDS) enabled, you can use the `kvikio` driver like so:
```python
import kvikio.zarr

with zarr.config.enable_gpu():
    # Open the store with GDS so chunks go straight from storage to GPU memory
    store = kvikio.zarr.GDSStore(root="/tmp/air-temp.zarr")
    ds = xr.open_dataset(filename_or_obj=store, engine="zarr")
    assert isinstance(ds.air.data, cp.ndarray)
```
This reads the data directly from the Zarr store into GPU memory, bypassing CPU memory altogether. This is especially useful for large datasets, as it reduces the amount of data that needs to be transferred between CPU and GPU memory.
[ TODO: add a figure showing this -- technically decompression is still done on CPU. ]
(Ongoing work) Eventually, once this [cupy-xarray Pull Request](https://github.com/xarray-contrib/cupy-xarray/pull/70) is merged (based on earlier work at https://xarray.dev/blog/xarray-kvikio), this can be simplified to:
```python
import cupy_xarray  # noqa: F401 (registers the "kvikio" engine)

ds = xr.open_dataset(filename_or_obj="/tmp/air-temp.zarr", engine="kvikio")
assert isinstance(ds.air.data, cp.ndarray)
```
How do these two methods, zarr (CPU) and kvikio (GPU), compare?
(TODO put in benchmark numbers here.)
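In the meantime, a minimal timing harness for such a comparison could look like the sketch below, reusing the `/tmp/air-temp.zarr` store from above. The helper function and trial count are our own choices, not part of any library API:

```python
import time

import kvikio.zarr
import xarray as xr
import zarr


def time_read(make_store, n_trials=5):
    """Average seconds to fully load the 'air' variable into memory."""
    elapsed = []
    for _ in range(n_trials):
        start = time.perf_counter()
        ds = xr.open_dataset(make_store(), engine="zarr", consolidated=False)
        ds.air.load()  # force the actual read
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


with zarr.config.enable_gpu():
    t_zarr = time_read(lambda: "/tmp/air-temp.zarr")  # zarr -> CPU -> GPU
    t_kvikio = time_read(lambda: kvikio.zarr.GDSStore(root="/tmp/air-temp.zarr"))  # zarr -> GPU
print(f"zarr driver: {t_zarr:.3f}s, kvikio driver: {t_kvikio:.3f}s")
```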
For kvikio performance improvements, you need GPU Direct Storage (GDS) enabled on your system. [GDS](https://docs.nvidia.com/datacenter/pgp/gds/index.html) is a feature of NVIDIA GPUs that allows the GPU to access data directly from storage, bypassing the CPU and reducing latency.
### Step 3: GPU-based decompression with nvCOMP :rocket:
For a fully GPU-native workflow, we can let the GPU do all of the work: reading the compressed data, decompression (using nvCOMP), any augmentation steps, through to ML model training.
 | ||
|
||
Sending compressed rather than uncompressed data to the GPU means less data transfer overall, reducing I/O latency from storage to device.
To unlock this, zarr-python needs to support GPU-based decompression codecs; one for Zstandard (Zstd) is currently being implemented in https://github.com/zarr-developers/zarr-python/pull/2863.
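For context, writing a Zstd-compressed Zarr v3 store today looks like the sketch below (compression still happens on the CPU; the codec in the PR above targets the decompression side on the GPU). The store path and compression level are illustrative:

```python
import xarray as xr
from zarr.codecs import ZstdCodec

airt = xr.tutorial.open_dataset("air_temperature")
airt.to_zarr(
    store="/tmp/air-temp-zstd.zarr",
    mode="w",
    zarr_format=3,
    consolidated=False,
    # Zstandard-compress each chunk of the 'air' variable
    encoding={"air": {"compressors": [ZstdCodec(level=3)]}},
)
```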
 | ||
|
||
The figure above shows a benchmark comparing CPU- vs GPU-based decompression, with and without GDS enabled.
Keep an eye on this space!
### Step 4: Overlapping CPU and GPU compute with NVIDIA DALI :twisted_rightwards_arrows:
Ideally, we want to minimize pauses where the device (GPU) is waiting for the host (CPU), or vice versa. This is one of the reasons we chose [NVIDIA DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html), which enables overlapping CPU and GPU computation.
Our full codebase is available at https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus for reference.
To start, the [zarr_DALI](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_DALI) folder contains short, self-contained examples of a DALI pipeline loading from Zarr.
Next, the [zarr_ML_optimization](https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus/tree/main/zarr_ML_optimization) folder contains an end-to-end example of how this DALI pipeline can be integrated into a PyTorch `DataLoader` and a full training workflow.
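To give a flavor of what this looks like, here is a minimal, self-contained sketch of a DALI pipeline fed from an external source. The synthetic `sample_source` stands in for a real Zarr-backed reader, and the names, shapes, and batch sizes are illustrative, not our actual training configuration:

```python
import numpy as np
from nvidia.dali import fn, pipeline_def, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator


def sample_source(sample_info):
    # Stand-in for reading one (channel, height, width) sample from Zarr
    return np.random.rand(3, 64, 64).astype(np.float32)


@pipeline_def(batch_size=4, num_threads=2, device_id=0)
def zarr_pipeline():
    sample = fn.external_source(source=sample_source, batch=False, dtype=types.FLOAT)
    return sample.gpu()  # later stages (augmentation, normalization) run on device


pipe = zarr_pipeline()
pipe.build()
loader = DALIGenericIterator([pipe], output_map=["x"])

batch = next(iter(loader))
x = batch[0]["x"]  # a CUDA torch.Tensor, prefetched while the GPU works on prior batches
```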
 | ||
|
||
(TODO: insert a better Nsight profiling figure than the above, showing overlapping CPU and GPU compute.)
## Going Forward
We started this work during a 3-day hackathon, where we made significant progress in optimizing data loading and processing for large-scale geospatial AI/ML workflows. The work is still ongoing, and we are excited to continue it.
We are continuing to explore the following areas:
- GPU Direct Storage (GDS) for optimal end-to-end performance.
- Deeper NVIDIA DALI integration.
- Working out how to use GDS when reading from cloud object storage instead of on-prem disk.
## Lessons Learned
- Chunking matters! It really does.
- GPU Direct Storage (GDS) can be a real improvement for data-intensive workflows, but be aware of the setup and configuration required.
- NVIDIA DALI is a powerful tool for optimizing data loading, but requires some effort to integrate into existing workflows.
- GPU-based decompression is a promising area for future work, but requires further development and testing.
## Acknowledgements
This work was developed during the [NCAR/NOAA Open Hackathon](https://www.openhackathons.org/s/siteevent/a0CUP00000rwYYZ2A2/se000355) in Golden, Colorado, from 18-27 February 2025. We would like to thank the Open Hackathons organizers for the opportunity to participate and learn from this experience.