Description
What happened?
I'm not too sure this is a bug report, but I figured I would share some of the investigation I've done on the topic of writing large datasets to NetCDF.
For clarity, the use case I'm considering is writing a large in-memory array to persistent storage on Linux.
- Array size: 4-100 GB
- File format: NetCDF
- Dask: no
- Hardware: a very modern SSD that can write at more than 1 GB/s (a Sabrent Rocket 4 Plus, for example)
- Operating system: Linux
- xarray version: 0.21.1 (or thereabouts)
The symptoms are twofold:
- The write speed is slow: about 1 GB/s, much less than the 2-3 GB/s you can get by other means.
- The Linux disk cache just keeps filling up.
It's quite hard to get good performance from these systems, so I'm going to put a few more constraints on the type of data we are writing:
- The underlying numpy array must be aligned to the Linux page boundary of 4096 bytes.
- The underlying numpy array must have been pre-faulted and not swapped. (Do not use np.zeros; it doesn't fault the memory. See the sketch after this list.)
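To make the pre-faulting point concrete, here is a minimal sketch (the array size and variable name are mine): np.zeros can hand back lazily-mapped zero pages, so the kernel only faults each page in on the first actual write.

import numpy as np

a = np.zeros(2**30, dtype=np.uint8)  # pages are typically not faulted in yet
a[...] = 0  # writing (even zeros) touches every page and faults it in now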
I feel these two constraints are rather easy to satisfy, as I'll show in my example.
What did you expect to happen?
I want to be able to write at 3.2 GB/s with my shiny new SSD.
I want to leave my RAM unused when I'm archiving to disk.
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr


def empty_aligned(shape, dtype=np.float64, align=4096):
    if not isinstance(shape, tuple):
        shape = (shape,)
    dtype = np.dtype(dtype)
    # Compute the final size of the array in bytes
    size = dtype.itemsize
    for s in shape:
        size *= s
    # Over-allocate, then slice so the data pointer lands on an aligned address
    a = np.empty(size + (align - 1), dtype=np.uint8)
    data_align = a.ctypes.data % align
    offset = 0 if data_align == 0 else (align - data_align)
    arr = a[offset:offset + size].view(dtype)
    # Don't use reshape, since reshape might copy the data.
    # Assigning to .shape is the suggested way to set a new shape
    # with the guarantee that the data won't be copied.
    arr.shape = shape
    return arr


dataset = xr.DataArray(
    empty_aligned((4, 1024, 1024, 1024), dtype='uint8'),
    name='mydata').to_dataset()

# Fault and write data to this dataset
dataset['mydata'].data[...] = 1

%time dataset.to_netcdf("test", engine='h5netcdf')
%time dataset.to_netcdf("test", engine='netcdf4')
Relevant log output
Both take about 3.5 s, equivalent to just about 1 GB/s.
To get to about 3 GB/s (about 1.27 s to write a 4 GB array), one needs to do a few things:
- You must align the underlying data to the disk's block boundaries.
- You must use a driver that bypasses the operating system cache (HDF5's DIRECT driver, i.e. O_DIRECT).
For the h5netcdf backend, you would have to add the following kwargs to the h5netcdf constructor:
kwargs = {
    "invalid_netcdf": invalid_netcdf,
    "phony_dims": phony_dims,
    "decode_vlen_strings": decode_vlen_strings,
    "alignment_threshold": alignment_threshold,
    "alignment_interval": alignment_interval,
}
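For reference, here is a sketch of what those settings look like when applied directly at the h5py level, bypassing xarray, and reusing dataset from the example above. This assumes an h5py/HDF5 build that exposes the DIRECT driver and the alignment keywords; as noted at the end of this report, my h5py carries additions for the DIRECT driver, so treat this as illustrative rather than something a stock install is guaranteed to support.

import h5py

# Sketch: align every sizeable object to a 4096-byte boundary in the file
# (H5Pset_alignment) and bypass the Linux page cache with the DIRECT
# (O_DIRECT) file driver. Requires HDF5 built with the direct VFD.
with h5py.File(
    "test_direct.h5",
    "w",
    driver="direct",
    alignment_threshold=4096,
    alignment_interval=4096,
) as f:
    f.create_dataset("mydata", data=dataset["mydata"].data)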
Anything else we need to know?
The main challenge is that while writing aligned data this way is REALLY fast, writing small chunks and unaligned data becomes REALLY slow.
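To illustrate why, here is a minimal sketch of a raw O_DIRECT write on Linux, reusing empty_aligned from the example above (the file name is arbitrary). O_DIRECT requires the buffer address, the file offset, and the transfer length to all be multiples of the device block size, which is exactly what small or unaligned writes violate.

import os

# An O_DIRECT write must have an aligned buffer, offset, and length
# (typically multiples of 512 B or 4096 B); unaligned writes fail with EINVAL.
buf = empty_aligned(4 * 1024 * 1024, dtype='uint8')
buf[...] = 1  # pre-fault the pages

fd = os.open("direct_test.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT)
try:
    os.write(fd, memoryview(buf))
finally:
    os.close(fd)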
Personally, I think someone might be able to write a new HDF5 driver that optimizes this better. I feel this could help people loading large datasets, who seem to be a large part of the xarray user community.
Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.9 (main, Dec 29 2021, 07:47:36)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.21.1
pandas: 1.4.0
numpy: 1.22.2
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.13.1
h5py: 3.6.0.post1
Nio: None
zarr: None
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.01.1
distributed: None
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
setuptools: 60.8.1
pip: 22.0.3
conda: None
pytest: None
IPython: 8.0.1
sphinx: None
The h5py listed above includes some additions of mine that allow you to use the DIRECT driver, and I am using a version of HDF5 that is built with the DIRECT driver enabled.