Skip to content

Some iris netcdf saves are fetching all the data. #5753

Closed
@pp-mo

Description

@pp-mo

From a support issue loading a large dataset from a GRIB file and saving to netcdf.
The example data was about 16Gb, and was running out of memory when saving.

It seems that at least some netcdf saves are not able to be correctly saved in a chunk-by-chunk streamed manner.

Simple example to reproduce

import dask.array as da
import iris
from iris.cube import Cube
import tracemalloc

# Reasonably sized usage example, big enough to show problem
nt, ny, nx = 50, 700, 1400
# The data must be a stack of lazy arrays.
# A single random array with these chunks does *not* show the problem.
chunk_arrays = [
    da.random.uniform(0., 1, size=(ny, nx))
    for _ in range(nt)
]
data = da.stack(chunk_arrays)
cube = Cube(data)
print(f"Cube to save : {cube.summary(shorten=True)}")
data_size_mb = data.size * data.dtype.itemsize * 1.0 / 1024 ** 2
print(f"Cube data shape = {data.shape}, size = {data_size_mb:.1f} Mb")

tracemalloc.start()
iris.save(cube, 'tmp.nc')
_, peak_mem = tracemalloc.get_traced_memory()

memory_mb = peak_mem * 1.0 / 1024 ** 2
print(f"Peak memory used during save : {memory_mb:.1f}Mb")

from which, a sample output :

Cube to save : unknown / (unknown)                 (-- : 50; -- : 700; -- : 1400)
Cube data shape = (50, 700, 1400), size = 373.8 Mb
Peak memory used during save : 377.9Mb

For comparison, it is OK in xarray (!! sorry 😬 !!) :

( using same data array )

import xarray as xr
xrda = xr.DataArray(data)

tracemalloc.stop()
tracemalloc.start()
xrda.to_netcdf('tmp.nc')
_, peak_mem = tracemalloc.get_traced_memory()

memory_mb = peak_mem * 1.0 / 1024 ** 2
print(f"Peak memory used during Xarray save : {memory_mb:.1f}Mb")

from which..

Peak memory used during Xarray save : 30.2Mb

Expected result

The total memory required is expected to be only be around N or maybe N+1 * chunk sizes,
where N is the number of Dask workers
-- which here that was =4 (threads or "cpus").
So in this case, approx 8Mb per chunk, expected ~32-40 Mb., as seen in the Xarray case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions