opening datasets slow compared to netCDF4 #9058

harzer99 · 2024-05-31T13:15:08Z

harzer99
May 31, 2024

Hi everyone I am working with netcdf4 files. To my suprise it takes me over 5 minutes to open a 1Gb file. I also tried it with the netCDF4 library, which to my understanding is a dependency of xarray. Here it only takes a couple of seconds. I already did some research around the open_dataset performance issues. Most of them seemed to be related to time decoding. So I tried to disable it and also all other decoding options I found. But I wasn't able to see an improvement. Code example below and file can be downloaded here:

import xarray as xr
from netCDF4 import Dataset
file_path = 'biodiversity/bioscen15-sdm-gam_ewembi_nobc_hist_nosoc_co2_birdprob_global_30year-mean_1995_1995.nc4'

with Dataset(file_path) as netcdf_file:
    print(netcdf_file) # ~ 5 s




with xr.open_dataset(file_path, engine = 'netcdf4', chunks=({'time':10, 'lat': 10,'lon':10}),  decode_times=False, decode_cf=False, decode_coords = False) as xrf:
    print(xrf)  # > 5 min

Also I am working on a mac with the M3 chip.

I appreciate any idea on how to improve performance, because the current state is unusable for me.

Best regards :)

Answered by dcherian

Jun 4, 2024

Apparently netCDF4._netCDF4.Variable.shape is quite slow:

%time ds = nc.Dataset("/Users/deepak/Downloads/bioscen15-sdm-gam_ewembi_nobc_hist_nosoc_co2_birdprob_global_30year-mean_1995_1995.nc4")
%time ds.variables["Abeillia_abeillei"].shape

CPU times: user 915 ms, sys: 44.4 ms, total: 960 ms
Wall time: 963 ms
CPU times: user 17 ms, sys: 3.11 ms, total: 20.1 ms
Wall time: 20.1 ms

20ms * 8500 vars = 170 seconds. This gets run twice so at least 340 seconds :) . But we can avoid that quite easily: 50% speedup here: https://github.com/pydata/xarray/pull/9067/files

Here's a profile for open_store_variable, I made a small edit to time NetCDF4ArrayWrapper separately.

Line #      Hits         Ti…

View full answer

kmuehlbauer · 2024-05-31T14:03:42Z

kmuehlbauer
May 31, 2024
Maintainer

@harzer99 I'd think the large amount of variables (~8500) is the bottleneck here. They have to be aligned wrt dimensions and probably these checks take quite some time.

As all variables are of the same structure (time, lat, lon) and even the attributes are the same, they might just merged into one very large variable (eg. time, vname, lat, lon) or something like that with an additional coordinate vname consisting of the variable names.

I'd need to look up if any of the CLI tools (eg. cdo) are capable of doing that. If not, you might just do this once for your data using xarray. Maybe others have some more easy solution to this problem.

2 replies

trexfeathers May 31, 2024

Recommend ncdata!

harzer99 Jun 3, 2024
Author

Thanks for the suggestions. I will have a look into it.

dcherian · 2024-05-31T16:50:19Z

dcherian
May 31, 2024
Maintainer

Why are you specifying chunks here? If you take it out, they should be very similar.

1 reply

harzer99 Jun 3, 2024
Author

I in some thread, that it (force) enables Dask. It ended up being just another thing I tried.

harzer99 · 2024-06-03T15:38:29Z

harzer99
Jun 3, 2024
Author

I tried out many different librarlies for opening the .nc4 file. I ended up building a function that loads the file with the h5py library copies the whole data into a dictionary and then creates a xr.Dataset from that dictionary. This only takes 23s. The function xr.load_dataset, which would provide a similar functionality takes 515s.
The code:

import h5py
import xarray as xr
import time
from tqdm import tqdm


def custom_load(file_path):
    with h5py.File(file_path, "r") as hdf_file:
        datasets = {}

        for ds_name, ds in tqdm(hdf_file.items()):
            if len(ds.shape) == 3:
                metadata = {attr_name: ds.attrs[attr_name] for attr_name in ds.attrs}
                ds_dict = {
                    "attrs": metadata,
                    "data": ds[:],
                    "dims": ["lat", "long", "prob"],
                }
                datasets[ds_name] = ds_dict

        return datasets


t0 = time.time()
file_path = "biodiversity/historical/bioscen15-sdm-gam_ewembi_nobc_hist_nosoc_co2_birdprob_global_30year-mean_1995_1995.nc4"
ds_dict = custom_load(file_path)

# runtime: 23s, Size in memory: 18GB 
ds = xr.Dataset.from_dict(ds_dict)
print(f"custom load runtime {time.time()-t0}")
print(f"custom loaded dataset {ds}")
del(ds)

# runtime: 515s, Size in memory: 18GB
t0 = time.time()
ds = xr.load_dataset(file_path, decode_times=False)
print(f"xarray loaded dataset {ds}")
print(f"xr load runtime {time.time()-t0}")

Could someone verify this behaviour with an x86 machine? There is a chance that one of the compiled functions xarray is calling is not compiled for apple silicon. Afaik the machine then falls back onto a emulator, which migth be the performance issue.

6 replies

harzer99 Jun 4, 2024
Author

I already tried it. I am quite sure it is not an issue of the backend, but something xarray is doing on top. I tried h5py and netcdf4 on their own and they both are over an order of magnitude faster.

dcherian Jun 4, 2024
Maintainer

Is this file publicly accessible

kmuehlbauer Jun 4, 2024
Maintainer

@dcherian It's linked in the OP.

https://data.isimip.org/datasets/f5aff8e3-df1d-44d8-996a-ac9a8aa2e415/

dcherian Jun 4, 2024
Maintainer

This dataset has 8500 variables!!!

ncdump -h bioscen15-sdm-gam_ewembi_nobc_hist_nosoc_co2_birdprob_global_30year-mean_1995_1995.nc4 | grep "(time, lat, lon)" | wc -l

kmuehlbauer Jun 4, 2024
Maintainer

Yes, I was assuming in my earlier post, that the sheer amount of variables and the subsequent checks are the bottleneck.

dcherian · 2024-06-04T15:34:46Z

dcherian
Jun 4, 2024
Maintainer

Apparently netCDF4._netCDF4.Variable.shape is quite slow:

%time ds = nc.Dataset("/Users/deepak/Downloads/bioscen15-sdm-gam_ewembi_nobc_hist_nosoc_co2_birdprob_global_30year-mean_1995_1995.nc4")
%time ds.variables["Abeillia_abeillei"].shape

CPU times: user 915 ms, sys: 44.4 ms, total: 960 ms
Wall time: 963 ms
CPU times: user 17 ms, sys: 3.11 ms, total: 20.1 ms
Wall time: 20.1 ms

20ms * 8500 vars = 170 seconds. This gets run twice so at least 340 seconds :) . But we can avoid that quite easily: 50% speedup here: https://github.com/pydata/xarray/pull/9067/files

Here's a profile for open_store_variable, I made a small edit to time NetCDF4ArrayWrapper separately.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   420                                               def open_store_variable(self, name: str, var):
   421       100      63000.0    630.0      0.0          import netCDF4
   422                                           
   423       100     133000.0   1330.0      0.0          dimensions = var.dimensions
   424       404    1652000.0   4089.1      0.0          attributes = {k: var.getncattr(k) for k in var.ncattrs()}
   425       100 1744036000.0    2e+07     49.5          wrapped = NetCDF4ArrayWrapper(name, self)
   426       100     959000.0   9590.0      0.0          data = indexing.LazilyIndexedArray(wrapped)
   427       100      13000.0    130.0      0.0          encoding: dict[str, Any] = {}
   428       100      82000.0    820.0      0.0          if isinstance(var.datatype, netCDF4.EnumType):
   429                                                       encoding["dtype"] = np.dtype(
   430                                                           data.dtype,
   431                                                           metadata={
   432                                                               "enum": var.datatype.enum_dict,
   433                                                               "enum_name": var.datatype.name,
   434                                                           },
   435                                                       )
   436                                                   else:
   437       100      17000.0    170.0      0.0              encoding["dtype"] = var.dtype
   438       100      99000.0    990.0      0.0          _ensure_fill_value_valid(data, attributes)
   439                                                   # netCDF4 specific encoding; save _FillValue for later
   440       100   32673000.0 326730.0      0.9          filters = var.filters()
   441       100      18000.0    180.0      0.0          if filters is not None:
   442       100      51000.0    510.0      0.0              encoding.update(filters)
   443       100      78000.0    780.0      0.0          chunking = var.chunking()
   444       100      17000.0    170.0      0.0          if chunking is not None:
   445       100      23000.0    230.0      0.0              if chunking == "contiguous":
   446         2          0.0      0.0      0.0                  encoding["contiguous"] = True
   447         2       1000.0    500.0      0.0                  encoding["chunksizes"] = None
   448                                                       else:
   449        98      12000.0    122.4      0.0                  encoding["contiguous"] = False
   450        98      32000.0    326.5      0.0                  encoding["chunksizes"] = tuple(chunking)
   451        98     161000.0   1642.9      0.0                  encoding["preferred_chunks"] = dict(zip(var.dimensions, chunking))
   452                                                   # TODO: figure out how to round-trip "endian-ness" without raising
   453                                                   # warnings from netCDF4
   454                                                   # encoding['endian'] = var.endian()
   455       100      72000.0    720.0      0.0          pop_to(attributes, encoding, "least_significant_digit")
   456                                                   # save source so __repr__ can detect if it's local or not
   457       100      15000.0    150.0      0.0          encoding["source"] = self._filename
   458       100 1738262000.0    2e+07     49.4          encoding["original_shape"] = var.shape
   459                                           
   460        99    1852000.0  18707.1      0.1          return Variable(dimensions, data, attributes, encoding)

Here's a profile for NetCDF4ArrayWrapper.__init__:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    60                                               def __init__(self, variable_name, datastore):
    61       100      10000.0    100.0      0.0          self.datastore = datastore
    62       100      11000.0    110.0      0.0          self.variable_name = variable_name
    63                                           
    64       100    1670000.0  16700.0      0.0          array = self.get_array()
    65       100 1729020000.0    2e+07     50.2          self.shape = array.shape
    66                                           
    67       100      47000.0    470.0      0.0          dtype = array.dtype
    68       100      29000.0    290.0      0.0          if dtype is str:
    69                                                       # use object dtype (with additional vlen string metadata) because that's
    70                                                       # the only way in numpy to represent variable length strings and to
    71                                                       # check vlen string dtype in further steps
    72                                                       # it also prevents automatic string concatenation via
    73                                                       # conventions.decode_cf_variable
    74                                                       dtype = coding.strings.create_vlen_dtype(str)
    75        99 1714627000.0    2e+07     49.8          self.dtype = dtype

2 replies

harzer99 Jun 5, 2024
Author

Thanks for looking into it! I need to wait for the next release unless I install it from the git repository?

dcherian Jun 5, 2024
Maintainer

Yes but your custom function is still a lot faster (somehow avoids .shape I guess) so you should just do that.

Uh oh!

opening datasets slow compared to netCDF4 #9058

Uh oh!

Uh oh!

harzer99 May 31, 2024

Replies: 4 comments · 11 replies

Uh oh!

kmuehlbauer May 31, 2024 Maintainer

Uh oh!

trexfeathers May 31, 2024

Uh oh!

harzer99 Jun 3, 2024 Author

Uh oh!

dcherian May 31, 2024 Maintainer

Uh oh!

harzer99 Jun 3, 2024 Author

Uh oh!

harzer99 Jun 3, 2024 Author

Uh oh!

harzer99 Jun 4, 2024 Author

Uh oh!

dcherian Jun 4, 2024 Maintainer

Uh oh!

kmuehlbauer Jun 4, 2024 Maintainer

Uh oh!

dcherian Jun 4, 2024 Maintainer

Uh oh!

kmuehlbauer Jun 4, 2024 Maintainer

Uh oh!

dcherian Jun 4, 2024 Maintainer

Uh oh!

harzer99 Jun 5, 2024 Author

Uh oh!

dcherian Jun 5, 2024 Maintainer

harzer99
May 31, 2024

Replies: 4 comments 11 replies

kmuehlbauer
May 31, 2024
Maintainer

harzer99 Jun 3, 2024
Author

dcherian
May 31, 2024
Maintainer

harzer99 Jun 3, 2024
Author

harzer99
Jun 3, 2024
Author

harzer99 Jun 4, 2024
Author

dcherian Jun 4, 2024
Maintainer

kmuehlbauer Jun 4, 2024
Maintainer

dcherian Jun 4, 2024
Maintainer

kmuehlbauer Jun 4, 2024
Maintainer

dcherian
Jun 4, 2024
Maintainer

harzer99 Jun 5, 2024
Author

dcherian Jun 5, 2024
Maintainer