
open_mfdataset very slow #7697

Open
2 of 4 tasks
groutr opened this issue Mar 29, 2023 · 6 comments
Comments

@groutr

groutr commented Mar 29, 2023

What happened?

I am trying to open an mfdataset consisting of over 4400 files. The call completes in 342.735s on my machine. After running a profiler, I discovered that most of that time is spent reading the first 8 bytes of each file. However, watching my system resource monitor, it looks like each entire file is being read (with a sustained 40-50MB of read IO for most of that time).

I traced the bottleneck down to

magic_number = filename_or_obj.read(count)
According to my profile, 264.381s (77%) of the execution time is spent on this line.

I isolated the essence of this code by reading the first 8 bytes of each file:

for f in files:
    with open(f, 'rb') as fh:
        if fh.tell() != 0:
            fh.seek(0)
        magic = fh.read(8)
        fh.seek(0)

Profiling this loop on my directory of netCDF files took 137.587s (not sure why this was faster than the 264s above; caching, maybe?). Changing fh.read(8) to fh.read1(8) dropped the execution time to 1.52s.
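For reference, the difference can be sketched with the standard library alone. The b"CDF\x01" magic bytes and the temporary file below are illustrative stand-ins, not the original data; the exact read-ahead behavior that made read(8) slow depends on the filesystem:

```python
import os
import tempfile

# Create a small stand-in file that starts with the netCDF classic
# magic bytes (illustrative only; the original issue used real files).
with tempfile.NamedTemporaryFile(suffix=".nc", delete=False) as tf:
    tf.write(b"CDF\x01" + b"\x00" * 4096)
    path = tf.name

with open(path, "rb") as fh:
    # read1(8) makes at most ONE call to the underlying raw stream,
    # which is enough for magic-number sniffing. In the reporter's
    # environment this avoided whatever read-ahead made read(8) slow
    # on the networked filesystem.
    magic = fh.read1(8)

os.unlink(path)
```

On a local filesystem both calls are fast; the gap only showed up on the shared/networked filesystem described below.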

What did you expect to happen?

I expected open_mfdataset to be quicker.

Minimal Complete Verifiable Example

import xarray as xr
import pathlib

files = [... <list of 4400 filenames> ...]
# This takes almost 6 minutes to finish.
D = xr.open_mfdataset(files, compat='override', coords='minimal', data_vars='minimal')

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

I cannot share the netCDF files. I believe this issue to be isolated to my environment, possibly triggered by the shared filesystems found on supercomputers.

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.80.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, None)
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: 1.6.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: None
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.5.1
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.11.0
sphinx: None

@groutr added the bug and needs triage labels Mar 29, 2023
@Illviljan
Contributor

Looks like you almost have this figured out! Do you want to create a PR for this?

@headtr1ck
Collaborator

It seems that this problematic code is mostly used to determine the engine that is used to finally open it. Did you try specifying the correct engine directly?

@headtr1ck added the topic-performance and io labels and removed the bug and needs triage labels Mar 29, 2023
@groutr
Author

groutr commented Mar 29, 2023

It seems that this problematic code is mostly used to determine the engine that is used to finally open it. Did you try specifying the correct engine directly?

I tried setting the engine to 'netcdf4' and while it did help a little bit, it still seems slow on my system.

Here is my profile with engine='netcdf4'
[profile screenshot: slowmfdataset]

I'm not sure what to make of this profile. I don't see anything in the file_manager that would be especially slow. Perhaps it is a filesystem bottleneck at this point, given that CPU time accounts for only 132s of the total 288s duration.

@dcherian
Contributor

Fundamentally, xarray has to touch every file because there is no guarantee they are consistent with each other.

A number of us now use kerchunk to create virtual aggregate datasets that can be read a lot faster.

@groutr
Author

groutr commented Mar 29, 2023

@dcherian I'll look at that. I thought the compat='override' option bypassed most of the consistency checking. In my case, it is typically safe to assume the set of files are consistent (each file represents one timestep, the structure of each file is otherwise identical).

@headtr1ck I was just informed that the underlying filesystem is actually a networked filesystem. The PR might still be useful, but the latest profile seems more reasonable in light of my new info.

@dcherian
Contributor

I thought the compat='override' option bypassed most of the consistency checking.

We still construct a dataset representation for each file, which involves reading all coordinates, etc. The consistency checking is only bypassed at the "concatenation" stage.

You could also speed this up with dask by setting up a cluster and passing parallel=True.
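The effect of parallel=True is, roughly, that each per-file open runs concurrently instead of serially (xarray wraps each open_dataset call in dask.delayed). A stdlib-only analogy of that idea, where read_magic, open_all, and the temporary files are illustrative stand-ins rather than xarray's actual implementation:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_magic(path):
    # Stand-in for the per-file work (in xarray this is open_dataset).
    with open(path, "rb") as fh:
        return fh.read1(8)

def open_all(paths, workers=8):
    # Dispatch the per-file reads to a thread pool; with parallel=True,
    # xarray similarly farms the opens out to dask workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_magic, paths))

# Illustrative files (the original issue had ~4400 real netCDF files).
tmpdir = Path(tempfile.mkdtemp())
paths = []
for i in range(5):
    p = tmpdir / f"file_{i}.nc"
    p.write_bytes(b"CDF\x01" + bytes([i]) * 16)
    paths.append(p)

magics = open_all(paths)
```

On a high-latency networked filesystem, overlapping the per-file opens like this is often where most of the wall-clock win comes from.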

Projects: None yet
Development: No branches or pull requests
4 participants