open_mfdataset very slow #7697
Comments
Looks like you almost got this figured out! Do you want to create a PR for this?
It seems that this problematic code is mostly used to determine the engine that is finally used to open the file. Did you try specifying the correct engine directly?
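As a sketch of that suggestion: passing `engine` explicitly to `open_mfdataset` skips the magic-number probe that reads the head of every file to guess a backend. The tiny generated files and the `scipy` backend here are just to keep the example self-contained; real use would point at existing data.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Illustrative setup: write two small single-variable files.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    ds = xr.Dataset(
        {"t": ("time", np.arange(3.0) + 3 * i)},
        coords={"time": np.arange(3.0) + 3 * i},
    )
    p = os.path.join(tmpdir, f"part{i}.nc")
    ds.to_netcdf(p, engine="scipy")  # netCDF3 via scipy, floats only
    paths.append(p)

# With engine given explicitly, xarray does not have to sniff each
# file's magic number to pick a backend.
combined = xr.open_mfdataset(paths, engine="scipy", combine="by_coords")
print(combined.sizes["time"])  # 6
```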
Fundamentally, xarray has to touch every file because there is no guarantee they are consistent with each other. A number of us now use kerchunk to create virtual aggregate datasets that can be read a lot faster.
@dcherian I'll look at that. @headtr1ck I was just informed that the underlying filesystem is actually a networked filesystem. The PR might still be useful, but the latest profile seems more reasonable in light of my new info.
We still construct a dataset representation for each file, which involves reading all coordinates etc. The consistency checking is bypassed at the "concatenation" stage. You could also speed this up with dask by setting up a cluster and opening the files in parallel.
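A hedged sketch of the dask suggestion above: `parallel=True` makes `open_mfdataset` open the per-file datasets via `dask.delayed`, so the per-file metadata reads can run concurrently instead of serially. The generated files are illustrative only; with a `dask.distributed` cluster attached, the same call would fan out across workers.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Illustrative files only; real datasets would already exist on disk.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    ds = xr.Dataset(
        {"t": ("time", np.arange(2.0))},
        coords={"time": np.arange(2.0) + 2 * i},
    )
    p = os.path.join(tmpdir, f"f{i}.nc")
    ds.to_netcdf(p, engine="scipy")
    paths.append(p)

# parallel=True wraps each per-file open in dask.delayed, so the
# expensive per-file metadata reads happen concurrently.
combined = xr.open_mfdataset(
    paths, engine="scipy", parallel=True, combine="by_coords"
)
print(combined.sizes["time"])  # 8
```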
What happened?
I am trying to open an mfdataset consisting of over 4400 files. The call completes in 342.735 s on my machine. After running a profiler, I discovered that most of that time is spent reading the first 8 bytes of each file. However, watching my system resource monitor, it looks like the entire file is being read (with a sustained 40-50 MB/s of read I/O for most of that time).
I traced the bottleneck to `xarray/core/utils.py`, line 662 (commit 96030d4).
I isolated the essence of this code by reading the first 8 bytes of each file. Profiling this on my directory of netCDF files took 137.587 s (not sure why this was faster than 264 s; caching, maybe?). After changing `fh.read(8)` to `fh.read1(8)`, the execution time dropped to 1.52 s.

What did you expect to happen?
I expected open_mfdataset to be quicker.
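The `read(8)`-vs-`read1(8)` experiment described above can be sketched as below. The throwaway file and its HDF5-style magic bytes are illustrative stand-ins for real netCDF files.

```python
import os
import tempfile

def magic_number(path, count=8):
    """Read only the first `count` bytes of a file, as the isolated
    benchmark does when sniffing a file's format."""
    with open(path, "rb") as fh:
        # read1() issues at most one call to the underlying raw
        # stream, whereas read() may keep reading to fill its buffer,
        # which on some networked filesystems pulls in far more than
        # the 8 bytes actually needed.
        return fh.read1(count)

# Throwaway file standing in for a real netCDF/HDF5 file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".nc") as tmp:
    tmp.write(b"\x89HDF\r\n\x1a\n" + b"\x00" * 4096)
    path = tmp.name

magic = magic_number(path)
print(magic)  # b'\x89HDF\r\n\x1a\n'
os.remove(path)
```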
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
I cannot share the netCDF files. I believe this issue to be isolated, and possibly triggered by the shared filesystems found on supercomputers.
Environment
xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: 1.6.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: None
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.5.1
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.11.0
sphinx: None