slow performance with open_mfdataset #1385
cc: @geosciz, who is helping with this project. |
Yes, adding a boolean argument to skip that check could work. But more generally, I am a little surprised by how slow this step is (xarray/xarray/core/alignment.py, lines 302 to 306 in ab4ffee).
Basically, if … |
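For reference, in current xarray versions that kind of hint can be expressed through open_mfdataset's keyword arguments rather than a dedicated flag. A minimal sketch, assuming the files differ only along time (the glob and option values are illustrative, not from this thread):

```python
import xarray as xr

# Sketch: trust the first file's indexes instead of reindexing every file,
# and skip equality checks on variables that are not being concatenated.
ds = xr.open_mfdataset(
    "data/*.nc",          # hypothetical glob
    combine="nested",
    concat_dim="time",    # only concatenate along time
    join="override",      # skip index alignment across files
    compat="override",    # take non-concatenated variables from the first file
)
```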
An update on this long-standing issue. I have learned that open_mfdataset can be very fast with decode_cf=False but extremely slow with decoding enabled. As an example, I am loading a POP dataset on cheyenne. Anyone with access can try this example.

```python
base_dir = '/glade/scratch/rpa/'
prefix = 'BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001'
code = 'pop.h.nday1.SST'
glob_pattern = os.path.join(base_dir, prefix, '%s.%s.*.nc' % (prefix, code))

def non_time_coords(ds):
    return [v for v in ds.data_vars
            if 'time' not in ds[v].dims]

def drop_non_essential_vars_pop(ds):
    return ds.drop(non_time_coords(ds))

# this runs almost instantly
ds = xr.open_mfdataset(glob_pattern, decode_times=False, chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop, decode_cf=False)
```

And it returns this dataset. This is roughly 45 years of daily data, one file per year. Instead, if I just change decode_cf=False to decode_cf=True, opening the dataset becomes extremely slow. This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372. |
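For reference, the two-step pattern that comes up later in this thread (open with decoding disabled, then decode the assembled dataset) looks like the sketch below. It reuses glob_pattern and drop_non_essential_vars_pop from the snippet above; note that on the old xarray versions discussed here it still ran into the lazy-decoding problem described in the next comments.

```python
import xarray as xr

# Step 1: assemble the lazy, concatenated dataset with CF decoding turned off (fast).
ds = xr.open_mfdataset(glob_pattern, decode_times=False, decode_cf=False,
                       chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop)

# Step 2: apply CF decoding to the already-assembled, dask-backed dataset.
ds = xr.decode_cf(ds)
```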
@rabernat How does performance compare if you call xr.decode_cf() on the dataset opened with decode_cf=False? |
Calling xr.decode_cf() on the lazily opened dataset gives variables whose data looks like this:

```python
>>> ds.SST.variable._data
LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<_apply_mask, shape=(16401, 2400, 3600), dtype=float32, chunksize=(1, 2400, 3600)>), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None))))
```

Calling getitem on this array triggers the whole dask array to be computed, which would take forever and would completely blow out the notebook memory. This is because of #1372, which would be fixed by #1725. This has actually become a major showstopper for me. I need to work with this dataset in decoded form.

Versions:
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.12.62-60.64.8-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.1
```
|
OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. |
Was this fixed by #2047? |
I can confirm that opening with decode_cf=False is much faster (seconds vs. minutes) than opening with the default decode_cf=True. For reference, data_fnames is a list of 5 files, each of which is ~75 GB. |
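A sketch of the kind of timing comparison being described; data_fnames and the chunk sizes here are placeholders:

```python
import time
import xarray as xr

data_fnames = ["file1.nc", "file2.nc"]  # placeholder list of large netCDF files

for decode in (False, True):
    t0 = time.perf_counter()
    ds = xr.open_mfdataset(data_fnames, chunks={"time": 50}, decode_cf=decode)
    print(f"decode_cf={decode}: {time.perf_counter() - t0:.1f} s to build the lazy dataset")
```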
@chuaxr I assume you're testing this with xarray 0.11? It would be good to do some profiling to figure out what is going wrong here. |
Yes, I'm on 0.11. Nothing displays on the task stream / progress bar when using open_mfdataset. [%prun output for decode_cf=False and decode_cf=True not shown.] If I repeat the open_mfdataset call for another 5 files (after opening the first 5), I occasionally get a warning. I only began using the dashboard recently; please let me know if there's something basic I'm missing. |
@chuaxr What do you see when you use %prun? One way to fix this would be to move our call to decode_cf() so that it runs after the underlying data has been wrapped in dask arrays (lines 270 to 296 in f547ed0).
In practice, this is the difference between using xarray's internal lazy array classes for decoding and using dask for decoding. I would expect to see small differences in performance between these approaches (especially when actually computing data), but for constructing the computation graph I would expect them to have similar performance. It is puzzling that dask is orders of magnitude faster -- that suggests that something else is going wrong in the normal code path for decode_cf. |
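To make that distinction concrete: CF decoding is essentially a mask-and-scale transform, and applying it to dask-backed data only adds a few tasks to the graph. A rough, illustrative sketch (not xarray's internal implementation):

```python
import numpy as np
import dask.array as da

def cf_decode_lazily(raw, attrs):
    """Mask-and-scale a dask array lazily (illustrative only)."""
    data = raw.astype("float64")
    if "_FillValue" in attrs:
        data = da.where(data == attrs["_FillValue"], np.nan, data)
    return data * attrs.get("scale_factor", 1.0) + attrs.get("add_offset", 0.0)

# Nothing is computed until .compute() / .values is called.
raw = da.random.random((100, 100), chunks=(50, 50))
decoded = cf_decode_lazily(raw, {"scale_factor": 0.01, "add_offset": 273.15})
```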
Sorry, I think the speedup had to do with accessing a file that had previously been loaded rather than due to decode_cf. Here's the output of %prun using two different files of approximately the same size (~75 GB), run from a notebook without using distributed (which doesn't lead to any speedup).

Output of %prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc', chunks={'lat':20,'time':50,'lon':12,'pfull':11}):

```
780980 function calls (780741 primitive calls) in 55.374 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7   54.448    7.778   54.448    7.778 {built-in method _operator.getitem}
   764838    0.473    0.000    0.473    0.000 core.py:169(<genexpr>)
        3    0.285    0.095    0.758    0.253 core.py:169(<listcomp>)
        2    0.041    0.020    0.041    0.020 {cftime._cftime.num2date}
        3    0.040    0.013    0.821    0.274 core.py:173(getem)
        1    0.027    0.027   55.374   55.374 <string>:1(<module>)
```

Output of %prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.2001010100-2002123123.temp.nc', chunks={'lat':20,'time':50,'lon':12,'pfull':11}, decode_cf=False):

```
772212 function calls (772026 primitive calls) in 56.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5   55.213   11.043   55.214   11.043 {built-in method _operator.getitem}
   764838    0.486    0.000    0.486    0.000 core.py:169(<genexpr>)
        3    0.185    0.062    0.671    0.224 core.py:169(<listcomp>)
        3    0.041    0.014    0.735    0.245 core.py:173(getem)
        1    0.027    0.027   56.001   56.001 <string>:1(<module>)
```

/work isn't a remote archive, so it surprises me that this should happen. |
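For reference, the same profile can be collected outside IPython with the standard-library profiler; a sketch using the file path and chunk sizes from the comment above:

```python
import cProfile
import pstats
import xarray as xr

# Profile the graph-construction step of open_mfdataset (equivalent to %prun above).
fname = "/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc"

profiler = cProfile.Profile()
profiler.enable()
ds = xr.open_mfdataset(fname, chunks={"lat": 20, "time": 50, "lon": 12, "pfull": 11})
profiler.disable()

pstats.Stats(profiler).sort_stats("tottime").print_stats(10)
```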
Does it take 10 seconds even to open a single file? The big mystery is what that top line ("_operator.getitem") is, but my guess is it's netCDF4-python. h5netcdf might also give different results...
|
h5netcdf fails with the following error (presumably the file is not compatible): [error message not shown]
Using scipy: [output not shown]
|
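For context, switching backends is a matter of the engine argument, and whether an engine can read a given file depends on the file's format (the scipy engine, for example, only reads netCDF3). A sketch, reusing the file path from the thread:

```python
import xarray as xr

fname = "/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc"

for engine in ("netcdf4", "h5netcdf", "scipy"):
    try:
        ds = xr.open_dataset(fname, engine=engine, chunks={"time": 50})
        print(engine, "opened OK")
    except Exception as exc:  # e.g. the format is not supported by this engine
        print(engine, "failed:", exc)
```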
I have the same problem. open_mfdataset is 10X slower than nc.MFDataset. I used the following code to get some timing on opening 456 local netCDF files located in a nc_local directory (total size of 532 MB):

```python
clef = 'nc_local/*.nc'
t00 = time.time()
l_fichiers_nc = sorted(glob.glob(clef))
print ('timing glob: {:6.2f}s'.format(time.time()-t00))

# netcdf4
t00 = time.time()
ds1 = nc.MFDataset(l_fichiers_nc)
#dates1 = ouralib.netcdf.calcule_dates(ds1)
print ('timing netcdf4: {:6.2f}s'.format(time.time()-t00))

# xarray
t00 = time.time()
ds2 = xr.open_mfdataset(l_fichiers_nc)
print ('timing xarray: {:6.2f}s'.format(time.time()-t00))

# xarray tune
t00 = time.time()
ds3 = xr.open_mfdataset(l_fichiers_nc, decode_cf=False, concat_dim='time')
ds3 = xr.decode_cf(ds3)
print ('timing xarray tune: {:6.2f}s'.format(time.time()-t00))
```

The output I get is:

```
timing glob: 0.00s
timing netcdf4: 3.80s
timing xarray: 44.60s
timing xarray tune: 15.61s
```

I made tests on a CentOS server using Python 2.7 and 3.6, and on macOS as well with Python 3.6. The timing changes but the ratios are similar between netCDF4 and xarray. Is there any way of making open_mfdataset go faster?

In case it helps, here are outputs from xr.show_versions and %prun xr.open_mfdataset(l_fichiers_nc). I do not know anything about the output of %prun, but I have noticed that the first two lines of the output are different depending on whether I'm using Python 2.7 or Python 3.6. I made those tests on CentOS and macOS with anaconda environments.

For Python 2.7:

```
13996351 function calls (13773659 primitive calls) in 42.133 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2664   16.290    0.006   16.290    0.006 {time.sleep}
      912    6.330    0.007    6.623    0.007 netCDF4_.py:244(_open_netcdf4_group)
```

For Python 3.6:

```
9663408 function calls (9499759 primitive calls) in 31.934 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5472   15.140    0.003   15.140    0.003 {method 'acquire' of '_thread.lock' objects}
      912    5.661    0.006    5.718    0.006 netCDF4_.py:244(_open_netcdf4_group)
```

Longer output of %prun with Python 3.6:

```
9663408 function calls (9499759 primitive calls) in 31.934 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5472   15.140    0.003   15.140    0.003 {method 'acquire' of '_thread.lock' objects}
      912    5.661    0.006    5.718    0.006 netCDF4_.py:244(_open_netcdf4_group)
     4104    0.564    0.000    0.757    0.000 {built-in method _operator.getitem}
 133152/129960   0.477  0.000    0.660    0.000 indexing.py:496(shape)
 1554550/1554153 0.414  0.000    0.711    0.000 {built-in method builtins.isinstance}
      912    0.260    0.000    0.260    0.000 {method 'close' of 'netCDF4._netCDF4.Dataset' objects}
     6384    0.244    0.000    0.953    0.000 netCDF4_.py:361(open_store_variable)
      910    0.241    0.000    0.595    0.001 duck_array_ops.py:141(array_equiv)
    20990    0.235    0.000    0.343    0.000 {pandas._libs.lib.is_scalar}
 37483/36567     0.228  0.000    0.230    0.000 {built-in method builtins.iter}
    93986    0.219    0.000    1.607    0.000 variable.py:239(__init__)
    93982    0.194    0.000    0.194    0.000 variable.py:706(attrs)
    33744    0.189    0.000    0.189    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
    15511    0.175    0.000    0.638    0.000 core.py:1776(normalize_chunks)
     5930    0.162    0.000    0.350    0.000 missing.py:183(_isna_ndarraylike)
 297391/296926   0.159  0.000    0.380    0.000 {built-in method builtins.getattr}
   134230    0.155    0.000    0.269    0.000 abc.py:180(__instancecheck__)
     6384    0.142    0.000    0.199    0.000 netCDF4_.py:34(__init__)
    93986    0.126    0.000    0.671    0.000 variable.py:414(_parse_dimensions)
   156545    0.119    0.000    0.811    0.000 utils.py:450(ndim)
    12768    0.119    0.000    0.203    0.000 core.py:747(blockdims_from_blockshape)
     6384    0.117    0.000    2.526    0.000 conventions.py:245(decode_cf_variable)
 741183/696380   0.116  0.000    0.134    0.000 {built-in method builtins.len}
 41957/23717     0.110  0.000    4.395    0.000 {built-in method numpy.core.multiarray.array}
    93978    0.110    0.000    0.110    0.000 variable.py:718(encoding)
   219940    0.109    0.000    0.109    0.000 _weakrefset.py:70(__contains__)
    99458    0.100    0.000    0.440    0.000 variable.py:137(as_compatible_data)
    53882    0.085    0.000    0.095    0.000 core.py:891(shape)
   140604    0.084    0.000    0.628    0.000 variable.py:272(shape)
     3192    0.084    0.000    0.170    0.000 utils.py:88(_StartCountStride)
    10494    0.081    0.000    0.081    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    44688    0.077    0.000    0.157    0.000 variables.py:102(unpack_for_decoding)
```

Output of xr.show_versions():

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
xarray: 0.11.0
pandas: 0.24.1
numpy: 1.15.4
scipy: None
netCDF4: 1.4.2
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.1.1
distributed: 1.25.3
matplotlib: 3.0.2
cartopy: None
seaborn: None
setuptools: 40.7.3
pip: 19.0.1
conda: None
pytest: None
IPython: 7.2.0
sphinx: None
```
|
Looks like you're using xarray v0.11.0, but the most recent one is v0.11.3. There have been several changes since then which might affect this, so try that first.
|
I just tried and it did not help ...
|
Indeed @dcherian -- it took some experimentation to get the right engine to support parallel execution, and even then results are still mixed, which to me means further work is needed to isolate the issue. Along the lines of suggestions here (thanks @jmccreight for pointing this out), we've introduced a very practical pre-processing step that rewrites the datasets so that the read is not striped across the file system, effectively isolating the performance bottleneck to a place where it can be dealt with independently. Of course, such an asynchronous workflow is not possible in all situations, so we're still looking at improving the direct performance. Two notes as we keep working:
|
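A sketch of the kind of pre-processing rewrite described above: copy each file to a locally rewritten netCDF before calling open_mfdataset. The helper and paths are hypothetical, not part of xarray:

```python
import os
import xarray as xr

def rewrite_locally(src_paths, out_dir):
    """Rewrite each source file locally so later reads are not striped across the
    original file system (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    new_paths = []
    for src in src_paths:
        dst = os.path.join(out_dir, os.path.basename(src))
        with xr.open_dataset(src) as ds:
            ds.to_netcdf(dst)  # stream the data into a clean local copy
        new_paths.append(dst)
    return new_paths

# local_files = rewrite_locally(remote_files, "/scratch/rewritten")  # remote_files is hypothetical
# ds = xr.open_mfdataset(local_files, combine="by_coords")
```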
@rabernat wrote:
I seem to be experiencing a similar (same?) issue with open_dataset: |
Hi Tom! 👋 So much has evolved about xarray since this original issue was posted. However, we continue to use it as a catchall for people looking to speed up open_mfdataset. I saw your stackoverflow post. Any chance you could post a link to the actual file in question? |
Thanks, Ryan! Sure-- here's a link to the file: (I could post to a web server if there's any reason to prefer that.) |
In general that would be a little more convenient than Google Drive, because then we could download the file from Python (rather than having a manual step). This would allow us to share a fully copy-pasteable code snippet to reproduce the issue. But don't worry about that for now. First, I'd note that your issue is not really related to open_mfdataset. |
Thank you, Ryan. I will post the file to a server with a stable URL and replace the google drive link in the other post. My original issue was that I wanted to not read the data (yet), only to have a look at the metadata. |
Ah, OK. If that is your goal, then you don't need to decode the data at all (e.g. open with decode_times=False). There is a problem with the time encoding in this file: the units describe very high-frequency (20 Hz) data relative to a time origin decades in the past, which is what trips up the time decoding. |
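If only the metadata is needed, the problematic decoding can be skipped entirely; a sketch (the file name is a placeholder):

```python
import xarray as xr

# Open lazily and skip time decoding so the header can be inspected quickly.
ds = xr.open_dataset("file.nc", decode_times=False)
print(ds)                              # dims, variables, attrs; the data stays on disk
print(ds["time"].attrs.get("units"))   # look at the raw time units string
ds.close()
```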
Thank you. A member of my research group made the netcdf file, so we will make a second file with the time encoding fixed. |
See deeper dive in #6284 |
I've recently been trying to run open_mfdataset on a large collection of files and have hit this slowness myself. To me the first solution feels better, and I can make a pull request to do this. From reading this thread I'm under the impression that there's probably something else going on under the hood that's causing the slowness of open_mfdataset. |
@fraserwg do you know what the performance bottleneck is in your case? (i.e., if you interrupt the computation, what is being computed?) |
It is very common for different netCDF files in a "dataset" (a folder) to be encoded differently, so we can't set decode_cf=False by default. What you're describing sounds like a failure of lazy decoding or a related problem. |
@shoyer I will double-check what the bottleneck is and report back. @dcherian interestingly, the … |
I'm another person who stumbled across this thread, and found that passing decode_cf=False makes a large difference for my files. I appreciate that we can't set this by default, but maybe this could be put into the docstring of open_mfdataset? It also seems like the … xarray version: 2024.2.0 |
@ashjbarnes Are you able to share two small files that illustrate the issue? |
@dcherian Thanks, the files are shared in this Google Drive folder. This is a spatial subsample of the same files. Perhaps it's a bad idea to store static data like the lat/lon coordinates in each file? The overall size of this is so tiny compared to the 4D data that I left the coordinates there for convenience, but I'm not sure whether this has broader implications.

In my tests, running open_mfdataset with decode_cf=False on ~3000 files of 300 MB each gave an order-of-magnitude speedup over the same command with the default decode_cf=True. |
Wow, thank you! This is an amazing bug. The default values of data_vars, coords, and compat mean that when the files are decoded, those static lat/lon variables get compared across every single file, and that comparison is what makes the default call so slow. Once again, this is a consequence of bad defaults for open_mfdataset. I would follow https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets and use data_vars='minimal', coords='minimal', compat='override'. |
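Concretely, the approach recommended on that user-guide page looks roughly like the sketch below (the glob is a placeholder, and concat_dim should match the real concatenation dimension):

```python
import xarray as xr

ds = xr.open_mfdataset(
    "run*/output*.nc",       # placeholder glob
    combine="nested",
    concat_dim="time",
    data_vars="minimal",     # only concatenate variables that have a time dimension
    coords="minimal",        # same for coordinates, so static lat/lon are not compared per file
    compat="override",       # take non-concatenated variables from the first file, no equality checks
    parallel=True,           # open the files in parallel with dask.delayed
)
```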
Well done @dcherian, great find! Changing the defaults does seem like a really good idea in this case. |
@dcherian I am yet another person stumbling on this problem. Unfortunately |
A simpler fix is probably |
This issue is almost seven years old! It has been "fixed" many times since my original post, but people keep finding new ways to make it reappear. 😆 It seems like having better diagnostics / logging of what is happening under the hood with open_mfdataset is what people really need. Maybe even some sort of utility to pre-scan the files and figure out if they are easily openable or not. |
Both of those seem like great ideas. Maybe there could be a verbose or logging mode to help users identify what is wrong with the files (e.g., where the time is being spent and whether that seems suspicious). It is probably true that people (like me) will keep finding new ways to generate problematic netcdf files. (I'm sure we can think of something even worse than 20 Hz data referenced to a time origin 75 years ago). |
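A sketch of what such a pre-scan utility might look like: time a single-file open for each file and report the raw time units so problem files stand out. This is a hypothetical helper, not an existing xarray feature:

```python
import glob
import time
import xarray as xr

def prescan(paths, slow_threshold=5.0):
    """Time a single-file open for each path and flag suspicious cases."""
    report = []
    for path in paths:
        t0 = time.perf_counter()
        with xr.open_dataset(path, decode_cf=False) as ds:
            units = ds["time"].attrs.get("units", "<none>") if "time" in ds else "<no time variable>"
        elapsed = time.perf_counter() - t0
        report.append((path, elapsed, units))
        if elapsed > slow_threshold:
            print(f"SLOW ({elapsed:.1f}s): {path} -- raw time units: {units}")
    return report

# report = prescan(sorted(glob.glob("data/*.nc")))  # hypothetical glob
```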
Xarray's open_mfdataset is very slow for BOUT++ datasets. Other datasets also have issues (see e.g. pydata/xarray#1385), though the cause may not be the same. Using an implementation that opens the datasets and concatenates them significantly speeds up this process.
We have a dataset stored across multiple netCDF files. We are getting very slow performance with open_mfdataset, and I would like to improve this.

Each individual netCDF file looks like this: [output not shown]

As shown above, a single data file opens in ~60 ms.

When I call open_mfdataset on 49 files (each with a different time dimension but the same npart), here is what happens: [output not shown]

It takes over 2 minutes to open the dataset. Specifying concat_dim='time' does not improve performance.

Here is %prun of the open_mfdataset command: [profile output not shown]

It looks like most of the time is being spent on reindex_variables. I understand why this happens: xarray needs to make sure the dimensions are the same in order to concatenate them together.

Is there any obvious way I could improve the load time? For example, can I give a hint to xarray that this reindex_variables step is not necessary, since I know that all the npart dimensions are the same in each file?

Possibly related to #1301 and #1340.