Progress bar on open_mfdataset #9521

Open
nordam opened this issue Sep 19, 2024 · 2 comments

nordam commented Sep 19, 2024

Is your feature request related to a problem?

I'm using xarray.open_mfdataset() to open tens of thousands of (fairly small) netCDF files, and it's taking quite some time. Being of an impatient nature, I would like to at least be assured that something is happening, so a progress bar would be nice. I found an example of using a progress bar from dask here: #4000 (comment)

However, my attempt to adapt this solution doesn't show a progress bar. Any other options?

Here is the code I tried:

import xarray as xr
from dask.diagnostics import ProgressBar

# Attempt: wrap the open call itself in a dask progress bar
with ProgressBar():
    d = xr.open_mfdataset('proc/*.nc')

Describe the solution you'd like

I'd like to see a nice and fairly minimal progress bar, for example telling me how many files have been dealt with so far.

Describe alternatives you've considered

Something based on tqdm would be nice, but could also be something else.

Additional context

No response

nordam (Author) commented Sep 19, 2024

After discussion with a colleague, we ended up with this solution:

import xarray as xr
from dask.diagnostics import ProgressBar


with xr.open_mfdataset('proc/*.nc', chunks=dict(index=1)) as d, ProgressBar():
    d.load()

This works in the strict sense that it displays a progress bar, but unfortunately it does nothing (no progress bar visible) for a couple of minutes (for the set of files I tested), and then the progress bar shows up and runs through in a few seconds. In other words, not very useful for an impatient soul like me.

I should add that I'm testing this in a jupyter notebook.

keewis (Collaborator) commented Sep 21, 2024

Indeed, this does nothing unless you pass parallel=True to open_mfdataset. That option parallelizes access to the files by creating one dask task per open_dataset call, one for each file. Without it, open_dataset is called on each file in sequence without going through dask, so dask has nothing to report.

The activity you do see on the progress bar is the loading of each chunk into memory, which happens when you call d.load(), that is, after open_mfdataset has already returned.

What I think you should try is:

import xarray as xr
from dask.diagnostics import ProgressBar


with xr.open_mfdataset('proc/*.nc', chunks=dict(index=1), parallel=True) as d, ProgressBar():
    d.load()

(though you might not need the explicit chunks of index=1, this could also be chunks={})
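
Since open_dataset is called file by file when parallel=True is not set, one possible fallback is to run that loop by hand and report a count after each file. The following is only a minimal sketch under that assumption: open_each_with_progress and print_count are hypothetical helper names, not part of xarray or dask, and the right way to combine the resulting datasets depends on the files.

```python
import sys
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def print_count(done: int, total: int) -> None:
    """Default reporter: overwrite one line on stderr with 'done/total'."""
    end = "\n" if done == total else ""
    sys.stderr.write(f"\ropened {done}/{total} files{end}")
    sys.stderr.flush()


def open_each_with_progress(
    paths: Iterable[str],
    opener: Callable[[str], T],
    report: Callable[[int, int], None] = print_count,
) -> List[T]:
    """Call opener(path) on each path in order, reporting after every file."""
    path_list = list(paths)  # materialize so the total is known up front
    total = len(path_list)
    results: List[T] = []
    for done, path in enumerate(path_list, start=1):
        results.append(opener(path))
        report(done, total)
    return results


# Possible use with xarray (untested against the files in this issue):
#   import glob
#   import xarray as xr
#   datasets = open_each_with_progress(sorted(glob.glob("proc/*.nc")), xr.open_dataset)
#   d = xr.concat(datasets, dim="index")  # "index" is assumed from chunks=dict(index=1)
```

Because the reporter is a plain callback, a tqdm bar could presumably be substituted by passing something like `lambda done, total: bar.update(1)` as report. This loop is sequential, so it trades the speed-up of parallel=True for immediate feedback.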
