Use Case Notebook for "Atmospheric Moisture Budgets" #1

Closed · 5 tasks
rabernat opened this issue Aug 30, 2017 · 54 comments

@rabernat (Member)

We need a preliminary notebook that uses Xarray to calculate something related to this use case. It doesn't have to be super long or complicated. It just has to express a real-world, scientifically relevant calculation on a large dataset.

Requirements:

  • Carefully explain the science behind the calculation in words and equations.
  • Clearly document the input dataset (preferably a standard product like ERA-Interim or the NCAR Large Ensemble)
  • Use Xarray to load the dataset and, as much as possible, to make the calculations
  • Where Xarray can't accomplish the necessary tasks, clearly document the other functions and / or outside packages that need to be employed
  • If the analysis has been implemented in other languages (e.g. MATLAB, Ingrid, etc.), provide a link to the code and, if possible, the output (which can be used as validation)
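For reference, a minimal sketch of the shape such a notebook's core might take; the file pattern and the variable name "t2m" below are placeholders for illustration, not part of any specific use case:

```python
import xarray as xr

# Lazily open a multi-file dataset (dask-backed); the glob pattern is a placeholder.
ds = xr.open_mfdataset("/path/to/era_interim/*.nc")

# A toy scientifically flavored reduction: the time mean of one field.
climatology = ds["t2m"].mean(dim="time")
climatology.load()  # trigger the actual computation
```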

cc: @naomi-henderson

I would like to use this as a template for the other three use cases, so I would appreciate general feedback on the checklist above. Are these the right requirements?

@rabernat (Member Author)

We should add something about setting up the environment and enumerating the dependencies.

Here is my guide to setting up a conda environment for xarray work:
https://medium.com/@rabernat/custom-conda-environments-for-data-science-on-hpc-clusters-32d58c63aa95

But @darothen's "getting started" guide for gcpy is better and more comprehensive:
http://www.danielrothenberg.com/gcpy/getting_started.html

@rabernat (Member Author)

You also asked about the gist extension:
https://github.com/mozilla/jupyter-notebook-gist

@darothen (Member)

Please feel free to cannibalize my guide. If I can help in any way, do let me know.

@spencerahill

Curious, what motivated this (and "the other three") use case? Just something that satisfies your above constraints?

I ask because I have worked on the problem of closing atmospheric column tracer budgets computed post-hoc. Depending on how concerned you are with closure, this can be pretty involved. The Appendices of our recent paper touch on this: http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-16-0785.1.

But I suspect this is more detail than you are looking for? If not, I can provide the Python functions that I defined for these efforts (they relied on my infinite-diff package and were computed via an aospy pipeline).

@rabernat (Member Author)

@spencerahill: hopefully the project description I published earlier today clarifies what I meant by "use cases": these are the official use cases from the NSF proposal. However, we are certainly in need of as many use cases as possible and welcome contributions from everyone. If you have use cases you want to contribute, I would submit a PR to the pangeo-tutorials repo (currently empty, but hopefully will fill up with cool stuff!)

@rabernat (Member Author)

rabernat commented Oct 4, 2017

Just a heads-up that the lack of any use case notebooks is now blocking progress on our project. It would be great to get some basic use cases checked in. They don't have to be fancy! Any sort of semi-realistic workflow on real data is enough for the systems people to move forward with analyzing performance.

@naomi-henderson (Contributor)

The calculation of the eddy moisture transports is the most I/O-intensive part of the budget calculation, so I am developing a preliminary notebook which computes the monthly mean covariances from the 6-hourly data (e.g., u'q'). I will get the notebook working on cheyenne for the CMIP5 version of CESM. Once it is working, I will submit it to the group for critique before cleaning it up. Then I will start working on the vertical integrals, etc. Sound ok?
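For concreteness, a minimal sketch of that monthly-mean eddy covariance calculation; the dataset ds, the variable names ua/hus, and the resample call (written against the current xarray API) are assumptions for illustration, not the notebook's actual code:

```python
import xarray as xr

# ds is assumed to hold 6-hourly zonal wind "ua" and specific humidity "hus",
# e.g. from ds = xr.open_mfdataset("...") over the CMIP5 files.
monthly = ds[["ua", "hus"]].resample(time="M").mean()
uq_monthly = (ds["ua"] * ds["hus"]).resample(time="M").mean()

# Monthly-mean eddy covariance <u'q'> = <uq> - <u><q>, where <.> is the monthly mean.
uq_eddy = uq_monthly - monthly["ua"] * monthly["hus"]
```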

@mrocklin (Member)

mrocklin commented Oct 5, 2017

@naomi-henderson if you run into performance issues (like I/O bottlenecks) and want to try scaling things, let me know. I'd be happy to collaborate on accelerating things.

@rabernat (Member Author)

rabernat commented Oct 5, 2017

@naomi-henderson your plan sounds perfect

Also feel free to ping us for help with any xarray or general python questions you may encounter while developing your notebook. I know you're still spinning up on these tools, so we are here to help with that too.

@naomi-henderson (Contributor)

@rabernat - I have made a preliminary version of an eddy covariance notebook and it works fine, albeit slowly, on my local machine (one year requires processing 24 GB). I am having a few issues with the time grids, groupby, etc., as you can see here: covar notebook. I may have wandered off into the deep end ...

@rabernat (Member Author)

Awesome @naomi-henderson! This looks like a great start. I'm very pleased you have gotten your calculation to a preliminary working version.

I would love to sit with you for an hour on Wednesday and go through this in detail. It should be possible to eliminate the loops over files. Dask is unable to parallelize such operations, and they can also be confusing to read. The idea is to first create one big xarray dataset with everything in it (although not in memory) and then perform the desired grouping / aggregation on it using xarray's built-in capabilities. You should never have to manually accumulate an average, as you do in your notebook--although this is a very natural place to start for someone coming from MATLAB or another lower-level language!
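A minimal sketch of the contrast described above, with placeholder paths and variable names:

```python
import xarray as xr
from glob import glob

files = sorted(glob("/path/to/model/run/*.nc"))  # placeholder path

# Manual accumulation (the pattern to avoid): open each file, add up sums and counts.
# total, count = 0, 0
# for f in files:
#     ds_f = xr.open_dataset(f)
#     total = total + ds_f["ua"].sum(dim="time")
#     count = count + ds_f.sizes["time"]
# time_mean = total / count

# xarray style: one big lazy Dataset, then a built-in aggregation.
ds = xr.open_mfdataset(files)
time_mean = ds["ua"].mean(dim="time")
```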

I'm free all day on Wednesday, so please suggest a time that works for you!

@mrocklin (Member)

It might be useful to keep this notebook around to show before-and-after examples. This Rosetta-stone-style comparison might help other scientists more accustomed to C++/Fortran style development understand XArray-style development and vice versa.

@mrocklin (Member)

We only have a few opportunities for this kind of comparison before everyone gets exposed to both types.

@naomi-henderson (Contributor)

Thanks @rabernat, I was hoping you would have time to help me with this! Does 10am on Wed work for you?

@rabernat (Member Author)

rabernat commented Oct 10, 2017 via email

@naomi-henderson (Contributor)

Thanks to @rabernat this morning, I have a new version of the preliminary eddy covariance notebook: New version, no loops. I love the suggestion by @mrocklin to keep my preliminary notebooks around to help others migrate to the xarray/dask world!

I hope to check it out on cheyenne (using the 'export LD_LIBRARY_PATH=' temporary fix) and perhaps even using the dask dashboard at some point.

My next step is to make a new notebook which does the vertical interpolation of the monthly data from sigma levels to pressure levels. I know that there are python scripts that do this on the command line, file by file, but we need something that works on the whole xarray Dataset. Expect many questions.
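As a rough starting point for that interpolation notebook, a sketch of column-wise interpolation with xr.apply_ufunc; the synthetic dataset, the variable names (ta, p), the dimension names (lev, plev), and the target levels are all assumptions for illustration:

```python
import numpy as np
import xarray as xr

# A tiny synthetic dataset standing in for the real monthly sigma-level data.
ds = xr.Dataset(
    {
        "ta": (("time", "lev", "lat"), 250.0 + np.random.rand(2, 5, 3)),
        "p": (("time", "lev", "lat"),
              np.linspace(10000.0, 100000.0, 5)[None, :, None] * np.ones((2, 5, 3))),
    }
)

def column_to_pressure(field, p_model, p_target):
    # np.interp needs monotonically increasing x; model-level pressure is assumed
    # to increase downward along "lev".
    return np.interp(p_target, p_model, field)

p_target = np.array([85000.0, 70000.0, 50000.0, 25000.0])  # Pa, placeholder levels

# This simple version assumes in-memory (NumPy-backed) data; dask-backed data would
# additionally need dask="parallelized" plus output sizes.
ta_on_p = xr.apply_ufunc(
    column_to_pressure,
    ds["ta"],
    ds["p"],
    kwargs={"p_target": p_target},
    input_core_dims=[["lev"], ["lev"]],
    output_core_dims=[["plev"]],
    vectorize=True,  # apply the 1-D interpolation column by column
).assign_coords(plev=("plev", p_target))
```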

@mrocklin (Member)

Cool looking notebook @naomi-henderson ! Makes me wish I understood what you all were doing :)

Do you have any thoughts about the experience? Both on what felt pleasant and on what felt frustrating at first? It would be useful to understand your thought process as you're going through this.

@mrocklin (Member)

(only if you feel like sharing, didn't mean to put you on the spot)

@naomi-henderson (Contributor)

I forgot to mention the (many) reasons why the eddy covariance notebook is still preliminary. Although it works for the NCAR/CCSM4 model, it will not work for the models with multiple ensemble members. This will require adding a new 'member' dimension.

The notebook also needs various checks for the existence of the correct variables as well as debugging for all possible calendars (leap, no leap, 360 days/year, etc).

@rabernat (Member Author)

rabernat commented Oct 11, 2017

> Makes me wish I understood what you all were doing :)

@mrocklin: the markdown cell at the end of the notebook gives a description of the calculation and data used. But it probably still assumes too much domain knowledge and uses lots of jargon. Let us know how we can improve this to make it more understandable to a wider audience.

@rabernat (Member Author)

rabernat commented Oct 11, 2017

> I forgot to mention the (many) reasons why the eddy covariance notebook is still preliminary. Although it works for the NCAR/CCSM4 model, it will not work for the models with multiple ensemble members. This will require adding a new 'member' dimension.
>
> The notebook also needs various checks for the existence of the correct variables as well as debugging for all possible calendars (leap, no leap, 360 days/year, etc).

@spencerahill: any thoughts on how aospy might fit into the CMIP data processing workflow that Naomi is describing above?

@spencerahill

> Although it works for the NCAR/CCSM4 model, it will not work for the models with multiple ensemble members. This will require adding a new 'member' dimension.
>
> The notebook also needs various checks for the existence of the correct variables as well as debugging for all possible calendars (leap, no leap, 360 days/year, etc).
>
> any thoughts on how aospy might fit into the CMIP data processing workflow that Naomi is describing above?

These kinds of things are definitely in the spirit of aospy, but it may be overkill to implement a full aospy-based workflow in this case. Frankly, we require a fair amount of boilerplate at present, which I think cuts against the intent of the notebook. (Perhaps, @rabernat, a separate notebook specifically presenting an aospy workflow for automating calculations across CMIP runs would be welcome?)

For ensemble members, this is where xarray's data model shines: insert a new dimension into each dataset, called e.g. ens_mem, and then combine the datasets over it via xr.concat. The dimension can be left without a coordinate, in which case it simply indexes the different ensemble members. Or you can add an ens_mem coordinate that stores each ensemble member's string identifier, which for CMIP data looks like 'r1i1p1', 'r2i1p1', etc. Also, aospy's support for ensemble members is currently broken: see spencerahill/aospy#76 😆
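A minimal sketch of that ens_mem approach (the paths are placeholders):

```python
import xarray as xr

members = ["r1i1p1", "r2i1p1", "r3i1p1"]  # CMIP-style ensemble member IDs
datasets = [xr.open_mfdataset(f"/path/to/{m}/*.nc") for m in members]  # placeholder paths

# Concatenate along a new "ens_mem" dimension and label it with the member IDs;
# drop the assign_coords call to leave the dimension coordinate-free.
ens = xr.concat(datasets, dim="ens_mem").assign_coords(ens_mem=members)
```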

For the calendars, again xarray does the heavy lifting -- so long as the calendars and time data are CF-compliant (which CMIP data all should be), xarray should decode it properly without much trouble. Where aospy adds additional functionality is in dealing with dates outside the (dare I say narrow) range of permissible dates for np.datetime64 (@naomi-henderson, see e.g. here for background). This is not an issue for the example given currently in the notebook, but it is in many other cases.

See e.g. aospy.utils.times.numpy_datetime_workaround_encode_cf, which re-encodes the (not yet CF-decoded) time array of a Dataset with out-of-bounds dates so that, once decoded, the time array starts at (nearly) the earliest permissible date. @spencerkclark has much more expertise on calendars/datetimes than I do, both within aospy and without, so Spencer feel free to chime in.
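For anyone running into that np.datetime64 limit, a minimal illustration of the avoidance side (this is not aospy's actual workaround; the file name is a placeholder):

```python
import xarray as xr

# Skip automatic time decoding so xarray does not try to build np.datetime64 values
# outside the representable range; the raw numeric times and their CF "units"
# attribute can then be inspected, shifted, or re-encoded by hand before decoding.
ds = xr.open_dataset("paleo_control_run.nc", decode_times=False)
print(ds["time"].values[:3], ds["time"].attrs.get("units"))
```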

@naomi-henderson (Contributor)

Thanks @spencerahill for the hints on creating a new dimension for ensemble member.

I have now tried the eddy covariance notebook on cheyenne, using Jupyter Lab and the Dask dashboard. @mrocklin and @rabernat, I am certainly having performance issues, but it looks like the time is mostly spent reading in the CMIP5 data from '/glade/p/CMIP/CMIP5/output1/...'

The open_mfdataset call
%time ds = xr.open_mfdataset(all_files, preprocess=set_correct_coords)
takes over 5 min on cheyenne:

CPU times: user 4.93 s, sys: 3.17 s, total: 8.1 s
Wall time: 5min 13s

Whereas the same call on my linux box here at Lamont (data stored on an internal 4 TB drive) takes less than 10 s:

CPU times: user 5.91 s, sys: 5.51 s, total: 11.4 s
Wall time: 8.31 s

However the resampling and load() from 6-hourly to monthly
%time dsr = ds.resample(freq='M', dim='time').load()
on cheyenne takes:

CPU times: user 2min 24s, sys: 2min 11s, total: 4min 35s
Wall time: 3min 26s

compared to slightly longer time on my linux box:

CPU times: user 4min 12s, sys: 5min 44s, total: 9min 57s
Wall time: 4min 15s

This is using the sample dask.sh - two nodes, 36 cores each on cheyenne.

@mrocklin (Member)

How many files are in your dataset? @rabernat is this possibly the slow file system thing going on where we might want to avoid opening every individual file to check metadata?

@pwolfram

Is this related to the 'too many open files' issue we had in xarray a while back?

@naomi-henderson (Contributor)

naomi-henderson commented Oct 13, 2017

In this test case, I open only 12 files in order to process a single year of a 55-year dataset. Is there a way to turn off the metadata and coordinate checking entirely? Ryan and I did pre-process the files, trying to avoid the overhead of checking metadata.

Since it is only a problem on cheyenne and not on my linux box I was assuming it was something in the file system. The directory structure and files are the same on both.

@pwolfram

I'm guessing if you try %prun ds = xr.open_mfdataset(all_files, preprocess=set_correct_coords, autoclose=False) you get the same results because autoclose=False by default: https://github.com/pydata/xarray/blob/27132fba1e465955e76ed155ba6cbb769c0904df/xarray/backends/api.py#L135, so this is something else as @rabernat notes.

@mrocklin (Member)

Seeing all the time being spent in _thread.lock.acquire is a signal that this code is actually running within a multi-threaded system (like local-machine dask) and that the profiler isn't able to easily see within it.

The solution is to tell dask to use a single local thread and then try again

import dask
dask.set_options(get=dask.local.get_sync)

I get the following profile output

         33579 function calls (32771 primitive calls) in 18.746 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    12065   17.978    0.001   17.978    0.001 {method 'is_dir' of 'posix.DirEntry' objects}
      2/1    0.300    0.150    0.326    0.326 {built-in method _imp.create_dynamic}
        6    0.209    0.035    0.209    0.035 {built-in method _operator.getitem}
       38    0.098    0.003    0.098    0.003 {built-in method posix.lstat}
        1    0.074    0.074    0.074    0.074 netCDF4_.py:182(_open_netcdf4_group)
      835    0.029    0.000   18.012    0.022 glob.py:114(_iterdir)
        3    0.011    0.004    0.011    0.004 {method 'read' of '_io.FileIO' objects}
       54    0.007    0.000    0.007    0.000 {built-in method posix.stat}
       82    0.004    0.000    0.004    0.000 {built-in method posix.scandir}
        3    0.004    0.001    0.016    0.005 <frozen importlib._bootstrap_external>:830(get_data)
        2    0.003    0.001    0.003    0.001 {netCDF4._netCDF4.num2date}
        2    0.002    0.001    0.002    0.001 conventions.py:222(nctime_to_nptime)
  350/216    0.001    0.000    0.002    0.000 indexing.py:363(shape)
       49    0.001    0.000    0.001    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
      238    0.001    0.000    0.001    0.000 posixpath.py:73(join)
     2243    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     31/4    0.001    0.000    0.002    0.000 sre_parse.py:470(_parse)
      717    0.001    0.000    0.001    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
       36    0.001    0.000    0.001    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Dataset' objects}
       38    0.001    0.000    0.008    0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
       26    0.000    0.000    0.009    0.000 <frozen importlib._bootstrap>:861(_find_spec)
       43    0.000    0.000   17.779    0.413 glob.py:79(_glob1)
       14    0.000    0.000    0.002    0.000 netCDF4_.py:220(open_store_variable)
   145/39    0.000    0.000    0.236    0.006 glob.py:132(_rlistdir)

@mrocklin (Member)

So there is a ton of filesystem stuff going on (which is not surprising). I'm also working through the notebook now with a cluster active and am seeing some odd behavior. If anyone with knowledge of XArray internals has some time I would appreciate a video chat where I screenshare and we look at a few things. There is a lot of short-sequential behavior going on.

cc @jhamman or @shoyer if he is interested in seeing things run on a cluster.

@mrocklin (Member)

This line can take a surprisingly long time when we first start up

all_files = glob(os.path.join(ddir, '**/*/latest/[h,u,v]*/*1960*.nc'), recursive=True)

@mrocklin (Member)

The odd behavior that I'm seeing is tons of very small sequential computations. It is as though somewhere in XArray there is a for loop calling compute many times over each of the chunks.

@rabernat (Member Author)

Is that sequential computation happening during open_mfdataset?

@mrocklin (Member)

Yes, also resample

@mrocklin (Member)

If you'd like to see I can easily spin up a cluster and screenshare

@mrocklin (Member)

Or maybe I'll make a quick screencast

@rabernat (Member Author)

rabernat commented Oct 13, 2017

Neither of these operations is parallelized. They are basically loops.

open_mfdataset just opens each file in a list and then loops over them again when concatenating.

resample works like groupby: it loops over each group. The inability to parallelize groupby is a known issue (pydata/xarray#585) which may have just very recently been fixed by pydata/xarray#1517. (Not sure that PR gets all the way to groupby though.)

You have arrived at one of the cruxes of xarray's performance bottlenecks. Welcome!

@rabernat (Member Author)

> If you'd like to see I can easily spin up a cluster and screenshare

I personally don't need to see. I think I know exactly what your bokeh dashboard will show, because I see it myself every time I do these things: serial dispatching of dask tasks to the cluster.

A screencast would be very useful, however, to communicate the problem to a wider audience (e.g. xarray devs).

@mrocklin (Member)

@rabernat can you point me to that loop?

@rabernat (Member Author)

> @rabernat can you point me to that loop?

You mean the one in groupby? I'm pretty sure the core loop is here:
https://github.com/pydata/xarray/blob/master/xarray/core/groupby.py#L609

@jhamman (Member)

jhamman commented Oct 13, 2017

@mrocklin - I'm guessing you're working off the last release of xarray. I think pydata/xarray#1551 and a few other associated improvements may help with some of what you're seeing here. These changes are on the master branch now and may be worth trying out.

@mrocklin (Member)

Yes, that helped considerably. I also made the chunk sizes along time smaller, which both helped a lot with the parallelism and avoided crashing the main machine:

ds = xr.open_mfdataset(all_files, chunks={'time': 10}, preprocess=set_correct_coords)

The computations at the end happen in around 20-30 seconds now. I think that the next largest performance boost would be to use persist so that we can keep data in RAM rather than recomputing it each time.
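A minimal sketch of that persist step, assuming a dask.distributed client is attached and writing the resample against the newer API used in the updated notebook:

```python
# Materialize the monthly means once in (distributed) memory so later plotting
# cells reuse them instead of re-reading and re-reducing the source files.
dsr = ds.resample(time="M").mean()   # older xarray: ds.resample(freq="M", dim="time")
dsr = dsr.persist()
```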

@mrocklin (Member)

Yeah, most of the time in the plotting lines is just redoing the data loading from disk.

Updated Notebook: https://gist.github.com/4800484676155745f3cf318bce2e78a4

@shoyer

shoyer commented Oct 13, 2017

Currently, a call to open_mfdataset calls open_dataset many times in a loop, which means that coordinate data is loaded serially to create indexes. We could significantly speed this up by refactoring to either:

  1. load coordinate data in parallel, or
  2. skip loading coordinate data in cases where we know it should be consistent ahead of time. This should probably use a defined standard like NcML. Googling turns up a few Python libraries: 1, 2. Probably it would be a good idea to chat with the THREDDS developers about this, too.
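A rough sketch of option 1, assuming the all_files list from the notebook; the concat dimension is an assumption, and newer xarray exposes essentially this pattern as open_mfdataset(..., parallel=True):

```python
import dask
import xarray as xr

@dask.delayed
def open_one(path):
    # chunks={} keeps the data variables lazy (dask-backed); coordinates still load here.
    return xr.open_dataset(path, chunks={})

# Open all files concurrently via dask, then concatenate the results.
datasets = dask.compute(*[open_one(f) for f in all_files])
ds = xr.concat(datasets, dim="time")
```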

@naomi-henderson (Contributor)

Much better @mrocklin ! The new version is also working much faster on my machine.

However, I am having trouble with your new resample line:
dsr = ds.resample(time='M').mean()
which does not work on my linux box:

TypeError                                 Traceback (most recent call last)
<timed exec> in <module>()

TypeError: resample() got an unexpected keyword argument 'time'

instead I still need:
ds.resample(freq='M', dim='time')

Is this a difference in our versions of xarray? (0.9.6)

@mrocklin (Member)

Yes, I think so. I got this warning message when I switched to xarray master. I suspect that if you switch, things will get even faster for you as well.

pip install git+https://github.com/pydata/xarray.git --upgrade

@rabernat (Member Author)

Now that we have the sphinx documentation site working, this use-case notebook can be added into docs/use_cases/.

@naomi-henderson: would you be comfortable making a pull request to add the latest and greatest version of your notebook to this repo?

@naomi-henderson (Contributor)

@rabernat: I will give it a try, but it will have to wait until tomorrow. I will let you know if I have any trouble with Sphinx.

@rabernat (Member Author)

Great! You don't necessarily have to build the docs yourself locally. You can just add your notebook to the docs/use_cases/ folder and then add it to the toctree in docs/index.rst.

@stale

stale bot commented Jun 15, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 15, 2018
@stale

stale bot commented Jun 22, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed Jun 22, 2018
TomAugspurger pushed a commit that referenced this issue Jul 20, 2020