xesmf included in pangeo.pydata.org ? #309
Comments
This is a great point @naomi-henderson. (FYI, we can't follow the link http://pangeo.pydata.org/user/naomi-henderson/lab/tree/CMIP5-MakeFigure12.5.ipynb. If you would like to share a notebook from pangeo, your best bet right now is to download it and upload it as a gist.) We have discussed xesmf use cases in #197, but your example is much more concrete. It would be a great example to highlight the capabilities of pangeo to the climate community. The main roadblock right now is pangeo-data/helm-chart#29, which is holding us up from adding new packages to pangeo.pydata.org. Hopefully that will be resolved very soon, and we can move ahead with this. |
Ah, thanks @rabernat - I will wait for this to be sorted out. I will also need xESMF to finish the atmospheric moisture budget use case. The ability to do regridding is essential in dealing with the CMIP5 (and soon to come CMIP6) data. |
xesmf is now on the stock software environment on Pangeo. I believe that this is now resolved. I'm curious to see it used. |
@mrocklin fantastic! So far so good, xesmf is happily calculating the weight files and reusing them when appropriate. Thanks much |
I wonder how xesmf will work with distributed. Will it scatter the weight matrix to the workers?
|
Glad to see this! @naomi-henderson could you share your notebook (I still cannot access it)? Let me try to tweak it to support distributed. |
That would be great, @JiaweiZhuang ! The notebook runs fine (under 3 minutes) on my linux box, but unfortunately it gets bogged down at the beginning when reading the zarr data on pangeo.pydata.org. The xesmf regridding is so fast in comparison that I haven't had a chance to check how the workers are handling that part. (It is monthly data chunked (12, ny, nx) in (time, lat, lon) - is that the problem?) If you scroll down to the bottom of each gist notebook you will see that it takes 4x as long using 10 workers / 20 cores on pangeo.pydata.org as on my linux box (where the zarr-ed data is on an internal drive) with 2 threads. If anyone has words of wisdom on the data-reading / startup-delay issues, I am very open to suggestions and would very much like to get this working properly. Here is my notebook (with output) configured/run on my linux box, and here is the same notebook (with output) configured/run on pangeo.pydata.org |
cc @martindurant , putting this on your backlog if I may. I suspect that you'll enjoy diving into performance issues here. |
I am just heading out for a weekend away, but if no one beats me to it, I can look early next week.
Can you estimate the per-node data throughput, perhaps by doing a trivial operation such as sum(), versus your local disc? Network data rates >100MB/s are typical, but your disc may be orders of magnitude faster. There may be an opportunity to eat the cost just once if the data can be persisted. |
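A minimal sketch of that comparison, timing a trivial sum() over one variable read from GCS versus a local copy (the paths, bucket and variable name here are illustrative, not taken from the thread):
```python
import time
import gcsfs
import xarray as xr

def throughput_mb_s(ds, var):
    """Time a full read-and-reduce of one variable and report MB/s."""
    nbytes = ds[var].nbytes
    t0 = time.time()
    ds[var].sum().compute()          # forces every chunk to be read
    return nbytes / (time.time() - t0) / 2**20

# GCS copy (illustrative path)
gcs = gcsfs.GCSFileSystem(token='anon')
ds_gcs = xr.open_zarr(gcsfs.GCSMap('pangeo-data/CMIP5-ts/CanESM2/rcp45/', gcs=gcs))

# local copy (illustrative path)
ds_local = xr.open_zarr('CMIP5-ts/CanESM2/rcp45')

print('GCS:  ', round(throughput_mb_s(ds_gcs, 'ts'), 1), 'MB/s')
print('local:', round(throughput_mb_s(ds_local, 'ts'), 1), 'MB/s')
```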
Thanks for looking at this. I will be gone for the next week, but will do some simple benchmarking of GCS vs my local disc when I return. Thanks for showing how to get the time to list all files and get the number of files - very useful! |
I'm curious about the practice of listing files from GCS. Are there ways that we might be able to get around this? I would expect this to become an increasingly large issue as our datasets increase in size. I've heard from others that you should avoid listing buckets generally.
|
Quite some time ago now, we realised that it simply took far too long to list all the files in a bucket for data-heavy buckets, so we moved to per-directory listing. This means, though, a new call for every directory, and many calls for a deeply nested set of directories, such as zarr CDF structures seem to have. All of this aside, the general rule is still to have data chunks which are large enough that the download outweighs the overheads from filesystem operations: for newmann there is 120GB in 1930 files; for CMIP5-ts, 11GB in 23800 files |
@alimanfoo, the zarr storage of cdf-like nested group data allows access at any level in the hierarchy, which is very convenient, but I wonder whether there is any possibility of consolidating the group and attr files, so that so many list and open operations would not be required at metadata-inference time? |
👍 to this idea
@naomi-henderson: how were the cmip5 files chunked? Would it make sense to use larger chunks? Can we see the repr of one of the datasets? |
cmip5 seems to contain many datasets in directories, here is one,
and here is newmann
|
I chunked the cmip5 data by 12 months. I could certainly use larger chunks, say 20 years, but the run lengths for each of the models can be different, so not all chunks would be equal. Note that this is a small subset of the available data for even this simple 2-dimensional (lon, lat) variable, and I have only used one ensemble member per model. Ensemble members are run for different lengths of time, so they cannot be combined as in the Newmann dataset. Even if we increase the chunk size by a factor of 20 (or do no chunking at all, in which case we might as well stick with netcdf, no?), we still have more files per gigabyte of data. The 3-dimensional data (lon, lat and pressure level) will be better, but if we really want to be able to fly through heterogeneous cmip-type data (which can't be open_mfdataset-combined), we need to be able to deal with these issues. Would it be useful for me to rechunk the data, upload it and try again? |
Re performance, I can suggest a couple of possible avenues to explore, not mutually exclusive.
The first obvious suggestion is to investigate whether it is strictly necessary to locate and read all metadata files up front. If any of that could be delayed and then done within some parallel work, some wall time could be saved.
A second suggestion is to explore ways of making use of the fact that Zarr can use separate stores for data and metadata (that was @mrocklin's idea). All Zarr API calls support both a store and a chunk_store argument. There are a number of possible ways this could be leveraged. For example, data and metadata could both be stored on GCS as usual but in separate buckets, or separate directory paths within a bucket. This could reduce the cost of directory or bucket listing when locating metadata files. But you could do other things too. E.g., you could store the chunks on GCS as usual (one chunk per object), but invent some way of baking all metadata into a single GCS object, then write a custom store that serves metadata out of that one object.
Also, a final quick thought: this may not help at all, as it sounds like metadata is already being cached, but there is an LRUStoreCache in Zarr which can be used as a wrapper around either the metadata store or the data store (or both). Happy to discuss further. |
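A minimal sketch of the separate-stores idea plus the LRUStoreCache wrapper; the bucket paths here are hypothetical:
```python
import gcsfs
import zarr

# Hypothetical layout: metadata objects and chunk objects live under
# separate GCS prefixes (these bucket paths are illustrative).
gcs = gcsfs.GCSFileSystem(token='anon')
meta_store = gcsfs.GCSMap('some-bucket/cmip5-meta/', gcs=gcs)    # .zgroup / .zarray / .zattrs
data_store = gcsfs.GCSMap('some-bucket/cmip5-chunks/', gcs=gcs)  # chunk objects only

# Wrap the metadata store in an in-memory LRU cache so repeated
# metadata lookups do not go back to GCS.
cached_meta = zarr.LRUStoreCache(meta_store, max_size=2**26)

# Zarr open calls accept both a store and a chunk_store.
group = zarr.open_group(store=cached_meta, chunk_store=data_store, mode='r')
```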
@alimanfoo would it be possible to optionally collect all metadata from everywhere within a Zarr dataset and bring it into a single file at the top level? This would have to be invalidated on any write operation that changed metadata, but in the write-once-read-many case it might be helpful. |
@mrocklin yes, you could do something like this now via the store/chunk_store API without any code or spec changes in Zarr. Assuming data is prepared on a local file system first and then uploaded to GCS, the steps could be something like:
(0) Build the Zarr files and upload to GCS as usual (already done).
(1) Decide how to pack metadata from multiple files into a single file. There are lots of ways you could do this; e.g., you could just build a big JSON document mapping the relative file paths to their contents.
(2) Write a script to crawl the directory structure for metadata files and build the combined metadata file.
(3) Upload the combined metadata file to GCS, putting it at the root of the hierarchy under a resource name like ".zmeta" (or wherever you like).
(4) Implement a Mapping interface that serves metadata keys out of the combined file.
Then you could open a Zarr hierarchy by passing that Mapping as the store and the usual GCS mapping as the chunk_store (see the sketch after this comment).
For access via xarray, the xarray API may have to be changed to pass through both store and chunk_store args. N.B., this assumes data are read-only once in GCS, so there is nothing like invalidation of the combined metadata file; but for this use case I don't think that would be necessary. Happy to elaborate if this isn't making sense yet. Btw, also happy to discuss spec changes and/or features to support this kind of thing natively in Zarr, but trying to exhaust all possibilities for hacking this without spec or code changes first. |
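A minimal sketch of steps (1)-(4), assuming the hierarchy is available on a local file system for packing; the ".zmeta" key name and the helper names are illustrative and not part of Zarr itself:
```python
import json
import os

METADATA_NAMES = ('.zgroup', '.zarray', '.zattrs')

def pack_metadata(root_dir):
    """Crawl a local Zarr hierarchy and bake every metadata file into one JSON document."""
    combined = {}
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if name in METADATA_NAMES:
                path = os.path.join(dirpath, name)
                key = os.path.relpath(path, root_dir).replace(os.sep, '/')
                with open(path) as f:
                    combined[key] = json.load(f)
    return json.dumps(combined)   # upload this single document to GCS, e.g. as '.zmeta'

def unpack_metadata(blob):
    """Turn the packed JSON document into a plain dict, usable as a read-only Zarr store."""
    return {key: json.dumps(value).encode() for key, value in json.loads(blob).items()}

# The dict answers all .zgroup/.zarray/.zattrs lookups from memory, while a
# gcsfs.GCSMap on the bucket still serves the chunk objects:
#   group = zarr.open_group(store=unpack_metadata(blob), chunk_store=gcs_map, mode='r')
```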
From the pangeo perspective, reading such a combined metadata file would presumably require (subtle) changes in xarray code. |
On the xarray side, in theory all you'd have to do is allow a user to provide both a store and a chunk_store. One other thing I noticed: some of the exists calls look odd - I can't think why some of them would be needed, or why an exists would get called on both array and group at the same path.
|
Yeah, I can't say where those odd exists calls are coming from, but of course only files that are found are loaded, which can still be many small files. I suppose if .zarray doesn't exist, the code next tries .zgroup. It should be pretty simple to make a function to collect all the group/array/attrs into a single JSON blob and store it in the top-level directory, and have xarray look for it and use it as the chunks_store if available. The function would only be called rarely, by users who know the data set is to be read-only. The function itself could live in xarray or zarr itself, but the xarray code would need to change to, at the minimum, allow chunks_store or (better) look for the aggregated JSON file. In the case in hand, where there are many small datasets not in a single group structure, all this may not make a massive difference. |
@naomi-henderson , there are two concrete things here that can be done: 1) implement zarr metadata consolidation, 2) rechunk your data into larger pieces. I intend to work on 1), but 2) can help you more immediately, even though the total file sizes are not very big, as you pointed out. |
> Yeah, I can't say where those odd exists calls are coming from, but of course only files that are found are loaded, which can still be many small files. I suppose if .zarray doesn't exist, the code next tries .zgroup.

It may be that zarr is doing more exists checks than it needs to, and/or more loading of metadata files than it needs to. If at any point you think it's worth digging into this, happy to look into it too. Would just need to capture the zarr API calls being made by user code and the corresponding list of calls being made at the storage layer.

> It should be pretty simple to make a function to collect all the group/array/attrs into a single JSON blob and store it in the top-level directory, and have xarray look for it and use it as the chunks_store if available.

FWIW, initially I'd be inclined to just expose the chunk_store arg in xarray and leave everything else to user code for the moment, i.e., not yet fix on any convention for where the consolidated metadata file is stored or what format the file uses. Would be good to have some flexibility to explore options.

> In the case in hand, where there are many small datasets not in a single group structure, all this may not make a massive difference.

Yep, fair enough.
|
@martindurant I have rechunked the data to 120 months instead of 12, with better performance - well, at least so that the open_zarr step takes about the same time as the first reduction (computing 12-month averages). The file-system listing to find the models is negligible compared to the open_zarr step. Here are typical results, but note that the open_zarr step for the smaller-chunked case is very unpredictable, sometimes pausing for > 1 minute between reads of one model and the next (different models each time). On pangeo.pydata.org with the default 20 workers:
For comparison (perhaps unfairly), on my linux machine (32 threads available):
|
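A sketch of this kind of rechunk-and-rewrite step in xarray; the paths are illustrative, and the encoding reset is an assumption about how to make sure the new chunking, not the source store's, gets written out:
```python
import xarray as xr

ds = xr.open_zarr('CMIP5-ts/CanESM2/rcp45')   # originally 12 months per chunk
ds = ds.chunk({'time': 120})                  # rechunk to 120 months per chunk

# Drop the chunk encoding inherited from the source store so the
# new dask chunking is what gets written to disk.
for name in ds.variables:
    ds[name].encoding.pop('chunks', None)

ds.to_zarr('CMIP5-ts-120chunk/CanESM2/rcp45', mode='w')
```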
btw, why do the permissions on the .zgroup and .zattrs files, created by to_zarr, need to be so restrictive? Would it cause trouble if I allowed group/other read access?
|
That looks pretty odd, and I'm sure making them broader is fine. Indeed, actually there is an argument in general to make them all read-only for typical write-once datasets. |
^ and, of course, it is generally more worthwhile attempting higher compression ratios for slower bandwidth storage - but that only affects the download time, not the connection overhead. I wonder if we should have a utility function that can estimate the disc size of some array for a few common compression configs and/or encodings. |
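A rough sketch of such a utility using numcodecs, compressing a sample of the array with a few common codecs and extrapolating the on-disk size; the codec choices and sample size are just assumptions:
```python
import numpy as np
from numcodecs import Blosc, Zlib, Zstd

def estimate_compressed_size(arr, sample_bytes=16 * 2**20):
    """Compress a contiguous sample of `arr` with a few common codecs and
    extrapolate the on-disk size (in bytes) of the full array for each."""
    flat = np.ascontiguousarray(arr).ravel()
    n = min(flat.size, max(1, sample_bytes // flat.itemsize))
    sample = flat[:n].tobytes()
    codecs = {
        'blosc-lz4': Blosc(cname='lz4', clevel=5),
        'blosc-zstd': Blosc(cname='zstd', clevel=5),
        'zlib-1': Zlib(level=1),
        'zstd-3': Zstd(level=3),
    }
    scale = flat.nbytes / len(sample)
    return {name: int(len(codec.encode(sample)) * scale) for name, codec in codecs.items()}

# e.g. estimate_compressed_size(ds['ts'].isel(time=slice(0, 120)).values)
```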
FWIW, if the size of the network pipe on Google Cloud is ~100 Mb/s, then any compressor that decompresses faster than that is probably worth a try for maximum speed. The benchmark plot I have in mind was made on a particular type of genomic data and so may not translate to geo data, but it gives a general sense of the relative performance of different compressors (the numbers next to each bar, like "61.1X", are the compression ratios). |
@naomi-henderson , is it useful for me to replicate your data to another GCS location and make consolidated metadata files, and point you to a version of zarr that would make use of them? That might solve your immediate problem and provide good motivation for the consolidation PR over on zarr. I can do this too, but I don't know when I would get to it. Note that I do not expect metadata consolidation to have much of an effect when working with local data. |
yes @martindurant , I think that would be very helpful - I am curious whether it is the repeated metadata requests or the latency that is the main issue here. What do you need in order to consolidate the metadata, the original netcdf or the chunked zarr files? By the way, I am currently using the 120-month chunked data (in pangeo-data/CMIP5-ts/120chunk/), but since the resolution varies from model to model, this does not guarantee a fixed chunk size. Note that I could make the chunks a more uniform size (e.g., @rabernat 's 20M) if I allow the number of months to vary from model to model based on the resolution of the data. Here are the chunk sizes if I use 120 months per chunk:
|
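For reference, a quick way to estimate the uncompressed size of one such (time, lat, lon) chunk; the grid dimensions below are hypothetical:
```python
def chunk_size_mb(nmonths, nlat, nlon, itemsize=4):
    """Uncompressed size of one (time, lat, lon) chunk in megabytes."""
    return nmonths * nlat * nlon * itemsize / 2**20

# e.g., a 128 x 256 grid with 120-month chunks of float32 data:
print(round(chunk_size_mb(120, 128, 256), 1), 'MB')  # 15.0 MB
```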
@naomi-henderson , you could install my zarr branch (consolidate_metadata); the function to run is consolidate_metadata. Of course, it would be best not to run the consolidation on any original data - only on stuff no one else will be using, or that can be easily re-made, should the process somehow break what is already there. |
thanks @martindurant ! I think I need to down-rev my xarray? Your pip install worked, but then to_zarr throws the error:
What version of xarray are you using? I am at '0.10.7' |
That zarr appears to have version 0.4.1.dev843 |
ah, your zarr thinks it is 0.4.1.dev843 ? |
Huh, I have no idea where 0.4.1.dev843 comes from. You could instead grab the repo:
```
conda create -n testenv python zarr xarray gcsfs ipython -c conda-forge
conda activate testenv
git clone https://github.com/martindurant/zarr
cd zarr
git checkout consolidate_metadata
python setup.py install
ipython
```
```python
In [ ]: import gcsfs
   ...: import xarray as xr

In [ ]: %%time
   ...: gcs = gcsfs.GCSFileSystem(token='anon')
   ...: mapper = gcsfs.GCSMap(gcs=gcs, root='mdtemp/CMIP5-ts/CanESM2/rcp45/')
   ...: d = xr.open_zarr(mapper)
```
|
@martindurant I suspect that if you push tags to your fork then this problem will go away. Sometimes with versioneer it bases the version on the most recent tag. If you don't have recent tags in your fork then your version can appear to be quite old. |
Thanks, @mrocklin , that's it
|
@naomi-henderson if you pip install again all should be well |
zarr-2.2.1.dev6 it is! thanks @martindurant and @mrocklin |
@martindurant , perhaps I don't quite understand the process. If I already have the zarr data, then I should be able to run consolidate_metadata and it should add a file called .zmetadata
|
I should have been clearer - the function works on a mapping, so for local paths you would pass a mapping over the local directory, and for working directly on GCS you would pass a GCS mapping (a sketch along these lines follows below).
|
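A hedged sketch of those two invocations, assuming the branch exposes the same zarr.consolidate_metadata(store) call that later zarr releases do; the paths are illustrative:
```python
import gcsfs
import zarr

# Local zarr data: consolidate metadata in a directory store.
local_store = zarr.DirectoryStore('CMIP5-ts/CanESM2/rcp45')
zarr.consolidate_metadata(local_store)   # writes a combined '.zmetadata' key into the store

# Working directly on GCS: consolidate metadata in place in the bucket
# (requires write access, so no token='anon' here).
gcs = gcsfs.GCSFileSystem()  # uses your default credentials
gcs_store = gcsfs.GCSMap('pangeo-data/CMIP5-ts/120chunk/CanESM2/rcp45/', gcs=gcs)
zarr.consolidate_metadata(gcs_store)
```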
thanks @martindurant , it is working like a charm. With consolidated metadata:
Original data
|
Wow, @martindurant , I can't wait to use this on pangeo.pydata.org ! I can't start a server this afternoon for some reason, though. Working directly on GCS from my Columbia linux server I see a factor of 4 improvement on opening the 49 historical model simulations: ~4 min (original zarr) and ~1 min (consolidated metadata). The first reduce step (annual means), which touches all of the data, takes about the same time for original and consolidated. For now I can use your zarr to fix the data, thanks! |
No, currently it will cast distributed dask arrays to in-memory numpy arrays. In @naomi-henderson's example, the original data were reduced significantly (taking a time average) before being passed to xESMF; at that stage there's no need of distributed computing at all. If the workflow is to reduce a large amount of data and finally make some 1D and 2D plots, I'd argue that distributed regridding is rarely needed - it is much better to take the average before regridding. Distributed regridding would only be useful for a read-regrid-write workflow, to pre-process large volumes of data for later analysis. See more at #334. |
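For concreteness, a hedged sketch of the reduce-then-regrid pattern described here; the path, variable name and target grid are illustrative:
```python
import xarray as xr
import xesmf as xe

ds = xr.open_zarr('CMIP5-ts/CanESM2/rcp45')        # illustrative path
ts_mean = ds['ts'].mean(dim='time').compute()      # reduce first; the time mean fits in memory

target = xe.util.grid_global(1.0, 1.0)             # common 1-degree global grid
regridder = xe.Regridder(ds, target, 'bilinear')   # writes a weight file that can be reused
                                                   # later via reuse_weights=True
ts_common = regridder(ts_mean)                     # regrid the in-memory, numpy-backed array
```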
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
We should follow up on this discussion. Has there been any progress on either the xesmf side or the zarr side that brings us closer to making this use case work well? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
Hi all,
I have put together a use case for reproducing the canonical IPCC AR5 Figure 12.5 using the CMIP5 multi-model surface temperature data. I have converted the needed netcdf files to zarr and uploaded them to pangeo.pydata.org. I just noticed that the xesmf package has not been included, so I can't do the fast and easy spatial regridding provided by xesmf to combine all models on the same grid. See:
use case on pangeo.pydata.org.
Any chance we can add xesmf to the standard packages?
Here is the figure this notebook can reproduce: