-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reading mult-tile output. #28
Comments
This is a great suggestion @dhruvbalwada. Also partially related to #27 raised by @braaannigan. |
Any update on this? I am doing a large run and need to write tiled outputs. It would be great to have this capability! |
Thanks for the ping Matt! What we are doing with llcreader right now is considerably more complicated than reading multi-tiled input. However, matt, it sounds like instead your are interested in writing multi-tiled output. Can you confirm that? What is the source of the data you want to write? Are you calculating it in python? Reading from another file? In both cases, we might be able to hack zarr to do this for us. A zarr store with no compression is equivalent to MDS files on a binary level. |
Sorry for the vague and ambiguous language! I meant that I need to have the mitgcm write to tiled outputs. I would like xmitgcm to be able to read tiled outputs (e.g. of the format T.0000000000.???.???.data). |
Ok so this is something I think I can work on in the next few weeks. What would be very helpful would be to have a test dataset with some example output I could use to debug. Ideally this dataset would be as small as possible (< 5 MB). Just the bare minimum of files needed to try it out. I would need both |
I can easily provide these files from a tutorial. |
Sure, email is fine. We keep our test data here: https://figshare.com/collections/xmitgcm_test_data/4362224 Thanks a lot for your help! |
Turns out its 3.8MB. Find it here: |
Sorry - made it smaller - no need to have it 3.8MB |
Hi all, I'd just like to +1 on this as a very useful feature if it can be implemented. Thanks! |
Hi all, another ping for this feature to be added. At MITgcm we would like to push users into using xmitcgm as the de facto approach to reading in output... happy to help w/testing (tutorial I'm currently working on writes tiled output). Thanks! |
Hello everyone, This feature is something I'd really like implemented, and over the next few weeks I think I should be able to set some time aside to do this. Before I jump in though, I wanted to check a few things:
All the best, |
@fraserwg I think this would be a fantastic thing to work on. The interface you mention in (3.) sounds reasonable. The raw tiled files together with their .meta files are quite similar in spirit to zarr/dask chunking ideas, which xarray understands. @jahn may have some python that could help with ingesting tiled output in ways that can then be mapped to zarr/dask chunking specification. |
Thanks, I've got started and want to write some tests for the code I have so far. Would it be possible to upload the following file to the test data repository as something like When unzipped the directory contains output from a variation of the barotropic gyre tutorial with two levels on a tiled grid, including both grid and data files. |
@fraserwg - yes I can do that later. I was thinking a bit more too. An alternate approach might be to have a script/program that took MITgcm tiled and made it into a Zarr style store. I am not sure whether that is practical, but it might be worth a quick chat/zoom? |
It's great to hear that folks are interested in working on this. I would be thrilled to support the effort.
I think I am the only one who has really volunteered in the past, and obviously I have not actually done it yet. It is unlikely I would have sufficient time any time soon. 🤦
It's not particularly hard to simply read a tiled binary array lazily into a dask array. The hard part is dealing with all of the legacy xmitgcm code around that. Over time, we have had to introduce a lot of complexity into this package, mostly to deal with special cases related to LLC, cubed-sphere, and (the worst!) ASTE grids. One relatively straightforward way to implement this would be to refactor the Lines 87 to 89 in c793ecc
The "virtual zarr" file approach is very appealing in terms of its simplicitly. However, having worked heavily on both Zarr and xmitgcm, I see a major roadblock here: multi-variable outputs. The existence of multi-variable outputs (e.g. a diagnostic file that has 10 different variables packed into a single tile), breaks the simple 1:1 correspondence between MITgcm mds tiles and zarr chunk files. You would need an additional layer of abstraction that deals with this. @christophernhill - is that something you and Oliver have already dealt with? |
I haven't done multiple variables per file with zarr. I would just treat the records/variables as an extra dimension and make them visible as variables at a higher level (as I think @rabernat suggests). That's work, of course. I have a hacky prototype for reading lots of time records (not tiled!) via zarr and dask. It subclasses the zarr DirectoryStore to recognize non-consecutively numbered files as chunks. The routine mdsfiles creates a .zarray file and then opens the dataset using the new store. Something similar might be useful for tiled files. This version doesn't read any metadata, though. Ranges of iteration numbers and per-file shapes have to be provided. One could extract these from meta files. This was mostly for being able to read millions of files into dask in a reasonable time without having to reorganize them. For this to be maintainable in the long term, one would have to get any custom stores integrated into zarr. This is the script in case it could be useful: mdszarr.zip |
This is great, thanks for sharing!
There is zero chance of and MITgcm-specific store being added to zarr-python proper. But we could provide a third-party store implementation that can interoperate with Zarr and use it internally in xmitgcm. FWIW, I feel like this is a big enough departure from the current xmitgcm implementation to warrant a brand new package, e.g. mds-zarr or similar. |
It wouldn't have to be that MITgcm-specific. Flexible number formatting and >1 step size for chunk names would be sufficient. I understand that this may still be a stretch. If there's a procedure for third-party stores that would be good. |
Thanks @christophernhill, @jahn and @rabernat for these suggestions. I've spent a bit of time today exploring/playing around with the source code and have started to go down the route of modifying the I've managed to get the |
There are two options here. As you suggested, you could loop over the tiles and build up a dask dataset using dask.array.concat / dask.array.stack etc. A slightly faster option would be to figure out how to build the entire dask graph in one go. This is how llcreader works. xmitgcm/xmitgcm/llcreader/llcmodel.py Lines 689 to 718 in 3b341ad
|
@fraserwg great - I thinking looping over tiles is fine to start. The Desk and friends graph is nice, but it is not clear it is always worth the effort at this stage - in practice can be a little unpredictable and doesn't always perform as well as one would hope, it sometimes gets in its own ways buffering and juggling lots of pieces. I think if it was me I would do a deterministic behavior concat/stack thing as a first capability. I think that alone would be a big help for many people and make the world a better place 🙏 🥰 . I would be happy to zoom, try some code etc... if useful? |
@christophernhill – a zoom call at some point could be useful. I've changed my approach slightly, so now I'm concatenating the tiles earlier on, in order to reduce code duplication and increase performance. I've just got a prototype working that doesn't seem to break the behaviour of the original code. Loading a tiled dataset set should now be as simple as calling The code could definitely do with some tidying and is currently untested on anything other than the example folder I sent over the other day. It currently doesn't have any tests and the documentation of various functions needs updating - this should be relatively simple to do, but I'm away a fair bit over the next few weeks so this might take a bit of time. After I've set some tests up (or possibly whilst I'm doing that) it would be useful to see how it performs on real life examples - if anyone in this thread is interested in trying the code I'm more than happy to share it in its unpolished form (it's available as a branch on my fork of xmitgcm at the moment). |
For folks following this thread, I have started experimenting with exposing mds data via a zarr adaptor. Proof of concept ie here: https://gist.github.com/rabernat/0149106f271ad384af0c9ad102196621 I definitely think we should rewrite xmitgcm using this new framework. It will be a very different package at the end of that. But none of these tools (zarr, fsspec, reference maker, etc.) existed when I started making xmitgcm. |
I had a similar idea, and posted it in #337 before @cspencerjones pointed me here. The difference is that since the discussion in this thread then (1) kerchunk was made (starting early 2021) (2) virtualizarr was made (3) there is active discussion about chunk manifests becoming part of zarr upstream. Essentially it's still the same basic idea (create a mapping from zarr chunks to chunks of data in mds files), but the more shared infrastructure for creating and managing that mapping (the "chunk manifest") the better. The suggestions here can be thought of as writing a kerchunk / virtualizarr "reader", which inspects mds data and outputs chunk manifests.
I don't know that much about the mds format, but I would think the chunk manifests can deal with this, so long as individual variables are not compressed together (i.e. so long as individual variables live in different chunks, it's okay if those chunks are in the same file). See zarr-developers/VirtualiZarr#218. |
Hello,
I was trying to read multi-tile output using xmitgcm. Files with names such as Varname.iternum.tilenum1.tilenum2.data/meta. xmitgcm does not seem to have the capability to read this output right now, instead users need to output data in single file format (using flag singlecpuio=.TRUE. in data file). It will be nice if someone can add the capability to read tiled data.
Thanks,
Dhruv
The text was updated successfully, but these errors were encountered: