New chunking tutorial #442

Open
adele-morrison opened this issue Aug 15, 2024 · 11 comments

@adele-morrison
Collaborator

In @Thomas-Moore-Creative's COSIMA talk, there was a lot of interest in having a tutorial in the Recipes showing best practice for chunking for different types of problems.

For example: which dimensions to chunk along for a given problem, how big chunks should be, and so on.

This might be a good starting point to base this on:
https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/chunking.html
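As a rough illustration of the "how big should chunks be" question, the sketch below estimates the in-memory size of a single chunk for two candidate layouts. The grid shape, the dimension names, and the 4-byte (float32) item size are assumptions chosen for illustration, not taken from any particular COSIMA dataset. A commonly quoted rule of thumb for Dask is chunks on the order of 100 MB, but the right size depends on the problem and the memory available per worker.

```python
from math import prod

def chunk_size_mib(chunks, itemsize=4):
    """Size in MiB of one chunk, given per-dimension chunk lengths.

    itemsize is the number of bytes per value (4 assumes float32).
    """
    return prod(chunks.values()) * itemsize / 2**20

# Hypothetical layout A: one time step per chunk, full horizontal field.
# This suits operations that work on whole maps.
map_chunks = {"time": 1, "yt_ocean": 2700, "xt_ocean": 3600}

# Hypothetical layout B: a full year of daily steps per chunk, small
# horizontal tiles. This suits operations along the time dimension.
series_chunks = {"time": 365, "yt_ocean": 270, "xt_ocean": 360}

for name, chunks in [("maps", map_chunks), ("time series", series_chunks)]:
    print(f"{name}: {chunk_size_mib(chunks):.1f} MiB per chunk")
```

Both layouts describe the same data; they differ only in which dimension is kept contiguous within a chunk, which is the choice a tutorial would need to motivate per problem type.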

@Thomas-Moore-Creative
Collaborator

@adele-morrison - this is a really good opportunity to talk about community and others' contributions. It's a rather circular path of influences. In 2017 it was, as I understand it, COSIMA that brought @jmunroe to Australia, and he then helped @dougiesquire and me get Pangeo-style workflows deployed on CSIRO HPC. We became very interested in chunking and chunking strategies because we had a very large (and large-ensemble) dataset to deal with. @dougiesquire took a proactive lead on testing strategies for rechunking and the Zarr format, and I learned from that effort. It's very appropriate that you link @dougiesquire's basic notes on chunking.

What would be great is if someone at COSIMA had a real problem that all those interested could work through. The solutions are general but the details really matter (IMO) and going through the process together with a real problem might be a good way to start?

@Thomas-Moore-Creative self-assigned this Aug 15, 2024
@Thomas-Moore-Creative
Collaborator

@ongqingyee, @jemmajeffree, et al

Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.

@jemmajeffree
Collaborator

I agree that focussing on a specific problem is important. I don't think I have any current problems that would be useful for this. I'm currently working with 2D fields in large ensembles, and have thoroughly optimised importing these with xarray, but they were already chunked in a useful dimension anyway. I think, given the COSIMA output is currently chunked {time:1}, then the best example is probably something through time.

With 2D fields and <250 yrs of monthly data, my approach is usually just to haul the whole thing into memory and then rechunk for analysis. For rechunking as a separate step to make a difference, we'd probably need to either use daily data or deliberately restrict ourselves to a few cores.
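To make the point about `{time:1}` chunking concrete, here is a minimal sketch that counts how many chunks a read has to touch under two layouts. The grid shape, the dimension names, and the rechunked layout are hypothetical, and the helper assumes the selection starts at index 0 in every dimension so that it lines up with chunk boundaries.

```python
from math import ceil

def chunks_touched(selection, chunks):
    """Count the chunks overlapped by a selection starting at index 0.

    selection and chunks both map dimension name -> length in grid points.
    """
    n = 1
    for dim, length in selection.items():
        n *= ceil(length / chunks[dim])
    return n

n_time = 250 * 12  # hypothetical: 250 years of monthly output

stored = {"time": 1, "y": 2700, "x": 3600}        # one time step per chunk
rechunked = {"time": n_time, "y": 100, "x": 100}  # long in time, small tiles

point_series = {"time": n_time, "y": 1, "x": 1}   # time series at one point
single_map = {"time": 1, "y": 2700, "x": 3600}    # one full snapshot

print(chunks_touched(point_series, stored))     # 3000
print(chunks_touched(point_series, rechunked))  # 1
print(chunks_touched(single_map, stored))       # 1
print(chunks_touched(single_map, rechunked))    # 972
```

Neither layout wins for both access patterns, which is why an example "through time" on output stored one time step per chunk is a natural test case.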

I'd be interested in helping develop this tutorial, but I'm going to be a bit slow and unreliable while I'm still building up the courage to engage with the COSIMA GitHub.

@ongqingyee
Collaborator

> @ongqingyee, @jemmajeffree, et al
>
> Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.

I have a simple example that could work well. It involves masking out a region around the Antarctic margin and calculating a circumpolar-averaged surface speed. In my experience, masking and integrating on an xgcm grid makes applying chunking trickier. I'm happy to put together a draft notebook to start with, and others can then make changes or additions. @Thomas-Moore-Creative @jemmajeffree
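For readers unfamiliar with this kind of diagnostic, the following is a bare-bones sketch of a masked, area-weighted mean surface speed. It is not the notebook's method: it uses plain Python lists rather than xarray or xgcm, a simple latitude threshold as a stand-in for a real Antarctic-margin mask, and cos(latitude) as a stand-in for real grid-cell areas. All names and values are hypothetical. Its only purpose is to show the reference result that a chunked version of the calculation should reproduce.

```python
from math import cos, hypot, radians

def masked_mean_speed(lats, u, v, south_of=-60.0):
    """Area-weighted mean of sqrt(u^2 + v^2) over rows with lat <= south_of.

    lats: latitudes in degrees, one per row.
    u, v: 2D lists indexed [lat][lon] of velocity components (m/s).
    Weights are cos(latitude), an assumed proxy for grid-cell area.
    """
    total = 0.0
    weight_sum = 0.0
    for lat, u_row, v_row in zip(lats, u, v):
        if lat > south_of:  # row lies outside the masked region
            continue
        w = cos(radians(lat))
        for ui, vi in zip(u_row, v_row):
            total += w * hypot(ui, vi)
            weight_sum += w
    return total / weight_sum

# Tiny made-up field: three latitude rows, two longitudes each.
lats = [-70.0, -65.0, -50.0]
u = [[0.3, 0.3], [0.0, 0.0], [1.0, 1.0]]
v = [[0.4, 0.4], [0.0, 0.0], [1.0, 1.0]]

# The row at -50 is excluded by the mask, so it does not affect the mean.
print(masked_mean_speed(lats, u, v))
```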

Working on the same notebook through github is new to me though, so help on the logistics of that would be great.

@Thomas-Moore-Creative
Collaborator

> > @ongqingyee, @jemmajeffree, et al
> > Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.
>
> I have a simple example that can be good. It involves masking out a region around the Antarctic margin and calculating a circumpolar averaged surface speed. In my experience I found that masking and integrating on an xgcm grid makes applying chunking trickier. I'm happy to put together a draft notebook to start and changes/additions can be made? @Thomas-Moore-Creative @jemmajeffree
>
> Working on the same notebook through github is new to me though, so help on the logistics of that would be great.

I'm a bit of a COSIMA outsider, so others (@navidcy? @anton-seaice? others) might have something to say about where best to put your new example notebook in the repo and what the practice is for branching. FWIW, I'd suggest you start a new branch for this issue, and others can then contribute via their own branches off your branch. Again, COSIMA regulars might have other views.

Your problem does seem to have a lot of detail so it would be good to see the code, what the source data is, and the goal for the final output. Thanks.

@navidcy
Collaborator

navidcy commented Aug 16, 2024

What do you mean "branching" and "branches of your branch"?
You are referring to repository branches?

The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

@Thomas-Moore-Creative
Collaborator

> What do you mean "branching" and "branches of your branch"? You are referring to repository branches?
>
> The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

Hopefully I haven't already added confusion for @ongqingyee who was looking for more clarity around githubby things. =)

@edoddridge
Collaborator

@Thomas-Moore-Creative is describing some more advanced GitHub techniques than we generally use in this repo.

When someone opens a PR, their suggested changes live on a branch in their repo. If I want to make changes to their PR (and I don't have write access to their PR branch), I can open a pull request against the branch behind their pull request. The original PR owner can then merge my PR into theirs, and then we can merge their PR into the main repo.

As an example, you can look at this PR in MITgcm: MITgcm/MITgcm#47
Gael Forget and Erik van Sebille both made pull requests onto my PR branch. Those changes were then incorporated into the PR.

@navidcy
Collaborator

navidcy commented Aug 16, 2024

Oh I see what you mean @Thomas-Moore-Creative

@Thomas-Moore-Creative
Collaborator

> Oh I see what you mean @Thomas-Moore-Creative

I think the most important point is that, whatever the GitHub practice is, it's simple enough and/or supported enough that newbies can engage and make it to the next level of their GitHub life.

@ongqingyee
Collaborator

> What do you mean "branching" and "branches of your branch"? You are referring to repository branches?
>
> The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

This I can do. Thanks all!
