Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add .sel and .isel equivalent that returns all vars on all grids related to the indexers? #200

Open
willirath opened this issue Apr 30, 2020 · 23 comments

Comments

@willirath
Copy link
Contributor

I'm working on a project that aims at providing observation-like sampling for Ocean model data living on a C grid. What we aim at is keeping the C grid logic intact as long as possible. For this, something like

ds.xgcm.isel(X=3, Y=5)

returning a full dataset restricted to x_c=3, x_l=3, x_r=3, y_c=5, y_l=5, y_r=5and a correspondingds.xgcm.sel(...)` would be ideal.

A related question is: Which grid points are "related" to a given position?

If we consider the tracer grid of a finite volume model the basis of our selection, then for a complete description of the dynamics of the selection, we'd not only need the vars defined on the faces for one side (per dim), but for both sides of the box.

(Thanks @jbusecke for the chat yesterday. Definitely helped me getting a clearer idea on what I'm looking for.)

@jbusecke
Copy link
Contributor

jbusecke commented Apr 30, 2020

I think this would be a really nice addition to xgcm. I have been looking for this functionality for a long time to subset large datasets. This could also be a partial solution to e.g. the issue @miniufo mentioned in #193.

I would propose that a call could look like this

grid_sub, ds_sub = grid.isel(ds, x_c=3, y_c=slice(0,5), to=outer’, boundary=fill’)

By using the actual dimension (x_c) and a specifier (not sure if to is the most intuitive), the user has maximum flexibility.
In your case you want to select based on tracer points but get both neighboring velocity points. ‘outer’ will do the trick and also keep existing xgcm lingo for clarity.
If for some reason, another user wants to select based on velocity points, and the select only the tracer point to the left, that is also possible.

xgcm should set some reasonable defaults for to and boundary based on the original grid object to keep a simple sub setting command short.

Internally I’d like to wrap all of the isel/sel logic from xarray (profiting from possible future improvements) and just getting a logical index from there. Xgcm then would have to deal with constructing matching indicies for other dimensions along the same axis.

@rabernat
Copy link
Contributor

Good idea. this would work well in combination with #197 (xarray accessor). Then we could do

ds.grid.isel(X=slice(5, 10))

I slightly prefer @willirath's API over @jbusecke's. The whole point of this is to not have to worry about the different dimension names and be able to select on an axis.

@jbusecke
Copy link
Contributor

Fair point. Could we maybe add some more keyword to allow the flexibility if wanted? e.g. from could default to center but also be left/right?

This would be hidden for the casual use, but still make the use case, I described possible?

Would love to see this working with the accessor. What would the output of that call be? Just a dataset or both a new grid and ds?

@dcherian
Copy link
Contributor

This lines up with the cf xarray discussion: pangeo-data/pangeo#771.

@dcherian
Copy link
Contributor

dcherian commented Jun 14, 2020

If you have the axis attribute set, you can now do this with cf-xarray: https://cf-xarray.readthedocs.io/en/latest/examples/introduction.html#Slicing

EDIT: that's not exactly right. It won't take into account the grid layout. cf_xarray implements the grid-unaware version of this feature

@jbusecke
Copy link
Contributor

Oh this is super cool. I will try to have a closer look soon. Do you think this will overlap in functionality with xgcm? I will have to dive deeper into cf-xarray, but want to make sure that we are not duplicating efforts here.

@jbusecke
Copy link
Contributor

I am still very keen on having this feature. Just wanted to drop a usecase that I encountered the other day:
It would be awesome if xgcm.sel() could retroactively figure out how to select fields on other grid locations. Say I have a tracer field and selected a subset, I think it would be amazing if there would be a way to select the velocity fields in a coherent way. This might however be a bit more complicated, depending on how the dimension labels are given. Just wanted to drop this here, so it doesnt get lost.

@dcherian
Copy link
Contributor

dcherian commented Mar 5, 2021

See pydata/xarray#4979 (comment)

@github-actions
Copy link

This issue has been marked 'stale' due to lack of recent activity. If there is no further activity, the issue will be closed in another 30 days. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jun 11, 2021
@github-actions
Copy link

This issue has been closed due to inactivity. If you feel this is in error, please reopen the issue or file a new issue with the relevant details.

@jbusecke
Copy link
Contributor

I think this is still relevant. Actually this might be interesting for your research project @jdldeauna. I think you wanted to subset a global model in a certain region?

@jbusecke jbusecke reopened this Jun 24, 2021
@jbusecke jbusecke added keepOpen and removed Stale labels Jun 24, 2021
@jdldeauna
Copy link
Contributor

Yeah I do a lot of subsetting, so it would be great to integrate that with xgcm. :)

@jdldeauna
Copy link
Contributor

I think this would be a great feature to add to xgcm! I had a few questions:

  1. To clarify, this would be a method under the Grid class (i.e., usage would be grid.sel() and/or grid.isel())?
  2. Maybe we could also consider using lat/lon as inputs for defining subset limits (aside from axis indices)? But then make sure that the right lat/lon grids are being accessed. So for example, grid.sel(ds.uvelocity, lat=[25,40]) would automatically access lat_u, while tracers would access defaults (grid.sel(ds.temperature,lat=[25,40]) would use lat). I realize this would involve using a lot of xarray methods "under the hood" but I think its worth making it more intuitive for simple-case users (like me 😅 ).
  3. There is an available test for using GCMs (test_gcm_dataset.py) but it's marked as deprecated. Could it still be used, and is the GCMDataset still available? Alternatively, is it advisable to write tests which use intake to download GCM datasets for testing, or would that add complexity we wouldn't want?

@jbusecke
Copy link
Contributor

Hey @jdldeauna,

  1. Yes I think this should live in the grid object as a method
  2. I think it would be even more powerful if we can implement this on a full dataset (see example below). Regarding the lat/lon selection: I am not sure how to even implement a general lon/lat selection without a mask. On a regular lon/lat grid this is super easy, and you can figure out the indicies before passing them to grid.isel() but once you have curvilinear coordinates there have to be 'gaps' in your resulting data.

I envision the API to be something like this ds_new, grid_new = grid.isel({'X':slice(a,b), 'Y':slice(c,d)}). I think the default should always apply the selection based on the center position and select all other positions to result in an outer:

Lets say we have a setup like this along a single axis X

Tracer(center)
 |------o-------|------o-------|------o-------|------o-------|
       [0]            [1]            [2]            [3]

Velocity(left)
 |------o-------|------o-------|------o-------|------o-------|
[0]            [1]            [2]            [3]

I think grid.isel({'X':slice(1,2)}) should return:

Tracer(center)
 |------o-------|------o-------|------o-------|------o-------|
                      [1]            [2]            

Velocity(outer)
 |------o-------|------o-------|------o-------|------o-------|
               [1]            [2]            [3]

This to me is the most intuitive default because it creates velocity cells 'surrounding' the tracer.

Of course this can be modified:
grid.isel({'X':slice(1,2)}, pos_mapping={'center':'center', 'left':'left'})

Would results in:

Tracer(center)
 |------o-------|------o-------|------o-------|------o-------|
                      [1]            [2]            

Velocity(left)
 |------o-------|------o-------|------o-------|------o-------|
               [1]            [2]          

These choices are then automatically parsed into the new grid object.

I think this would be a huuuge first step to accomplish. The lon/lat selection could then maybe build on top of that, but this needs to work first IMO.

@jbusecke
Copy link
Contributor

There is an available test for using GCMs (test_gcm_dataset.py) but it's marked as deprecated. Could it still be used, and is the GCMDataset still available? Alternatively, is it advisable to write tests which use intake to download GCM datasets for testing, or would that add complexity we wouldn't want?

This might be worth raising another issue? just to keep the discussion here focussed?

@dcherian
Copy link
Contributor

I agree. let's do isel first. sel can then call isel.

pos_mapping={'center':'center', 'left':'left'}

I feel like this should be the default. i.e. preserve the positions. For your first case, shouldn't this be {"X": {"left": "outer"}} i.e. we should have to specify this per-axis and positions that don't change need not be specified.

@jdldeauna
Copy link
Contributor

Thanks for laying this out Julius! And Deepak, would specifying that apply to both velocity and tracer?

@dcherian
Copy link
Contributor

I read it as saying: take any variable currently on x: left and make it x: outer on the returned grid and dataset.

@jbusecke
Copy link
Contributor

jbusecke commented Aug 2, 2021

I think that is right. We just want to operate based on grid position. The variable can be whatever is located on that grid position, it should not matter for this feature.

@jdldeauna
Copy link
Contributor

Hi! I'm trying using isel with a test grid object:

from xgcm import Grid
from xgcm.test.datasets import datasets_grid_metric

ds, coords, metrics = datasets_grid_metric("C")
grid = Grid(ds, coords=coords, metrics=metrics)
test = grid._ds.tracer.isel(xt=slice(1,2))

print(ds.tracer.coords['xt'])
print(test.coords['xt'])

Output:

Coordinates:
  * xt       (xt) int64 0 1 2 3
Coordinates:
  * xt       (xt) int64 1

To confirm, ideally this is what the grid.isel method should be able do:

grid_new = grid.isel({'X':slice(1,2)})
print(grid_new._ds.tracer.coords['xt'])
Coordinates:
  * xt       (xt) int64 1 2 # should include outer value?

So how can the xgcm standard positions ({"X": {"left": "outer"}}) be combined with the xarray isel and the slice functions when subsetting variables?

@jbusecke
Copy link
Contributor

jbusecke commented Sep 7, 2021

Hey @jdldeauna, I believe that this is the expected behavior when passing a slice object, and I would be hesitant to change this. I think you could pass a list of indicies instead ...isel(xt=[1,2]) to achieve what you want here.

For the more general question So how can the xgcm standard positions ({"X": {"left": "outer"}}) be combined with the xarray isel and the slice functions when subsetting variables? We should consider all the possible inputs for .isel in xarray. From the xarray docs we get:

A dict with keys matching dimensions and values given by integers, slice objects or arrays. indexer can be a integer, slice, array-like or DataArray.

As a start, I would try to come up with some experimental logic to convert all of these inputs to the desired output. E.g. some function like this (pseudo-code):

def get_grid_indexer(grid, ds, axis_idx_dict, reference_position='center'):
    """ grid is just a placeholder for self later, ds is an input dataset, axis_idx_dict is something like you outlined above {'X': slice(1,2), 'Y':4}
    and reference_position is either a str or dict (per axis) that tells this function which dimension should be used for the initial selection (xt in your above example)"""
    for axis, indexer in axis_idx_dict.items():
         # check if axis in grid object
         # figure out all the dimensions and grid positions in `ds` that are associated with this axis
         dimensions, grid_positions = ....()
         # loop over all positions and apply appropriate selection
         for di, grid_position in zip(dimensions, grid_positions):
               if grid_position == reference position:
                     # simplest case, just apply the selection as is
                     ds = ds.isel({dim:indexer}
               else:
                     # Implement a 'translation' logic to convert `indexer` to other grid positions (needs to cover all possible combinations)
                     if reference_position == 'center' and grid_position == 'outer':
                          modified_indexer = ... # if you pass [1, 2] as the initial indexer you want to get something like [0,1,2] out here
                          # apply the selection
                          ds = ds.isel({dim: modified_indexer})

You will have to see how to make this work with all the inputs from above. Maybe xarrays internal logic has some things we could adapt. I am personally not very familiar with the indexing logic of xarray, but maybe @dcherian knows more and has some ideas how to make this easier (this is a pretty brute force approach but should work IMO).

As a sidenote, I think we might want to disallow indexing with Dataarrays for now, and keep that for later?

@jdldeauna
Copy link
Contributor

I see, thanks for the pseudo-code Julius!

As a sidenote, I think we might want to disallow indexing with Dataarrays for now, and keep that for later?

Sure! I was trying it out as part of testing code.

@dcherian
Copy link
Contributor

Here is some related work: https://xarray-subset-grid.readthedocs.io/en/latest/index.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants