Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why can't map_blocks() just infer the coordinate values of the output Dataset/DataArray, instead of requiring them in template ? #9635

Open
v-morello opened this issue Oct 16, 2024 · 3 comments

Comments

@v-morello
Copy link

What is your issue?

Hello,

I am not sure whether to treat this as a bug or as a feature request yet. The title says it all, and I've attached a minimum example below. Let's say I want to average some data by blocks, but also average its attached time coordinate:

import numpy as np
import xarray
import dask.array as da

xds = xarray.Dataset(
    data_vars={"data": (("time",), np.linspace(0.0, 1.0, 10, endpoint=False))},
    coords={"time": np.arange(10)},
)
xds = xds.chunk({"time": 5})  # want to process two separate time intervals


def my_average(dataset):
    mean_time = dataset.time.mean()
    return dataset.mean(dim="time", keepdims=True).assign_coords({"time": [mean_time]})


### Hoping that map_blocks fills the output time coordinate, error
template = xarray.Dataset(
    data_vars={
        "data": (("time",), da.zeros(shape=(2,)))
    },
    coords={
        "time": da.zeros(shape=(2,)),
    },
)
template = template.chunk({"time": 1})  # align output chunks with input chunks

xarray.map_blocks(my_average, xds, template=template).compute()  # ValueError

The error is:

ValueError: Expected index 'time' to be PandasIndex(Index([0.0], dtype='float64', name='time')). Received PandasIndex(Index([7.0], dtype='float64', name='time')) instead.

Everything works as expected if I provide the correct time coordinate values in the template argument. In fact, the example shown in the function documentation does precisely this, but does not explain why it is necessary.

### Correct output time coordinate provided in advance, works
template = xarray.Dataset(
    data_vars={"data": (("time",), da.zeros(shape=(2,)))},
    coords={"time": [2.0, 7.0]},
)
template = template.chunk({"time": 1}) 

result = xarray.map_blocks(my_average, xds, template=template).compute()  # OK

I understand why "template" must specify the length of every output dimension and be chunked properly. However, is there a fundamental reason why coordinate values are treated differently from data variables here? I am wondering if that limitation could be removed in principle, because that would help our actual use case (and we might be able to help with a hypothetical fix).

@v-morello v-morello added the needs triage Issue that has not been reviewed by xarray team member label Oct 16, 2024
Copy link

welcome bot commented Oct 16, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@dcherian dcherian added usage question and removed needs triage Issue that has not been reviewed by xarray team member labels Oct 22, 2024
@dcherian
Copy link
Contributor

Because to infer these values, we'd have to execute the whole computation. The model here is to stay lazy, which requires that the user specify a lot of information in the template.

More generally, since time is an "indexed coordinate" here, we must know its values eagerly at least for now.

What can we add to the docstring to make this clearer?

@v-morello
Copy link
Author

@dcherian Thanks for your reply!

The model here is to stay lazy.

OK, I get it now. If we require map_blocks() to be lazy, then for example .sel() still needs to work on the result without requiring execution of the map_blocks() call, and for that the coordinate value must already be present.

What can we add to the docstring to make this clearer?

I leave whether that is necessary to your better judgement (no one seems to have asked that question in the few years of existence of the map_blocks() function). Something along the lines of what I wrote just above would do the trick IMHO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants