
BackendEntrypoint bug chunking data #8810

Open
timvgl opened this issue Mar 6, 2024 · 7 comments
Comments


timvgl commented Mar 6, 2024

What happened?

Hi all,
I have written a class for the BackendEntrypoint. Since the data can be quite large, large parts of it use the dask.delayed function. As one of the last steps I redefine the chunks. The code loads single files and concatenates them into an xarray Dataset, so it makes sense to chunk per file. This chunking is not forwarded to the main program, where the backend is used:
dset = dset.chunk(dict(zip(dims, list(reversed([list(reversed(list(shape)))[i] if i < 4 else 1 for i in range(len(list(shape)))])))))
return dset
https://github.com/timvgl/mumaxXR/blob/daa4345b482197b65a7bd88df1552d9fa48b8bf7/src/mumaxXR/OvfEngine.py#L356-L357
Between dset = ... and return dset, dset.chunksizes gives the expected values. In the main program, however, the chunking is no longer there and dset.chunksizes gives Frozen({}).
Thanks for your help!
Tim

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

xarray version: 2023.1.0
Linux

@timvgl timvgl added bug needs triage Issue that has not been reviewed by xarray team member labels Mar 6, 2024

welcome bot commented Mar 6, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas TomNicholas added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 6, 2024
@max-sixty (Collaborator)

Thanks for the issue, but we really need a more focused example to make progress here.

@max-sixty max-sixty added the plan to close May be closeable, needs more eyeballs label Mar 6, 2024
keewis (Collaborator) commented Mar 6, 2024

I agree that debugging the somewhat complex piece of code you linked to is a bit much to ask for an issue (especially given that we're already somewhat behind in answering some issues).

However, glancing over the code, I would point you to the preferred chunksizes section in the docs, in particular this sentence:

Note that the backend is not directly involved in Dask chunking, because Xarray internally manages chunking.

In other words, your backend is supposed to use the lazy indexing classes mentioned in that document, and you will get the dask handling for free. If that doesn't work for you (for whatever reason), remember that having a function separate from xarray's backend machinery is totally fine as well.
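
For concreteness, here is a minimal sketch of that approach, closely following the "how to add a new backend" tutorial in the xarray docs. The class name and the zeros payload are placeholders; a real backend would read the requested slab from the file instead (requires a reasonably recent xarray):

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendArray
from xarray.core import indexing


class LazyOvfArray(BackendArray):  # hypothetical name
    """Reads slabs on demand instead of materializing the file up front."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype

    def __getitem__(self, key):
        # translate xarray's explicit indexers into basic numpy-style indexing
        return indexing.explicit_indexing_adapter(
            key, self.shape, indexing.IndexingSupport.BASIC, self._raw_indexing_method
        )

    def _raw_indexing_method(self, key):
        # placeholder: a real backend would read only this slab from disk
        return np.zeros(self.shape, dtype=self.dtype)[key]


# wrapping in LazilyIndexedArray lets xarray (and open_dataset's chunks=
# argument) manage laziness and dask chunking for you
data = indexing.LazilyIndexedArray(LazyOvfArray((4, 3), np.dtype("float64")))
var = xr.Variable(("x", "y"), data)
print(var.shape)
```

With variables built this way, the caller can pass `chunks=` to `xr.open_dataset` and xarray takes care of the dask wrapping itself.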

timvgl (Author) commented Mar 6, 2024

Of course, debugging that code is not what I wanted you to do - you have other things to do. Thanks for the hint about the lazy indexing classes; I will check that out. I wrote a shorter example to show that this is a more general problem.
I am not sure whether chunking within the backend, using the chunking mechanism provided by xarray, is supposed to be supported.

import xarray as xr

class TestEngine(xr.backends.BackendEntrypoint):
    def open_dataset(
        self,
        filename_or_obj,
        *,
        drop_variables=None,
    ):
        dataset = xr.open_dataset(filename_or_obj)
        # chunk every dimension with chunk size 1 inside the backend
        dataset = dataset.chunk({dim: 1 for dim in dataset.dims})
        print(dataset.chunksizes)  # chunks are visible here
        return dataset

# path_nc_file is any existing netCDF file
dataset = xr.open_dataset(path_nc_file, engine=TestEngine)
print(dataset.chunksizes)  # prints Frozen({}) - the chunking is gone

In this case, the chunking applied inside the backend is not forwarded. Whether it should be is a decision you have to make.
If this is not wanted, this issue can be closed from my side.

keewis (Collaborator) commented Mar 6, 2024

Thanks for the minimal example. That reduces this issue to just "Should we allow backends to return chunked datasets?", which is much easier to discuss.

Back when we introduced the backends the current behavior was intentional, but that doesn't mean we can't change it (we'll first have to figure out whether that's a good idea, though).
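
For illustration, the currently supported route is to request chunking at the call site, either via the chunks argument of xr.open_dataset (which xarray applies itself after the backend returns) or via Dataset.chunk afterwards. A minimal sketch of the latter (assuming dask is installed):

```python
import numpy as np
import xarray as xr

# a small in-memory dataset standing in for whatever the backend returns
ds = xr.Dataset({"a": (("x", "y"), np.arange(12).reshape(3, 4))})
print(dict(ds.chunksizes))       # {} - not chunked yet

# chunking requested by the caller; equivalent in spirit to passing
# chunks={"x": 1} to xr.open_dataset
chunked = ds.chunk({"x": 1})
print(dict(chunked.chunksizes))  # {'x': (1, 1, 1), 'y': (4,)}
```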

@keewis keewis removed bug plan to close May be closeable, needs more eyeballs labels Mar 6, 2024
@Illviljan (Contributor)

The code is loading single files and concatenates them into a xarray.

Why don't you use open_mfdataset instead? That's xarray's recommended way of opening multiple files.

There's a nice tutorial on how to set up lazy loading in backends here:
https://tutorial.xarray.dev/advanced/backends/2.Backend_with_Lazy_Loading.html
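
As an illustration of that suggestion (the temporary netCDF files below are stand-ins for the real per-step output files), an open_mfdataset call could look like this; it needs dask and a netCDF-capable backend installed:

```python
import os
import tempfile

import xarray as xr

# create three tiny stand-in files (placeholder data)
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmpdir, f"step{i}.nc")
    xr.Dataset({"m": ("t", [float(i)])}).to_netcdf(path)
    paths.append(path)

# open and concatenate everything in one call; each input file becomes
# (at least) one dask chunk, and parallel=True would open files via dask.delayed
ds = xr.open_mfdataset(paths, combine="nested", concat_dim="t")
print(ds.sizes["t"])  # 3
```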

timvgl (Author) commented Mar 18, 2024

Yeah, I have already updated the code to use the lazy loading structure. I did use the mf function in the past. It is just more convenient if the user doesn't have to define args that are always the same (like parallel=True, concat_dim='t', etc.). There are also some other reasons for using the original function instead of mf, but that would go too far here.
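
Those always-identical arguments could also live in a thin wrapper function (hypothetical name open_mumax_runs), in line with the earlier suggestion that a function separate from the backend machinery is fine; callers keep the ability to override any default:

```python
import xarray as xr


def open_mumax_runs(paths, **kwargs):
    """Hypothetical convenience wrapper around xr.open_mfdataset that
    fills in the arguments that are always the same for this data layout."""
    defaults = dict(combine="nested", concat_dim="t", parallel=True)
    defaults.update(kwargs)  # explicit caller arguments win
    return xr.open_mfdataset(paths, **defaults)
```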
