apply_ufunc(dask='parallelized') output_dtypes for datasets #1699

Open
@crusaderky

Description

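When a Dataset has variables with different dtypes, there is no way to tell apply_ufunc that the same function will produce a different dtype for each variable. In the example below, output_dtypes=[float] is applied to both variables, so the lazy result reports a as float64 even though it computes to int64: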

import xarray

# 'a' holds integers, 'b' holds floats
ds1 = xarray.Dataset(data_vars={'a': ('x', [1, 2]), 'b': ('x', [3.0, 4.5])}).chunk()
ds2 = xarray.apply_ufunc(lambda x: x + 1, ds1, dask='parallelized', output_dtypes=[float])
ds2

<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) float64 dask.array<shape=(2,), chunksize=(2,)>
    b        (x) float64 dask.array<shape=(2,), chunksize=(2,)>

ds2.compute()

<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 2 3
    b        (x) float64 4.0 5.5

Proposed solution

When the output is a Dataset, apply_ufunc could accept either output_dtypes=[t] (when all output variables share one dtype) or output_dtypes=[{var1: t1, var2: t2, ...}] (one dtype per variable). In the example above, that would be output_dtypes=[{'a': int, 'b': float}].
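
To make the proposed semantics concrete, here is a minimal sketch of how the two accepted forms could be normalised into one dtype per variable. This is not xarray code: `resolve_output_dtypes` is a hypothetical helper name, and raising an error when a dict omits a variable is an assumption, not something specified in the proposal.

```python
def resolve_output_dtypes(output_dtypes, var_names):
    """Hypothetical helper (not part of xarray).

    Expands each entry of output_dtypes into a {variable name: dtype} dict.
    An entry is either a single dtype, applied to every variable, or a dict
    that already maps variable names to dtypes.
    """
    resolved = []
    for entry in output_dtypes:
        if isinstance(entry, dict):
            # Assumption: every output variable must be listed explicitly.
            missing = [name for name in var_names if name not in entry]
            if missing:
                raise ValueError(f"no output dtype given for variables: {missing}")
            resolved.append({name: entry[name] for name in var_names})
        else:
            # Single dtype: broadcast it to all variables.
            resolved.append({name: entry for name in var_names})
    return resolved


# Both forms from the proposal, using the variable names of the example.
uniform = resolve_output_dtypes([float], ['a', 'b'])
per_var = resolve_output_dtypes([{'a': int, 'b': float}], ['a', 'b'])
```

With something along these lines, the existing output_dtypes=[float] form would keep its current meaning, while the dict form would let a keep an integer dtype.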
