best way to write a zarr store with a single chunk #6697
-
I'm working with some small files and I would like each array to be stored as a single chunk in its zarr store. The motivation is to reduce the overhead of GET and PUT calls. I'm also working on a large EC2 machine, so I can afford to read more data at once. I like the zarr format because it makes IO in the cloud easy. The default behavior produces many chunk files, e.g. 0.0.0, 0.1.0, 1.0.0, 1.1.0, 2.0.0, 2.1.0, 3.0.0, 3.1.0.
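To make the GET/PUT overhead concrete: a zarr store holds one object per chunk, and the chunk count per array is the product over dimensions of ceil(size / chunk). A quick sketch (the shapes and chunk sizes below are hypothetical, loosely modeled on the `air_temperature` tutorial dataset):

```python
import math

def n_chunk_files(shape, chunks):
    # Each dimension is split into ceil(size / chunk) pieces;
    # the store holds one object per combination of pieces.
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Hypothetical (time, lat, lon) array split 4 x 2 x 1 ways:
# eight objects, matching a listing like 0.0.0 ... 3.1.0
print(n_chunk_files((2920, 25, 53), (730, 13, 53)))   # 8

# The same array stored as a single chunk: one GET per read
print(n_chunk_files((2920, 25, 53), (2920, 25, 53)))  # 1
```

So collapsing to one chunk per array turns a cloud read of this variable into a single GET instead of eight.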
I currently create a zarr store with a single chunk using
I can't think of a better alternative, nor how I could specify this better in
Replies: 2 comments 2 replies
-
@raybellwaves, does the following accomplish what you are looking for?

```python
In [22]: chunks = dict.fromkeys(ds.dims, -1)

In [23]: chunks
Out[23]: {'lat': -1, 'time': -1, 'lon': -1}

In [25]: ds.chunk(chunks).to_zarr("/tmp/test2.zarr", consolidated=True)

In [31]: !tree /tmp/test2.zarr
/tmp/test2.zarr
├── air
│   └── 0.0.0
├── lat
│   └── 0
├── lon
│   └── 0
└── time
    └── 0
```
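The `dict.fromkeys(ds.dims, -1)` trick can be tried without xarray installed; here is a minimal sketch where a hypothetical tuple stands in for `ds.dims`:

```python
# Hypothetical dims, standing in for ds.dims of the air_temperature dataset.
dims = ("lat", "time", "lon")

# A chunk size of -1 asks ds.chunk() to put the whole dimension in one chunk,
# so this mapping requests exactly one chunk per array.
chunks = dict.fromkeys(dims, -1)
print(chunks)  # {'lat': -1, 'time': -1, 'lon': -1}
```

`dict.fromkeys` just maps every key to the same value, which is why it reads so cleanly here.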
-
I would definitely go with @andersy005's answer above! However, in the scenario where you want to avoid Dask entirely, you can also use:

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
for var in ds.variables:
    ds[var].encoding['chunks'] = ds[var].shape
ds.to_zarr("test2.zarr", consolidated=True)
!tree test2.zarr
```
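The effect of that loop can be sketched without xarray at all. Below, a plain dict stands in for the dataset's variables (the names and shapes are hypothetical, mirroring the `air_temperature` tutorial dataset), showing the encoding the loop produces:

```python
# Stand-in for ds.variables: variable name -> array shape.
# Names and shapes are hypothetical, mirroring the air_temperature dataset.
var_shapes = {
    "air": (2920, 25, 53),
    "lat": (25,),
    "lon": (53,),
    "time": (2920,),
}

# Setting each variable's chunk encoding to its full shape means
# to_zarr() writes that variable as a single chunk file.
encoding = {var: {"chunks": shape} for var, shape in var_shapes.items()}
print(encoding["air"])  # {'chunks': (2920, 25, 53)}
```

This mirrors the `tree` output in the accepted answer: one `0.0.0` file for the 3-D variable and one `0` file per coordinate.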