Commit 4cb8ddd

brokkoli71, normanrz, and dstansby authored
Add default compressors to config (#2470)
* add default compressor to config
* modify _default_compressor to _default_filters_and_compressor
* fix test_metadata_to_dict
* wip debugging
* format
* fix v2 decode string dtype
* fix config default tests
* format
* Update src/zarr/codecs/_v2.py
* rename v2_dtype_kind_to_default_filters_and_compressor to v2_default_compressors
* recover test_v2.py
* incorporate feedback
* incorporate feedback
* fix mypy
* allow only one default compressor
* put `v2_default_compressor` under `array`
* deprecate zarr.storage.default_compressor
* test v3_default_codecs
* use v3_default_codecs
* fix tests that expected codecs==["bytes"]
* fix test_default_codecs
* fail-fast: false
* fix string codecs for np1.25
* format
* add docstrings to create in asynchronous.py and array.py
* add docstrings to creation in group.py
* Apply suggestions from code review

Co-authored-by: David Stansby <dstansby@gmail.com>

* apply suggestions from review
* correct code double backticks
* correct attribute links in docstring
* link zarr.core.config in docstrings
* improve docstring readability
* correct config docstring
* correct config docstring
* improve config docstring

---------

Co-authored-by: Norman Rzepka <code@normanrz.com>
Co-authored-by: David Stansby <dstansby@gmail.com>
1 parent 1cc3917 commit 4cb8ddd

14 files changed: +529 -150 lines changed

src/zarr/api/asynchronous.py

Lines changed: 48 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -17,10 +17,12 @@
1717
ChunkCoords,
1818
MemoryOrder,
1919
ZarrFormat,
20+
parse_dtype,
2021
)
2122
from zarr.core.config import config
2223
from zarr.core.group import AsyncGroup, ConsolidatedMetadata, GroupMetadata
2324
from zarr.core.metadata import ArrayMetadataDict, ArrayV2Metadata, ArrayV3Metadata
25+
from zarr.core.metadata.v2 import _default_filters_and_compressor
2426
from zarr.errors import NodeTypeValidationError
2527
from zarr.storage import (
2628
StoreLike,
@@ -401,7 +403,7 @@ async def save_array(
401403
arr : ndarray
402404
NumPy array with data to save.
403405
zarr_format : {2, 3, None}, optional
404-
The zarr format to use when saving.
406+
The zarr format to use when saving (default is 3 if not specified).
405407
path : str or None, optional
406408
The path within the store where the array will be saved.
407409
storage_options : dict
@@ -817,19 +819,45 @@ async def create(
817819
shape : int or tuple of ints
818820
Array shape.
819821
chunks : int or tuple of ints, optional
820-
Chunk shape. If True, will be guessed from `shape` and `dtype`. If
821-
False, will be set to `shape`, i.e., single chunk for the whole array.
822-
If an int, the chunk size in each dimension will be given by the value
823-
of `chunks`. Default is True.
822+
The shape of the array's chunks.
823+
V2 only. V3 arrays should use `chunk_shape` instead.
824+
If not specified, default values are guessed based on the shape and dtype.
824825
dtype : str or dtype, optional
825826
NumPy dtype.
827+
chunk_shape : int or tuple of ints, optional
828+
The shape of the Array's chunks (default is None).
829+
V3 only. V2 arrays should use `chunks` instead.
830+
chunk_key_encoding : ChunkKeyEncoding, optional
831+
A specification of how the chunk keys are represented in storage.
832+
V3 only. V2 arrays should use `dimension_separator` instead.
833+
Default is ``("default", "/")``.
834+
codecs : Sequence of Codecs or dicts, optional
835+
An iterable of Codec or dict serializations of Codecs. The elements of
836+
this collection specify the transformation from array values to stored bytes.
837+
V3 only. V2 arrays should use ``filters`` and ``compressor`` instead.
838+
839+
If no codecs are provided, default codecs will be used:
840+
841+
- For numeric arrays, the default is ``BytesCodec`` and ``ZstdCodec``.
842+
- For Unicode strings, the default is ``VLenUTF8Codec``.
843+
- For bytes or objects, the default is ``VLenBytesCodec``.
844+
845+
These defaults can be changed by modifying the value of ``array.v3_default_codecs`` in :mod:`zarr.core.config`.
826846
compressor : Codec, optional
827-
Primary compressor.
828-
fill_value : object
847+
Primary compressor to compress chunk data.
848+
V2 only. V3 arrays should use ``codecs`` instead.
849+
850+
If neither ``compressor`` nor ``filters`` are provided, a default compressor will be used:
851+
852+
- For numeric arrays, the default is ``ZstdCodec``.
853+
- For Unicode strings, the default is ``VLenUTF8Codec``.
854+
- For bytes or objects, the default is ``VLenBytesCodec``.
855+
856+
These defaults can be changed by modifying the value of ``array.v2_default_compressor`` in :mod:`zarr.core.config`. fill_value : object
829857
Default value to use for uninitialized portions of the array.
830858
order : {'C', 'F'}, optional
831859
Memory layout to be used within each chunk.
832-
Default is set in Zarr's config (`array.order`).
860+
If not specified, default is taken from the Zarr config ``array.order``.
833861
store : Store or str
834862
Store or path to directory in file system or name of zip file.
835863
synchronizer : object, optional
@@ -844,6 +872,8 @@ async def create(
844872
for storage of both chunks and metadata.
845873
filters : sequence of Codecs, optional
846874
Sequence of filters to use to encode chunk data prior to compression.
875+
V2 only. If neither ``compressor`` nor ``filters`` are provided, a default
876+
compressor will be used. (see ``compressor`` for details).
847877
cache_metadata : bool, optional
848878
If True, array configuration metadata will be cached for the
849879
lifetime of the object. If False, array metadata will be reloaded
@@ -859,7 +889,8 @@ async def create(
859889
A codec to encode object arrays, only needed if dtype=object.
860890
dimension_separator : {'.', '/'}, optional
861891
Separator placed between the dimensions of a chunk.
862-
892+
V2 only. V3 arrays should use ``chunk_key_encoding`` instead.
893+
Default is ".".
863894
.. versionadded:: 2.8
864895
865896
write_empty_chunks : bool, optional
@@ -875,6 +906,7 @@ async def create(
875906
876907
zarr_format : {2, 3, None}, optional
877908
The zarr format to use when saving.
909+
Default is 3.
878910
meta_array : array-like, optional
879911
An array instance to use for determining arrays to create and return
880912
to users. Use `numpy.empty(())` by default.
@@ -894,9 +926,13 @@ async def create(
894926
or _default_zarr_version()
895927
)
896928

897-
if zarr_format == 2 and chunks is None:
898-
chunks = shape
899-
elif zarr_format == 3 and chunk_shape is None:
929+
if zarr_format == 2:
930+
if chunks is None:
931+
chunks = shape
932+
dtype = parse_dtype(dtype, zarr_format)
933+
if not filters and not compressor:
934+
filters, compressor = _default_filters_and_compressor(dtype)
935+
elif zarr_format == 3 and chunk_shape is None: # type: ignore[redundant-expr]
900936
if chunks is not None:
901937
chunk_shape = chunks
902938
chunks = None
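
The new Zarr format 2 branch in this file only falls back to defaults when the caller supplied neither `filters` nor `compressor`. That precedence can be sketched roughly as below. `resolve_v2_defaults` and `fake_default` are hypothetical names used for illustration, and the placeholder pair returned by `fake_default` is an assumption, since the body of `_default_filters_and_compressor` is not shown in this file's diff.

```python
from typing import Any, Callable, Optional

Codec = dict[str, Any]
FiltersAndCompressor = tuple[Optional[list[Codec]], Optional[Codec]]


def resolve_v2_defaults(
    filters: Optional[list[Codec]],
    compressor: Optional[Codec],
    default_factory: Callable[[], FiltersAndCompressor],
) -> FiltersAndCompressor:
    """Return the caller's choices, or defaults if both are missing or empty."""
    # Mirrors ``if not filters and not compressor`` from the diff: an empty
    # list and None are both treated as "not provided".
    if not filters and not compressor:
        return default_factory()
    return filters, compressor


def fake_default() -> FiltersAndCompressor:
    # Placeholder pair (assumption); the real helper derives it from the dtype.
    return None, {"id": "zstd"}


# Nothing supplied: the defaults are used.
print(resolve_v2_defaults(None, None, fake_default))
# Only filters supplied: the caller's values are kept and no default is added.
print(resolve_v2_defaults([{"id": "delta"}], None, fake_default))
```

One consequence of this check, as written in the diff, is that passing only `filters` (or only `compressor`) leaves the other argument untouched rather than filling it with a default.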

src/zarr/codecs/__init__.py

Lines changed: 0 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,5 @@
11
from __future__ import annotations
22

3-
from typing import TYPE_CHECKING, Any
4-
5-
if TYPE_CHECKING:
6-
import numpy as np
7-
83
from zarr.codecs.blosc import BloscCname, BloscCodec, BloscShuffle
94
from zarr.codecs.bytes import BytesCodec, Endian
105
from zarr.codecs.crc32c_ import Crc32cCodec
@@ -13,7 +8,6 @@
138
from zarr.codecs.transpose import TransposeCodec
149
from zarr.codecs.vlen_utf8 import VLenBytesCodec, VLenUTF8Codec
1510
from zarr.codecs.zstd import ZstdCodec
16-
from zarr.core.metadata.v3 import DataType
1711

1812
__all__ = [
1913
"BloscCname",
@@ -30,15 +24,3 @@
3024
"VLenUTF8Codec",
3125
"ZstdCodec",
3226
]
33-
34-
35-
def _get_default_array_bytes_codec(
36-
np_dtype: np.dtype[Any],
37-
) -> BytesCodec | VLenUTF8Codec | VLenBytesCodec:
38-
dtype = DataType.from_numpy(np_dtype)
39-
if dtype == DataType.string:
40-
return VLenUTF8Codec()
41-
elif dtype == DataType.bytes:
42-
return VLenBytesCodec()
43-
else:
44-
return BytesCodec()

src/zarr/codecs/_v2.py

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@
55
from typing import TYPE_CHECKING
66

77
import numcodecs
8+
import numpy as np
89
from numcodecs.compat import ensure_bytes, ensure_ndarray_like
910

1011
from zarr.abc.codec import ArrayBytesCodec
@@ -46,7 +47,17 @@ async def _decode_single(
4647
# special case object dtype, because incorrect handling can lead to
4748
# segfaults and other bad things happening
4849
if chunk_spec.dtype != object:
49-
chunk = chunk.view(chunk_spec.dtype)
50+
try:
51+
chunk = chunk.view(chunk_spec.dtype)
52+
except TypeError:
53+
# this will happen if the dtype of the chunk
54+
# does not match the dtype of the array spec, e.g. if
55+
# the dtype of the chunk_spec is a string dtype, but the chunk
56+
# is an object array. In this case, we need to convert the object
57+
# array to the correct dtype.
58+
59+
chunk = np.array(chunk).astype(chunk_spec.dtype)
60+
5061
elif chunk.dtype != object:
5162
# If we end up here, someone must have hacked around with the filters.
5263
# We cannot deal with object arrays unless there is an object
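
The `try`/`except TypeError` fallback added to `_decode_single` can be illustrated in isolation. This is a minimal sketch, assuming only NumPy; `view_or_cast` is a hypothetical helper name and is not part of zarr. It shows that reinterpreting an object array with `view` raises `TypeError`, after which an element-wise `astype` conversion is used instead.

```python
import numpy as np


def view_or_cast(chunk: np.ndarray, dtype: np.dtype) -> np.ndarray:
    """Reinterpret ``chunk`` as ``dtype``, converting element-wise if that fails."""
    try:
        # Cheap path: reinterpret the existing buffer without copying.
        return chunk.view(dtype)
    except TypeError:
        # Object arrays hold references, not raw values, so they cannot be
        # reinterpreted; convert each element to the target dtype instead.
        return np.array(chunk).astype(dtype)


# An object array of Python strings cannot be viewed as a fixed-width string dtype.
decoded = np.array(["foo", "bar"], dtype=object)
result = view_or_cast(decoded, np.dtype("<U3"))
print(result.dtype, result.tolist())

# A plain byte buffer takes the cheap ``view`` path.
raw = np.arange(8, dtype="uint8")
as_ints = view_or_cast(raw, np.dtype("<i4"))
print(as_ints.dtype, as_ints.shape)
```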

src/zarr/core/array.py

Lines changed: 103 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,6 @@
1313

1414
from zarr._compat import _deprecate_positional_args
1515
from zarr.abc.store import Store, set_or_delete
16-
from zarr.codecs import _get_default_array_bytes_codec
1716
from zarr.codecs._v2 import V2Codec
1817
from zarr.core._info import ArrayInfo
1918
from zarr.core.attributes import Attributes
@@ -78,7 +77,8 @@
7877
ArrayV3MetadataDict,
7978
T_ArrayMetadata,
8079
)
81-
from zarr.core.metadata.v3 import parse_node_type_array
80+
from zarr.core.metadata.v2 import _default_filters_and_compressor
81+
from zarr.core.metadata.v3 import DataType, parse_node_type_array
8282
from zarr.core.sync import sync
8383
from zarr.errors import MetadataValidationError
8484
from zarr.registry import get_pipeline_class
@@ -409,27 +409,53 @@ async def create(
409409
attributes : dict[str, JSON], optional
410410
The attributes of the array (default is None).
411411
chunk_shape : ChunkCoords, optional
412-
The shape of the array's chunks (default is None).
412+
The shape of the array's chunks.
413+
V3 only. V2 arrays should use `chunks` instead.
414+
If not specified, default values are guessed based on the shape and dtype.
413415
chunk_key_encoding : ChunkKeyEncoding, optional
414-
The chunk key encoding (default is None).
415-
codecs : Iterable[Codec | dict[str, JSON]], optional
416-
The codecs used to encode the data (default is None).
416+
A specification of how the chunk keys are represented in storage.
417+
V3 only. V2 arrays should use `dimension_separator` instead.
418+
Default is ``("default", "/")``.
419+
codecs : Sequence of Codecs or dicts, optional
420+
An iterable of Codec or dict serializations of Codecs. The elements of
421+
this collection specify the transformation from array values to stored bytes.
422+
V3 only. V2 arrays should use ``filters`` and ``compressor`` instead.
423+
424+
If no codecs are provided, default codecs will be used:
425+
426+
- For numeric arrays, the default is ``BytesCodec`` and ``ZstdCodec``.
427+
- For Unicode strings, the default is ``VLenUTF8Codec``.
428+
- For bytes or objects, the default is ``VLenBytesCodec``.
429+
430+
These defaults can be changed by modifying the value of ``array.v3_default_codecs`` in :mod:`zarr.core.config`.
417431
dimension_names : Iterable[str], optional
418432
The names of the dimensions (default is None).
433+
V3 only. V2 arrays should not use this parameter.
419434
chunks : ShapeLike, optional
420-
The shape of the array's chunks (default is None).
421-
V2 only. V3 arrays should not have 'chunks' parameter.
435+
The shape of the array's chunks.
436+
V2 only. V3 arrays should use ``chunk_shape`` instead.
437+
If not specified, default values are guessed based on the shape and dtype.
422438
dimension_separator : Literal[".", "/"], optional
423-
The dimension separator (default is None).
424-
V2 only. V3 arrays cannot have a dimension separator.
439+
The dimension separator (default is ".").
440+
V2 only. V3 arrays should use ``chunk_key_encoding`` instead.
425441
order : Literal["C", "F"], optional
426-
The order of the array (default is None).
442+
The order of the array (default is specified by ``array.order`` in :mod:`zarr.core.config`).
427443
filters : list[dict[str, JSON]], optional
428-
The filters used to compress the data (default is None).
429-
V2 only. V3 arrays should not have 'filters' parameter.
444+
Sequence of filters to use to encode chunk data prior to compression.
445+
V2 only. V3 arrays should use ``codecs`` instead. If neither ``compressor``
446+
nor ``filters`` are provided, a default compressor will be used. (see
447+
``compressor`` for details)
430448
compressor : dict[str, JSON], optional
431449
The compressor used to compress the data (default is None).
432-
V2 only. V3 arrays should not have 'compressor' parameter.
450+
V2 only. V3 arrays should use ``codecs`` instead.
451+
452+
If neither ``compressor`` nor ``filters`` are provided, a default compressor will be used:
453+
454+
- For numeric arrays, the default is ``ZstdCodec``.
455+
- For Unicode strings, the default is ``VLenUTF8Codec``.
456+
- For bytes or objects, the default is ``VLenBytesCodec``.
457+
458+
These defaults can be changed by modifying the value of ``array.v2_default_compressor`` in :mod:`zarr.core.config`.
433459
overwrite : bool, optional
434460
Whether to raise an error if the store already exists (default is False).
435461
data : npt.ArrayLike, optional
@@ -494,14 +520,6 @@ async def create(
494520
order=order,
495521
)
496522
elif zarr_format == 2:
497-
if dtype is str or dtype == "str":
498-
# another special case: zarr v2 added the vlen-utf8 codec
499-
vlen_codec: dict[str, JSON] = {"id": "vlen-utf8"}
500-
if filters and not any(x["id"] == "vlen-utf8" for x in filters):
501-
filters = list(filters) + [vlen_codec]
502-
else:
503-
filters = [vlen_codec]
504-
505523
if codecs is not None:
506524
raise ValueError(
507525
"codecs cannot be used for arrays with version 2. Use filters and compressor instead."
@@ -564,11 +582,7 @@ async def _create_v3(
564582
await ensure_no_existing_node(store_path, zarr_format=3)
565583

566584
shape = parse_shapelike(shape)
567-
codecs = (
568-
list(codecs)
569-
if codecs is not None
570-
else [_get_default_array_bytes_codec(np.dtype(dtype))]
571-
)
585+
codecs = list(codecs) if codecs is not None else _get_default_codecs(np.dtype(dtype))
572586

573587
if chunk_key_encoding is None:
574588
chunk_key_encoding = ("default", "/")
@@ -634,6 +648,14 @@ async def _create_v2(
634648
if dimension_separator is None:
635649
dimension_separator = "."
636650

651+
dtype = parse_dtype(dtype, zarr_format=2)
652+
if not filters and not compressor:
653+
filters, compressor = _default_filters_and_compressor(dtype)
654+
if np.issubdtype(dtype, np.str_):
655+
filters = filters or []
656+
if not any(x["id"] == "vlen-utf8" for x in filters):
657+
filters = list(filters) + [{"id": "vlen-utf8"}]
658+
637659
metadata = ArrayV2Metadata(
638660
shape=shape,
639661
dtype=np.dtype(dtype),
@@ -1493,23 +1515,53 @@ def create(
14931515
dtype : npt.DTypeLike
14941516
The data type of the array.
14951517
chunk_shape : ChunkCoords, optional
1496-
The shape of the Array's chunks (default is None).
1518+
The shape of the Array's chunks.
1519+
V3 only. V2 arrays should use `chunks` instead.
1520+
If not specified, default values are guessed based on the shape and dtype.
14971521
chunk_key_encoding : ChunkKeyEncoding, optional
1498-
The chunk key encoding (default is None).
1499-
codecs : Iterable[Codec | dict[str, JSON]], optional
1500-
The codecs used to encode the data (default is None).
1522+
A specification of how the chunk keys are represented in storage.
1523+
V3 only. V2 arrays should use `dimension_separator` instead.
1524+
Default is ``("default", "/")``.
1525+
codecs : Sequence of Codecs or dicts, optional
1526+
An iterable of Codec or dict serializations of Codecs. The elements of
1527+
this collection specify the transformation from array values to stored bytes.
1528+
V3 only. V2 arrays should use ``filters`` and ``compressor`` instead.
1529+
1530+
If no codecs are provided, default codecs will be used:
1531+
1532+
- For numeric arrays, the default is ``BytesCodec`` and ``ZstdCodec``.
1533+
- For Unicode strings, the default is ``VLenUTF8Codec``.
1534+
- For bytes or objects, the default is ``VLenBytesCodec``.
1535+
1536+
These defaults can be changed by modifying the value of ``array.v3_default_codecs`` in :mod:`zarr.core.config`.
15011537
dimension_names : Iterable[str], optional
15021538
The names of the dimensions (default is None).
1539+
V3 only. V2 arrays should not use this parameter.
15031540
chunks : ChunkCoords, optional
1504-
The shape of the Array's chunks (default is None).
1541+
The shape of the array's chunks.
1542+
V2 only. V3 arrays should use ``chunk_shape`` instead.
1543+
If not specified, default values are guessed based on the shape and dtype.
15051544
dimension_separator : Literal[".", "/"], optional
1506-
The dimension separator (default is None).
1545+
The dimension separator (default is ".").
1546+
V2 only. V3 arrays should use ``chunk_key_encoding`` instead.
15071547
order : Literal["C", "F"], optional
1508-
The order of the array (default is None).
1548+
The order of the array (default is specified by ``array.order`` in :mod:`zarr.core.config`).
15091549
filters : list[dict[str, JSON]], optional
1510-
The filters used to compress the data (default is None).
1550+
Sequence of filters to use to encode chunk data prior to compression.
1551+
V2 only. V3 arrays should use ``codecs`` instead. If neither ``compressor``
1552+
nor ``filters`` are provided, a default compressor will be used. (see
1553+
``compressor`` for details)
15111554
compressor : dict[str, JSON], optional
1512-
The compressor used to compress the data (default is None).
1555+
Primary compressor to compress chunk data.
1556+
V2 only. V3 arrays should use ``codecs`` instead.
1557+
1558+
If neither ``compressor`` nor ``filters`` are provided, a default compressor will be used:
1559+
1560+
- For numeric arrays, the default is ``ZstdCodec``.
1561+
- For Unicode strings, the default is ``VLenUTF8Codec``.
1562+
- For bytes or objects, the default is ``VLenBytesCodec``.
1563+
1564+
These defaults can be changed by modifying the value of ``array.v2_default_compressor`` in :mod:`zarr.core.config`.
15131565
overwrite : bool, optional
15141566
Whether to raise an error if the store already exists (default is False).
15151567
@@ -3342,3 +3394,18 @@ def _build_parents(
33423394
)
33433395

33443396
return parents
3397+
3398+
3399+
def _get_default_codecs(
3400+
np_dtype: np.dtype[Any],
3401+
) -> list[dict[str, JSON]]:
3402+
default_codecs = config.get("array.v3_default_codecs")
3403+
dtype = DataType.from_numpy(np_dtype)
3404+
if dtype == DataType.string:
3405+
dtype_key = "string"
3406+
elif dtype == DataType.bytes:
3407+
dtype_key = "bytes"
3408+
else:
3409+
dtype_key = "numeric"
3410+
3411+
return [{"name": codec_id, "configuration": {}} for codec_id in default_codecs[dtype_key]]
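
The lookup performed by `_get_default_codecs` can be sketched without zarr installed. In the sketch below, `V3_DEFAULT_CODECS` is a stand-in for the `array.v3_default_codecs` config value; its codec names are assumptions inferred from the docstrings in this commit. `dtype_key` and `default_codecs` are hypothetical names, and classifying by NumPy `kind` character is a simplification of the `DataType.from_numpy` check used in the diff.

```python
from typing import Any, Optional

# Stand-in for the ``array.v3_default_codecs`` config value (assumed contents).
V3_DEFAULT_CODECS: dict[str, list[str]] = {
    "numeric": ["bytes", "zstd"],
    "string": ["vlen-utf8"],
    "bytes": ["vlen-bytes"],
}


def dtype_key(kind: str) -> str:
    """Classify a NumPy dtype ``kind`` character (simplified assumption)."""
    if kind == "U":
        return "string"
    if kind in ("S", "O"):
        return "bytes"
    return "numeric"


def default_codecs(
    kind: str, defaults: Optional[dict[str, list[str]]] = None
) -> list[dict[str, Any]]:
    """Build codec dicts with empty configurations, as the diff does."""
    table = V3_DEFAULT_CODECS if defaults is None else defaults
    return [{"name": name, "configuration": {}} for name in table[dtype_key(kind)]]


print(default_codecs("f"))  # float arrays use the "numeric" entry
print(default_codecs("U"))  # Unicode string arrays use the "string" entry
```

Because every entry is expanded with an empty `configuration`, this setting selects codecs by name only; per-codec options would still need an explicit `codecs` argument.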
