Separate defaults for filters, serializers and compressors in v3 #2653

Merged
merged 11 commits into from
Jan 6, 2025
29 changes: 15 additions & 14 deletions docs/user-guide/config.rst
@@ -28,7 +28,7 @@ Configuration options include the following:

- Default Zarr format ``default_zarr_version``
- Default array order in memory ``array.order``
- Default codecs ``array.v3_default_codecs`` and ``array.v2_default_compressor``
- Default filters, serializers and compressors, e.g. ``array.v3_default_filters``, ``array.v3_default_serializer``, ``array.v3_default_compressors``, ``array.v2_default_filters`` and ``array.v2_default_compressor``
- Whether empty chunks are written to storage ``array.write_empty_chunks``
- Async and threading options, e.g. ``async.concurrency`` and ``threading.max_workers``
- Selections of implementations of codecs, codec pipelines and buffers
@@ -54,19 +54,20 @@ This is the current default configuration::
'v2_default_filters': {'bytes': [{'id': 'vlen-bytes'}],
'numeric': None,
'string': [{'id': 'vlen-utf8'}]},
'v3_default_codecs': {'bytes': [{'name': 'vlen-bytes'},
{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'numeric': [{'configuration': {'endian': 'little'},
'name': 'bytes'},
{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'string': [{'name': 'vlen-utf8'},
{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}]},
'v3_default_compressors': {'bytes': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'numeric': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'string': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}]},
'v3_default_filters': {'bytes': [], 'numeric': [], 'string': []},
'v3_default_serializer': {'bytes': {'name': 'vlen-bytes'},
'numeric': {'configuration': {'endian': 'little'},
'name': 'bytes'},
'string': {'name': 'vlen-utf8'}},
'write_empty_chunks': False},
'async': {'concurrency': 10, 'timeout': None},
'buffer': 'zarr.core.buffer.cpu.Buffer',
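
Reviewer note: a minimal sketch of how a user-customized ``array.v3_default_codecs`` value could be carried over to the three new keys. It is standalone Python and does not call zarr. The helper ``split_legacy_codecs`` and the ``SERIALIZER_NAMES`` set are hypothetical names introduced here for illustration, and the sketch assumes the array-to-bytes codec can be recognized by its ``name`` alone.

```python
# Hypothetical migration helper; not part of zarr.
# Assumption: these names identify array-to-bytes codecs (serializers).
SERIALIZER_NAMES = {"bytes", "vlen-utf8", "vlen-bytes"}

ZSTD = {"name": "zstd", "configuration": {"level": 0, "checksum": False}}

# Old-style value of ``array.v3_default_codecs``, copied from the removed default.
LEGACY = {
    "numeric": [{"name": "bytes", "configuration": {"endian": "little"}}, ZSTD],
    "string": [{"name": "vlen-utf8"}, ZSTD],
    "bytes": [{"name": "vlen-bytes"}, ZSTD],
}


def split_legacy_codecs(legacy):
    """Split {dtype_key: [codec, ...]} into filters, serializer and compressors."""
    filters, serializer, compressors = {}, {}, {}
    for dtype_key, codecs in legacy.items():
        # The single array-to-bytes codec is the pivot of the pipeline:
        # everything before it is a filter, everything after it a compressor.
        pivots = [i for i, c in enumerate(codecs) if c["name"] in SERIALIZER_NAMES]
        if len(pivots) != 1:
            raise ValueError(f"expected exactly one serializer for {dtype_key!r}")
        i = pivots[0]
        filters[dtype_key] = codecs[:i]
        serializer[dtype_key] = codecs[i]
        compressors[dtype_key] = codecs[i + 1:]
    return filters, serializer, compressors


filters, serializer, compressors = split_legacy_codecs(LEGACY)
print(filters)
print(serializer)
print(compressors)
```

Applied to the removed default, the split reproduces the three new default entries listed above, which is a quick way to check that the new layout carries the same information as the old one.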
3 changes: 2 additions & 1 deletion src/zarr/api/asynchronous.py
@@ -892,7 +892,8 @@ async def create(
- For Unicode strings, the default is ``VLenUTF8Codec`` and ``ZstdCodec``.
- For bytes or objects, the default is ``VLenBytesCodec`` and ``ZstdCodec``.

These defaults can be changed by modifying the value of ``array.v3_default_codecs`` in :mod:`zarr.core.config`.
These defaults can be changed by modifying the value of ``array.v3_default_filters``,
``array.v3_default_serializer`` and ``array.v3_default_compressors`` in :mod:`zarr.core.config`.
compressor : Codec, optional
Primary compressor to compress chunk data.
Zarr format 2 only. Zarr format 3 arrays should use ``codecs`` instead.
17 changes: 8 additions & 9 deletions src/zarr/api/synchronous.py
@@ -788,9 +788,8 @@ def create_array(
For Zarr format 3, a "filter" is a codec that takes an array and returns an array,
and these values must be instances of ``ArrayArrayCodec``, or dict representations
of ``ArrayArrayCodec``.
If ``filters`` and ``compressors`` are not specified, then the default codecs for
Zarr format 3 will be used.
These defaults can be changed by modifying the value of ``array.v3_default_codecs``
If no ``filters`` are provided, a default set of filters will be used.
These defaults can be changed by modifying the value of ``array.v3_default_filters``
in :mod:`zarr.core.config`.
Use ``None`` to omit default filters.

@@ -806,22 +805,22 @@ def create_array(

For Zarr format 3, a "compressor" is a codec that takes a bytestream, and
returns another bytestream. Multiple compressors may be provided for Zarr format 3.
If ``filters`` and ``compressors`` are not specified, then the default codecs for
Zarr format 3 will be used.
These defaults can be changed by modifying the value of ``array.v3_default_codecs``
If no ``compressors`` are provided, a default set of compressors will be used.
These defaults can be changed by modifying the value of ``array.v3_default_compressors``
in :mod:`zarr.core.config`.
Use ``None`` to omit default compressors.

For Zarr format 2, a "compressor" can be any numcodecs codec. Only a single compressor may
be provided for Zarr format 2.
If no ``compressors`` are provided, a default compressor will be used.
These defaults can be changed by modifying the value of ``array.v2_default_compressor``
If no ``compressor`` is provided, a default compressor will be used.
in :mod:`zarr.core.config`.
Use ``None`` to omit the default compressor.
serializer : dict[str, JSON] | ArrayBytesCodec, optional
Array-to-bytes codec to use for encoding the array data.
Zarr format 3 only. Zarr format 2 arrays use implicit array-to-bytes conversion.
If no ``serializer`` is provided, the `zarr.codecs.BytesCodec` codec will be used.
If no ``serializer`` is provided, a default serializer will be used.
These defaults can be changed by modifying the value of ``array.v3_default_serializer``
in :mod:`zarr.core.config`.
fill_value : Any, optional
Fill value for the array.
order : {"C", "F"}, optional
105 changes: 37 additions & 68 deletions src/zarr/core/array.py
@@ -110,7 +110,6 @@
_parse_array_array_codec,
_parse_array_bytes_codec,
_parse_bytes_bytes_codec,
_resolve_codec,
get_pipeline_class,
)
from zarr.storage import StoreLike, make_store_path
@@ -469,7 +468,8 @@ async def create(
- For Unicode strings, the default is ``VLenUTF8Codec`` and ``ZstdCodec``.
- For bytes or objects, the default is ``VLenBytesCodec`` and ``ZstdCodec``.

These defaults can be changed by modifying the value of ``array.v3_default_codecs`` in :mod:`zarr.core.config`.
These defaults can be changed by modifying the value of ``array.v3_default_filters``,
``array.v3_default_serializer`` and ``array.v3_default_compressors`` in :mod:`zarr.core.config`.
dimension_names : Iterable[str], optional
The names of the dimensions (default is None).
Zarr format 3 only. Zarr format 2 arrays should not use this parameter.
@@ -1715,7 +1715,8 @@ def create(
- For Unicode strings, the default is ``VLenUTF8Codec`` and ``ZstdCodec``.
- For bytes or objects, the default is ``VLenBytesCodec`` and ``ZstdCodec``.

These defaults can be changed by modifying the value of ``array.v3_default_codecs`` in :mod:`zarr.core.config`.
These defaults can be changed by modifying the value of ``array.v3_default_filters``,
``array.v3_default_serializer`` and ``array.v3_default_compressors`` in :mod:`zarr.core.config`.
dimension_names : Iterable[str], optional
The names of the dimensions (default is None).
Zarr format 3 only. Zarr format 2 arrays should not use this parameter.
@@ -3698,17 +3699,9 @@ def _build_parents(

def _get_default_codecs(
np_dtype: np.dtype[Any],
) -> list[dict[str, JSON]]:
default_codecs = zarr_config.get("array.v3_default_codecs")
dtype = DataType.from_numpy(np_dtype)
if dtype == DataType.string:
dtype_key = "string"
elif dtype == DataType.bytes:
dtype_key = "bytes"
else:
dtype_key = "numeric"

return cast(list[dict[str, JSON]], default_codecs[dtype_key])
) -> tuple[Codec, ...]:
filters, serializer, compressors = _get_default_chunk_encoding_v3(np_dtype)
return filters + (serializer,) + compressors


FiltersLike: TypeAlias = (
@@ -3785,9 +3778,8 @@ async def create_array(
For Zarr format 3, a "filter" is a codec that takes an array and returns an array,
and these values must be instances of ``ArrayArrayCodec``, or dict representations
of ``ArrayArrayCodec``.
If ``filters`` and ``compressors`` are not specified, then the default codecs for
Zarr format 3 will be used.
These defaults can be changed by modifying the value of ``array.v3_default_codecs``
If no ``filters`` are provided, a default set of filters will be used.
These defaults can be changed by modifying the value of ``array.v3_default_filters``
in :mod:`zarr.core.config`.
Use ``None`` to omit default filters.

@@ -3803,22 +3795,22 @@ async def create_array(

For Zarr format 3, a "compressor" is a codec that takes a bytestream, and
returns another bytestream. Multiple compressors may be provided for Zarr format 3.
If ``filters`` and ``compressors`` are not specified, then the default codecs for
Zarr format 3 will be used.
These defaults can be changed by modifying the value of ``array.v3_default_codecs``
If no ``compressors`` are provided, a default set of compressors will be used.
These defaults can be changed by modifying the value of ``array.v3_default_compressors``
in :mod:`zarr.core.config`.
Use ``None`` to omit default compressors.

For Zarr format 2, a "compressor" can be any numcodecs codec. Only a single compressor may
be provided for Zarr format 2.
If no ``compressors`` are provided, a default compressor will be used.
These defaults can be changed by modifying the value of ``array.v2_default_compressor``
If no ``compressor`` is provided, a default compressor will be used.
in :mod:`zarr.core.config`.
Use ``None`` to omit the default compressor.
serializer : dict[str, JSON] | ArrayBytesCodec, optional
Array-to-bytes codec to use for encoding the array data.
Zarr format 3 only. Zarr format 2 arrays use implicit array-to-bytes conversion.
If no ``serializer`` is provided, the `zarr.codecs.BytesCodec` codec will be used.
If no ``serializer`` is provided, a default serializer will be used.
These defaults can be changed by modifying the value of ``array.v3_default_serializer``
in :mod:`zarr.core.config`.
fill_value : Any, optional
Fill value for the array.
order : {"C", "F"}, optional
@@ -3997,7 +3989,6 @@ def _get_default_chunk_encoding_v3(
"""
Get the default ArrayArrayCodecs, ArrayBytesCodec, and BytesBytesCodec for a given dtype.
"""
default_codecs = zarr_config.get("array.v3_default_codecs")
dtype = DataType.from_numpy(np_dtype)
if dtype == DataType.string:
dtype_key = "string"
@@ -4006,31 +3997,15 @@ def _get_default_chunk_encoding_v3(
else:
dtype_key = "numeric"

codec_dicts = default_codecs[dtype_key]
codecs = tuple(_resolve_codec(c) for c in codec_dicts)
array_bytes_maybe = None
array_array: list[ArrayArrayCodec] = []
bytes_bytes: list[BytesBytesCodec] = []

for codec in codecs:
if isinstance(codec, ArrayBytesCodec):
if array_bytes_maybe is not None:
raise ValueError(
f"Got two instances of ArrayBytesCodec: {array_bytes_maybe} and {codec}. "
"Only one array-to-bytes codec is allowed."
)
array_bytes_maybe = codec
elif isinstance(codec, ArrayArrayCodec):
array_array.append(codec)
elif isinstance(codec, BytesBytesCodec):
bytes_bytes.append(codec)
else:
raise TypeError(f"Unexpected codec type: {type(codec)}")
default_filters = zarr_config.get("array.v3_default_filters").get(dtype_key)
default_serializer = zarr_config.get("array.v3_default_serializer").get(dtype_key)
default_compressors = zarr_config.get("array.v3_default_compressors").get(dtype_key)

if array_bytes_maybe is None:
raise ValueError("Required ArrayBytesCodec was not found.")
filters = tuple(_parse_array_array_codec(codec_dict) for codec_dict in default_filters)
serializer = _parse_array_bytes_codec(default_serializer)
compressors = tuple(_parse_bytes_bytes_codec(codec_dict) for codec_dict in default_compressors)

return tuple(array_array), array_bytes_maybe, tuple(bytes_bytes)
return filters, serializer, compressors
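
Reviewer note: the per-dtype lookup in this hunk, and the concatenation now done by ``_get_default_codecs``, can be sketched without zarr as follows. This is an illustration under stated assumptions, not the library's code: ``CONFIG`` is a plain dict standing in for ``zarr_config``, the type is passed as a plain string instead of a NumPy dtype, codecs stay as dicts rather than being parsed into codec objects, and the function names are hypothetical.

```python
# Standalone sketch; the values below copy the defaults introduced in this PR.
ZSTD = {"name": "zstd", "configuration": {"level": 0, "checksum": False}}

# Plain-dict stand-in for zarr's config object (an assumption of this sketch).
CONFIG = {
    "array.v3_default_filters": {"numeric": [], "string": [], "bytes": []},
    "array.v3_default_serializer": {
        "numeric": {"name": "bytes", "configuration": {"endian": "little"}},
        "string": {"name": "vlen-utf8"},
        "bytes": {"name": "vlen-bytes"},
    },
    "array.v3_default_compressors": {
        "numeric": [ZSTD],
        "string": [ZSTD],
        "bytes": [ZSTD],
    },
}


def dtype_key_for(type_name: str) -> str:
    """Mirror the if/elif/else above: anything that is not string or bytes is numeric."""
    if type_name in ("string", "bytes"):
        return type_name
    return "numeric"


def default_chunk_encoding(type_name: str):
    """Return (filters, serializer, compressors) for one dtype category."""
    key = dtype_key_for(type_name)
    filters = tuple(CONFIG["array.v3_default_filters"][key])
    serializer = CONFIG["array.v3_default_serializer"][key]
    compressors = tuple(CONFIG["array.v3_default_compressors"][key])
    return filters, serializer, compressors


def default_codecs(type_name: str):
    """Flatten the three parts into one pipeline: filters, then serializer, then compressors."""
    filters, serializer, compressors = default_chunk_encoding(type_name)
    return filters + (serializer,) + compressors


print(default_codecs("float64"))
print(default_codecs("string"))
```

Because each config key already holds exactly one kind of codec, the lookup no longer has to classify codecs by type or check that exactly one array-to-bytes codec is present, which is what the removed loop did.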


def _get_default_chunk_encoding_v2(
@@ -4111,34 +4086,15 @@ def _parse_chunk_encoding_v3(
default_array_array, default_array_bytes, default_bytes_bytes = _get_default_chunk_encoding_v3(
dtype
)
maybe_bytes_bytes: Iterable[Codec | dict[str, JSON]]
maybe_array_array: Iterable[Codec | dict[str, JSON]]
out_bytes_bytes: tuple[BytesBytesCodec, ...]
if compressors is None:
out_bytes_bytes = ()

elif compressors == "auto":
out_bytes_bytes = default_bytes_bytes

else:
if isinstance(compressors, dict | Codec):
maybe_bytes_bytes = (compressors,)
elif compressors is None:
maybe_bytes_bytes = ()
else:
maybe_bytes_bytes = cast(Iterable[Codec | dict[str, JSON]], compressors)

out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
out_array_array: tuple[ArrayArrayCodec, ...]
if filters is None:
out_array_array = ()
out_array_array: tuple[ArrayArrayCodec, ...] = ()
elif filters == "auto":
out_array_array = default_array_array
else:
maybe_array_array: Iterable[Codec | dict[str, JSON]]
if isinstance(filters, dict | Codec):
maybe_array_array = (filters,)
elif filters is None:
maybe_array_array = ()
else:
maybe_array_array = cast(Iterable[Codec | dict[str, JSON]], filters)
out_array_array = tuple(_parse_array_array_codec(c) for c in maybe_array_array)
@@ -4148,6 +4104,19 @@ def _parse_chunk_encoding_v3(
else:
out_array_bytes = _parse_array_bytes_codec(serializer)

if compressors is None:
out_bytes_bytes: tuple[BytesBytesCodec, ...] = ()
elif compressors == "auto":
out_bytes_bytes = default_bytes_bytes
else:
maybe_bytes_bytes: Iterable[Codec | dict[str, JSON]]
if isinstance(compressors, dict | Codec):
maybe_bytes_bytes = (compressors,)
else:
maybe_bytes_bytes = cast(Iterable[Codec | dict[str, JSON]], compressors)

out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)

return out_array_array, out_array_bytes, out_bytes_bytes
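
Reviewer note: after this change, ``filters`` and ``compressors`` follow the same three-way rule: ``None`` means no codecs, ``"auto"`` means the configured default, and anything else is treated as one codec or an iterable of codecs. A minimal standalone sketch of that rule follows. The helper name ``resolve_codecs`` is hypothetical, and codecs are kept as plain dicts instead of being parsed into codec objects.

```python
from collections.abc import Iterable
from typing import Any


def resolve_codecs(value: Any, default: tuple) -> tuple:
    """Resolve a user-supplied filters/compressors argument to a tuple."""
    if value is None:
        # Explicit opt-out: use no codecs at all.
        return ()
    if value == "auto":
        # Fall back to the configured default for this dtype.
        return default
    if isinstance(value, dict):
        # A single codec given as a dict is wrapped into a one-element tuple.
        return (value,)
    if isinstance(value, Iterable):
        # Any other iterable is taken as a sequence of codecs.
        return tuple(value)
    raise TypeError(f"unsupported codec specification: {type(value)}")


default = ({"name": "zstd", "configuration": {"level": 0, "checksum": False}},)
gzip = {"name": "gzip", "configuration": {"level": 5}}

print(resolve_codecs(None, default))
print(resolve_codecs("auto", default))
print(resolve_codecs(gzip, default))
print(resolve_codecs([gzip, gzip], default))
```

The ``None`` check happens once at the top, which is why the duplicated ``elif ... is None`` branches removed in this hunk were unreachable.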


11 changes: 7 additions & 4 deletions src/zarr/core/config.py
@@ -76,17 +76,20 @@ def reset(self) -> None:
"string": [{"id": "vlen-utf8"}],
"bytes": [{"id": "vlen-bytes"}],
},
"v3_default_codecs": {
"v3_default_filters": {"numeric": [], "string": [], "bytes": []},
"v3_default_serializer": {
"numeric": {"name": "bytes", "configuration": {"endian": "little"}},
"string": {"name": "vlen-utf8"},
"bytes": {"name": "vlen-bytes"},
},
"v3_default_compressors": {
"numeric": [
{"name": "bytes", "configuration": {"endian": "little"}},
{"name": "zstd", "configuration": {"level": 0, "checksum": False}},
],
"string": [
{"name": "vlen-utf8"},
{"name": "zstd", "configuration": {"level": 0, "checksum": False}},
],
"bytes": [
{"name": "vlen-bytes"},
{"name": "zstd", "configuration": {"level": 0, "checksum": False}},
],
},