|
3 | 3 | Zarr Storage Specification Version 1
|
4 | 4 | ====================================
|
5 | 5 |
|
6 |
| -This document provides a technical specification of the protocol and |
7 |
| -format used for storing a Zarr array. The key words "MUST", "MUST |
8 |
| -NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", |
9 |
| -"RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be |
10 |
| -interpreted as described in `RFC 2119 |
11 |
| -<https://www.ietf.org/rfc/rfc2119.txt>`_. |
12 |
| - |
13 |
| -Status |
14 |
| ------- |
15 |
| - |
16 |
| -This specification is deprecated. See :ref:`spec` for the latest version. |
17 |
| - |
18 |
| -Storage |
19 |
| -------- |
20 |
| - |
21 |
| -A Zarr array can be stored in any storage system that provides a |
22 |
| -key/value interface, where a key is an ASCII string and a value is an |
23 |
| -arbitrary sequence of bytes, and the supported operations are read |
24 |
| -(get the sequence of bytes associated with a given key), write (set |
25 |
| -the sequence of bytes associated with a given key) and delete (remove |
26 |
| -a key/value pair). |
27 |
| - |
28 |
| -For example, a directory in a file system can provide this interface, |
29 |
| -where keys are file names, values are file contents, and files can be |
30 |
| -read, written or deleted via the operating system. Equally, an S3 |
31 |
| -bucket can provide this interface, where keys are resource names, |
32 |
| -values are resource contents, and resources can be read, written or |
33 |
| -deleted via HTTP. |
34 |
| - |
35 |
| -Below, an "array store" refers to any system implementing this |
36 |
| -interface. |
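As a rough illustration of the key/value interface described above, the sketch below models an array store as a thin wrapper around a Python dict. The class name `MemoryStore` and its method names are assumptions made for this example; the specification only requires that read, write and delete operations exist.

```python
class MemoryStore:
    """Hypothetical in-memory array store: ASCII string keys, bytes values."""

    def __init__(self):
        self._items = {}

    def read(self, key):
        # Get the sequence of bytes associated with a given key.
        return self._items[key]

    def write(self, key, value):
        # Keys are ASCII strings; this raises UnicodeEncodeError otherwise.
        key.encode("ascii")
        # Values are arbitrary sequences of bytes.
        self._items[key] = bytes(value)

    def delete(self, key):
        # Remove a key/value pair.
        del self._items[key]


store = MemoryStore()
store.write("meta", b"{}")
print(store.read("meta"))
store.delete("meta")
```

A directory on a file system or an S3 bucket could sit behind the same three methods.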
37 |
| - |
38 |
| -Metadata |
39 |
| --------- |
40 |
| - |
41 |
| -Each array requires essential configuration metadata to be stored, |
42 |
| -enabling correct interpretation of the stored data. This metadata is |
43 |
| -encoded using JSON and stored as the value of the 'meta' key within an |
44 |
| -array store. |
45 |
| - |
46 |
| -The metadata resource is a JSON object. The following keys MUST be |
47 |
| -present within the object: |
48 |
| - |
49 |
| -zarr_format |
50 |
| - An integer defining the version of the storage specification to which the |
51 |
| - array store adheres. |
52 |
| -shape |
53 |
| - A list of integers defining the length of each dimension of the array. |
54 |
| -chunks |
55 |
| - A list of integers defining the length of each dimension of a chunk of the |
56 |
| - array. Note that all chunks within a Zarr array have the same shape. |
57 |
| -dtype |
58 |
| - A string or list defining a valid data type for the array. See also |
59 |
| - the subsection below on data type encoding. |
60 |
| -compression |
61 |
| - A string identifying the primary compression library used to compress |
62 |
| - each chunk of the array. |
63 |
| -compression_opts |
64 |
| - An integer, string or dictionary providing options to the primary |
65 |
| - compression library. |
66 |
| -fill_value |
67 |
| - A scalar value providing the default value to use for uninitialized |
68 |
| - portions of the array. |
69 |
| -order |
70 |
| - Either 'C' or 'F', defining the layout of bytes within each chunk of the |
71 |
| - array. 'C' means row-major order, i.e., the last dimension varies fastest; |
72 |
| - 'F' means column-major order, i.e., the first dimension varies fastest. |
73 |
| - |
74 |
| -Other keys MAY be present within the metadata object; however, they MUST |
75 |
| -NOT alter the interpretation of the required fields defined above. |
76 |
| - |
77 |
| -For example, the JSON object below defines a 2-dimensional array of |
78 |
| -64-bit little-endian floating point numbers with 10000 rows and 10000 |
79 |
| -columns, divided into chunks of 1000 rows and 1000 columns (so there |
80 |
| -will be 100 chunks in total arranged in a 10 by 10 grid). Within each |
81 |
| -chunk the data are laid out in C contiguous order, and each chunk is |
82 |
| -compressed using the Blosc compression library:: |
83 |
| - |
84 |
| - { |
85 |
| - "chunks": [ |
86 |
| - 1000, |
87 |
| - 1000 |
88 |
| - ], |
89 |
| - "compression": "blosc", |
90 |
| - "compression_opts": { |
91 |
| - "clevel": 5, |
92 |
| - "cname": "lz4", |
93 |
| - "shuffle": 1 |
94 |
| - }, |
95 |
| - "dtype": "<f8", |
96 |
| - "fill_value": null, |
97 |
| - "order": "C", |
98 |
| - "shape": [ |
99 |
| - 10000, |
100 |
| - 10000 |
101 |
| - ], |
102 |
| - "zarr_format": 1 |
103 |
| - } |
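The required keys listed above can be checked mechanically. The following is a minimal sketch, not a reference implementation: the function name `check_metadata` is hypothetical, and the check that `shape` and `chunks` have the same number of dimensions is an assumption inferred from the field descriptions.

```python
import json

# The eight keys that MUST be present in the metadata object.
REQUIRED_KEYS = {
    "zarr_format", "shape", "chunks", "dtype",
    "compression", "compression_opts", "fill_value", "order",
}


def check_metadata(text):
    """Parse the value of the 'meta' key and check the required fields."""
    meta = json.loads(text)
    if not isinstance(meta, dict):
        raise ValueError("metadata must be a JSON object")
    missing = REQUIRED_KEYS - set(meta)
    if missing:
        raise ValueError("missing required keys: %s" % sorted(missing))
    if meta["order"] not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")
    # Assumed: one chunk length per array dimension.
    if len(meta["shape"]) != len(meta["chunks"]):
        raise ValueError("shape and chunks differ in number of dimensions")
    return meta


example = """
{
    "chunks": [1000, 1000],
    "compression": "blosc",
    "compression_opts": {"clevel": 5, "cname": "lz4", "shuffle": 1},
    "dtype": "<f8",
    "fill_value": null,
    "order": "C",
    "shape": [10000, 10000],
    "zarr_format": 1
}
"""
meta = check_metadata(example)
print(meta["shape"], meta["chunks"])
```

Extra keys are accepted without complaint, since other keys MAY be present.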
104 |
| - |
105 |
| -Data type encoding |
106 |
| -~~~~~~~~~~~~~~~~~~ |
107 |
| - |
108 |
| -Simple data types are encoded within the array metadata resource as a |
109 |
| -string, following the `NumPy array protocol type string (typestr) |
110 |
| -format |
111 |
| -<numpy:arrays.interface>`_. The |
112 |
| -format consists of 3 parts: a character describing the byteorder of |
113 |
| -the data (``<``: little-endian, ``>``: big-endian, ``|``: |
114 |
| -not-relevant), a character code giving the basic type of the array, |
115 |
| -and an integer providing the number of bytes the type uses. The byte |
116 |
| -order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and |
117 |
| -``"|S12"`` are valid data types. |
118 |
| - |
119 |
| -Structured data types (i.e., with multiple named fields) are encoded as |
120 |
| -a list of two-element lists, following `NumPy array protocol type |
121 |
| -descriptions (descr) |
122 |
| -<numpy:arrays.interface>`_. |
123 |
| -For example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", |
124 |
| -"|u1"]]`` defines a data type composed of three single-byte unsigned |
125 |
| -integers labelled 'r', 'g' and 'b'. |
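To make the two encodings concrete, here is a small pure-Python sketch that splits a type string into its three parts and handles the list form used for structured types. The helper names `parse_typestr` and `parse_dtype` are hypothetical, and the sketch follows the description above in treating the trailing integer as a size in bytes.

```python
import json


def parse_typestr(typestr):
    """Split a simple type string into (byte order, type code, size)."""
    byteorder, kind, size = typestr[0], typestr[1], typestr[2:]
    if byteorder not in ("<", ">", "|"):
        # The byte order MUST be specified.
        raise ValueError("missing byte order in %r" % typestr)
    return byteorder, kind, int(size)


def parse_dtype(dtype):
    """Handle both a simple string and a list of [name, typestr] pairs."""
    if isinstance(dtype, str):
        return parse_typestr(dtype)
    return [(name, parse_typestr(typestr)) for name, typestr in dtype]


print(parse_typestr("<f8"))
rgb = json.loads('[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]')
fields = parse_dtype(rgb)
# Total size of one structured item, in bytes.
itemsize = sum(size for _, (_, _, size) in fields)
print(fields, itemsize)
```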
126 |
| - |
127 |
| -Chunks |
128 |
| ------- |
129 |
| - |
130 |
| -Each chunk of the array is compressed by passing the raw bytes for the |
131 |
| -chunk through the primary compression library to obtain a new sequence |
132 |
| -of bytes comprising the compressed chunk data. No header is added to |
133 |
| -the compressed bytes, and no other modification is made. The internal |
134 |
| -structure of the compressed bytes will depend on which primary |
135 |
| -compressor was used. For example, the `Blosc compressor |
136 |
| -<https://github.com/Blosc/c-blosc/blob/main/README_HEADER.rst>`_ |
137 |
| -produces a sequence of bytes that begins with a 16-byte header |
138 |
| -followed by compressed data. |
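The compression step can be sketched with the standard-library `zlib` module standing in for the primary compression library. The function names are hypothetical, the chunk is assumed to hold `<i4` values already flattened in C order, and `level` plays the role of `compression_opts`.

```python
import struct
import zlib


def encode_chunk(values, level=1):
    """Pack little-endian 32-bit integers and compress them with zlib."""
    raw = struct.pack("<%di" % len(values), *values)
    # The compressor output is stored as-is: no extra header is added.
    return zlib.compress(raw, level)


def decode_chunk(data):
    """Reverse encode_chunk: decompress, then unpack the raw bytes."""
    raw = zlib.decompress(data)
    return list(struct.unpack("<%di" % (len(raw) // 4), raw))


compressed = encode_chunk([1] * 100)
print(len(compressed), decode_chunk(compressed)[:5])
```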
139 |
| - |
140 |
| -The compressed sequence of bytes for each chunk is stored under a key |
141 |
| -formed from the index of the chunk within the grid of chunks |
142 |
| -representing the array. To form a string key for a chunk, the indices |
143 |
| -are converted to strings and concatenated with the period character |
144 |
| -('.') separating each index. For example, given an array with shape |
145 |
| -(10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks |
146 |
| -laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides |
147 |
| -data for rows 0-999 and columns 0-999 and is stored under the key |
148 |
| -'0.0'; the chunk with indices (2, 4) provides data for rows 2000-2999 |
149 |
| -and columns 4000-4999 and is stored under the key '2.4'; etc. |
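The key construction described above amounts to a one-line join. In this sketch the helper names `chunk_key` and `chunk_extent` are hypothetical; `chunk_extent` returns half-open (start, stop) ranges, so rows 2000-2999 appear as (2000, 3000).

```python
def chunk_key(indices):
    """Join chunk grid indices with '.' to form the store key."""
    return ".".join(str(i) for i in indices)


def chunk_extent(indices, chunks):
    """Half-open (start, stop) range covered by a chunk in each dimension."""
    return [(i * c, (i + 1) * c) for i, c in zip(indices, chunks)]


print(chunk_key((0, 0)))
print(chunk_key((2, 4)), chunk_extent((2, 4), (1000, 1000)))
```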
150 |
| - |
151 |
| -There is no need for all chunks to be present within an array |
152 |
| -store. If a chunk is not present then it is considered to be in an |
153 |
| -uninitialized state. An uninitialized chunk MUST be treated as if it |
154 |
| -were uniformly filled with the value of the 'fill_value' field in the |
155 |
| -array metadata. If the 'fill_value' field is ``null`` then the |
156 |
| -contents of the chunk are undefined. |
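A reader might handle a missing chunk along the following lines. This is a sketch under several assumptions: the store is a plain dict, chunks hold zlib-compressed `<i4` values, and the function name `read_chunk` is hypothetical. When the fill value is ``null`` (``None`` in Python) the contents are undefined, so the sketch simply repeats ``None``.

```python
import struct
import zlib


def read_chunk(store, key, n_items, fill_value):
    """Return the items of one chunk, or the fill value if it is missing."""
    if key in store:
        raw = zlib.decompress(store[key])
        return list(struct.unpack("<%di" % n_items, raw))
    # Key absent: the chunk is uninitialized, so behave as if it were
    # uniformly filled with fill_value.
    return [fill_value] * n_items


store = {"0.0": zlib.compress(struct.pack("<4i", 1, 1, 1, 1))}
print(read_chunk(store, "0.0", 4, 42))
print(read_chunk(store, "0.1", 4, 42))
```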
157 |
| - |
158 |
| -Note that all chunks in an array have the same shape. If the length of |
159 |
| -any array dimension is not exactly divisible by the length of the |
160 |
| -corresponding chunk dimension then some chunks will overhang the edge |
161 |
| -of the array. The contents of any chunk region falling outside the |
162 |
| -array are undefined. |
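The size of the chunk grid, including any overhanging chunks, follows from a ceiling division per dimension. The helper names below are hypothetical, and the (25, 20) array is an invented example rather than one taken from the specification.

```python
def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple((s + c - 1) // c for s, c in zip(shape, chunks))


def overhang(shape, chunks):
    """How far the last chunk extends past the array edge, per dimension."""
    grid = chunk_grid(shape, chunks)
    return tuple(g * c - s for g, c, s in zip(grid, chunks, shape))


print(chunk_grid((10000, 10000), (1000, 1000)))
# 25 rows with chunks of 10 rows need 3 chunks; the last overhangs by 5.
print(chunk_grid((25, 20), (10, 10)), overhang((25, 20), (10, 10)))
```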
163 |
| - |
164 |
| -Attributes |
165 |
| ----------- |
166 |
| - |
167 |
| -Each array can also be associated with custom attributes, which are |
168 |
| -simple key/value items with application-specific meaning. Custom |
169 |
| -attributes are encoded as a JSON object and stored under the 'attrs' |
170 |
| -key within an array store. Even if the attributes are empty, the |
171 |
| -'attrs' key MUST be present within an array store. |
172 |
| - |
173 |
| -For example, the JSON object below encodes three attributes named |
174 |
| -'foo', 'bar' and 'baz':: |
175 |
| - |
176 |
| - { |
177 |
| - "foo": 42, |
178 |
| - "bar": "apples", |
179 |
| - "baz": [1, 2, 3, 4] |
180 |
| - } |
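Writing attributes could look roughly like this. The store is assumed to be a plain dict, the function name `write_attrs` is hypothetical, and the choice of ASCII-encoded, key-sorted JSON is an assumption of this sketch rather than a requirement stated above.

```python
import json


def write_attrs(store, attrs=None):
    """Encode custom attributes as JSON under the 'attrs' key."""
    # The 'attrs' key MUST be present even when there are no attributes,
    # so an empty JSON object is written in that case.
    text = json.dumps(attrs or {}, indent=4, sort_keys=True)
    store["attrs"] = text.encode("ascii")


store = {}
write_attrs(store)
print(store["attrs"])
write_attrs(store, {"foo": 42, "bar": "apples", "baz": [1, 2, 3, 4]})
print(json.loads(store["attrs"]))
```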
181 |
| - |
182 |
| -Example |
183 |
| -------- |
184 |
| - |
185 |
| -Below is an example of storing a Zarr array, using a directory on the |
186 |
| -local file system as storage. |
187 |
| - |
188 |
| -Initialize the store:: |
189 |
| - |
190 |
| - >>> import zarr |
191 |
| - >>> store = zarr.DirectoryStore('example.zarr') |
192 |
| - >>> zarr.init_store(store, shape=(20, 20), chunks=(10, 10), |
193 |
| - ... dtype='i4', fill_value=42, compression='zlib', |
194 |
| - ... compression_opts=1, overwrite=True) |
195 |
| - |
196 |
| -No chunks are initialized yet, so only the 'meta' and 'attrs' keys |
197 |
| -have been set:: |
198 |
| - |
199 |
| - >>> import os |
200 |
| - >>> sorted(os.listdir('example.zarr')) |
201 |
| - ['attrs', 'meta'] |
202 |
| - |
203 |
| -Inspect the array metadata:: |
204 |
| - |
205 |
| - >>> print(open('example.zarr/meta').read()) |
206 |
| - { |
207 |
| - "chunks": [ |
208 |
| - 10, |
209 |
| - 10 |
210 |
| - ], |
211 |
| - "compression": "zlib", |
212 |
| - "compression_opts": 1, |
213 |
| - "dtype": "<i4", |
214 |
| - "fill_value": 42, |
215 |
| - "order": "C", |
216 |
| - "shape": [ |
217 |
| - 20, |
218 |
| - 20 |
219 |
| - ], |
220 |
| - "zarr_format": 1 |
221 |
| - } |
222 |
| - |
223 |
| -Inspect the array attributes:: |
224 |
| - |
225 |
| - >>> print(open('example.zarr/attrs').read()) |
226 |
| - {} |
227 |
| - |
228 |
| -Set some data:: |
229 |
| - |
230 |
| - >>> z = zarr.Array(store) |
231 |
| - >>> z[0:10, 0:10] = 1 |
232 |
| - >>> sorted(os.listdir('example.zarr')) |
233 |
| - ['0.0', 'attrs', 'meta'] |
234 |
| - |
235 |
| -Set some more data:: |
236 |
| - |
237 |
| - >>> z[0:10, 10:20] = 2 |
238 |
| - >>> z[10:20, :] = 3 |
239 |
| - >>> sorted(os.listdir('example.zarr')) |
240 |
| - ['0.0', '0.1', '1.0', '1.1', 'attrs', 'meta'] |
241 |
| - |
242 |
| -Manually decompress a single chunk for illustration:: |
243 |
| - |
244 |
| - >>> import zlib |
245 |
| - >>> b = zlib.decompress(open('example.zarr/0.0', 'rb').read()) |
246 |
| - >>> import numpy as np |
247 |
| - >>> a = np.frombuffer(b, dtype='<i4') |
248 |
| - >>> a |
249 |
| - array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
250 |
| - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
251 |
| - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
252 |
| - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
253 |
| - 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32) |
254 |
| - |
255 |
| -Modify the array attributes:: |
256 |
| - |
257 |
| - >>> z.attrs['foo'] = 42 |
258 |
| - >>> z.attrs['bar'] = 'apples' |
259 |
| - >>> z.attrs['baz'] = [1, 2, 3, 4] |
260 |
| - >>> print(open('example.zarr/attrs').read()) |
261 |
| - { |
262 |
| - "bar": "apples", |
263 |
| - "baz": [ |
264 |
| - 1, |
265 |
| - 2, |
266 |
| - 3, |
267 |
| - 4 |
268 |
| - ], |
269 |
| - "foo": 42 |
270 |
| - } |
| 6 | +The V1 Specification has been migrated to its website → |
| 7 | +https://zarr-specs.readthedocs.io/. |