-
-
Notifications
You must be signed in to change notification settings - Fork 368
feat: change array creation signature to allow sharding specification [do not merge] #2169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| _codecs = tuple(codecs) if codecs is not None else (BytesCodec(),) | ||
|
|
||
| if shard_shape is not None: | ||
| _codecs = (ShardingCodec(chunk_shape=shard_shape, codecs=_codecs),) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This effectively hard-codes sharding into the spec, something like a sharded=True flag that might have existed on the CHunkSpec object. How do you expect this to extend to variable chunking or other schemes that might be created in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proposal can use whatever specification for variable length chunks we come up with, e.g. tuples of tuples of ints. You could specify variable length chunking with no sharding via something like chunks = {'write_shape': ((10,5), (1,2,3)}, and variable length chunking with sharding via something like chunks = {'write_shape: ((10,5), (1,2,3)), 'read_shape': (1,1)}. The read shape would have to checked for consistency with all the unique chunk shapes in this case. We would of course need to widen the type of ChunkSpec for this to accept tuple[tuple[int, ...]] for the write_shape keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like a sharded=True flag that might have existed on the CHunkSpec object.
If we had such a flag on the chunkspec object, then it would semantically collide with read_shape: {'write_shape': (10,10), 'read_shape': (2,2), sharding: False} would not be valid, because there's no way to have read_shape and write_shape differ without sharding. BTW when I say "sharding" i don't mean "the sharding codec", I mean the general concept of packing multiple subchunks into a single file. If a non-codec implementation of sharding emerges, then I would like to imagine that this API could wrap that.
|
I guess this was superseded by #2463? |
|
yes this is very superseded. it can be closed. |
The goal of this PR is to demonstrate one strategy to simplify the creation of arrays that use sharding. Don't consider merging this until we get a good look at some alternatives.
This PR alters the
Array.createroutine, removing thechunk_shapekwarg and instead beefing up the semantics of thechunkskwarg. Specifically, thechunkskwarg supports a new variant,ChunkSpec, which aims to compactly specify both the chunk shape of an array as well as the (optional) sub-chunk shape.ChunkSpecis a typed dictionary with two keys:read_shapeandwrite_shape.write_shapespecifies the shape of array chunks that can be written concurrently, i.e. the shape in array coordinates of the chunk files.read_shapespecifies the shape of array chunks that can be read concurrently, i.e. the shape in array coordinates of the sub-chunks contained in a chunk constructed with a sharding codec.chunks = Noneorchunks = {}(we support the latter case because of how non-total typeddicts work) toArray.createwill automatically specify chunks using old v2 logic.chunks = {'write_shape': (20, 20)}ORchunks = {'read_shape': (20, 20)}toArray.createwill configure that array with no sharding and a chunk size of (20,20).chunks = {'write_shape': (20, 20), 'read_shape': (10,10)}toArray.createwill configure that array with sharding, with a sub-chunk size of (10,10), and a chunk size of (20,20). This will also route all the of the user-specifiedcodecs, if any, to the sharding codec.Note that this PR does not change the signature of the array class itself. That would be a separate effort.
addresses #2170
TODO: