-
-
Notifications
You must be signed in to change notification settings - Fork 344
Fix race condition in from_array for arrays with shards #3217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3217 +/- ##
==========================================
- Coverage 94.76% 94.76% -0.01%
==========================================
Files 78 78
Lines 8642 8648 +6
==========================================
+ Hits 8190 8195 +5
- Misses 452 453 +1
🚀 New features to boost your workflow:
|
@@ -1265,13 +1287,19 @@ def _iter_chunk_keys( | |||
yield self.metadata.encode_chunk_key(k) | |||
|
|||
def _iter_chunk_regions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like _iter_chunk_regions
should only iterate over the regions spanned by each chunk. Otherwise the name doesn't fit. So adding a flag to this function that makes it do something different (iterate over the regions spanned by each shard) seems worse than implementing a new _iter_shard_regions
method, that does exactly what its name suggests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my general POV is that several well-defined functions is better than a smaller number of functions that try to do a lot. since this is private API, adding functions is cheap, so lets create new functions instead of adding functionality to existing ones in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. "private API" might be the magic word there 😂 There are no other users of _iter_chunk_regions
(ignoring the a few unit tests); so- may I directly rename the function to _iter_shard_regions
in this case? 😁
Fixes #3169.
When passing data via
create_array(data = ...)
, the data is inserted byfrom_array
after getting split by chunks. In current main, when using shards,from_array
ends up splitting the data by sub-chunks (due toAsyncArray.chunks
returningMetadata.chunks
which returns the size of the chunks inside shards instead of the size of the "physical" chunk files), which ends up creating a race condition when the writes to multiple sub-chunk end up writing to the same physical chunk file.This PR changes
from_array
to instead split the data by shard in case there are shards.(Probably, there needs to be a lock somewhere in AsyncArray or ShardCodec which prevents multiple writes to the same shard (or, just physical chunk in general) to execute at the same time)
TODO:
docs/user-guide/*.rst
changes/