-
Notifications
You must be signed in to change notification settings - Fork 24
add s3path #291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add s3path #291
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution to s3-connector-for-pytorch! Please, give us a few days for a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution implementing S3Path. To maintain our high quality standards, could you please enhance your changes with:
- Integration tests using a real S3 bucket
- Error logging for key operations
- Docstrings explaining behavior, especially where it differs from standard pathlib
- Brief section in README.md with a usage example
Please let me know if you need any clarification or assistance with these additions.
mode = stat.S_IFDIR | ||
except S3Exception: | ||
try: | ||
listobjs = list(self._client.list_objects(self.bucket, self.key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we limit list_objects
output with max_keys=2
as we only need to know that result is not empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is max_keys=1 enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we know that there are no other objects with the name equals to self.key
, then yes it should be enough.
raise FileNotFoundError(error_msg) from e | ||
except S3Exception: | ||
raise FileNotFoundError(error_msg) from e | ||
return os.stat_result( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to cache results since S3 objects are typically immutable and stat
information is frequently reused across different operations. This would help reduce API calls to S3 and improve performance. We can add TTL-based caching (default to 1 second) and allow users to configure or disable caching. Caching could be done on class-level with including cache size to prevent memory issues. That should keep usage experience the same in general without need to configure cache per instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, can add.
One issue here that I don't have an answer for is usage for example in SPMD. since class attributes are isolated per process, without communication there is still some ineffiency there if some ranks call stat on the same file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we start with caching stat
per instance only and passing caching configuration through constructor? I don't know pattern of using Path like primitives, would it be too cumbersome to turn-off cache via constructor? If we generally create new instances of S3Path from another instances of S3Path, then we can just pass the cache settings between them.
|
||
def rmdir(self): | ||
try: | ||
next(self.iterdir()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we introduce method that check if the path
is empty? As mentioned for stat
method we can limit output of list_objects
with max_keys=2
to improve performance. To confirm that object is directory, we can call is_dir
directly in that method, before checking if it is empty.
split = self.parser.split | ||
if split(name)[0]: | ||
raise ValueError(f"Invalid name {name!r}") | ||
return self.with_segments(split(self._raw_path)[0], name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we replace split(self._raw_path)[0]
wit self.parent
for improving readability?
empty_folder.mkdir(parents=True, exist_ok=True) | ||
empty_folder.rmdir() | ||
with pytest.raises(NotADirectoryError, match=f"{empty_folder} is not an s3 folder"): | ||
time.sleep(1) # S3 needs some time to register the deletion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In unit tests we are not communicating with S3. Wouldn't it work without sleep
in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also assumed the same, but it seemed like tests wouldn't pass without the sleep.
Maybe some limitation of the MockS3Client?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I will take a look into it.
s3_path._client.add_object("test-key/test_file.txt", b"") | ||
assert file.exists() | ||
file.unlink() | ||
time.sleep(1) # S3 needs some time to register the deletion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same question, if we need sleep
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also assumed the same, but it seemed like tests wouldn't pass without the sleep.
Maybe some limitation of the MockS3Client?
Description
Provide implementation of Pathlib abc with S3 as the backing file store. Utilizes how s3connector exposes read and write streams via Mountpoint CRT client.
Additional context
This will be utilized to easily integrate with Pytorch DCP implementations such as Megatron-Core distributed checkpointing
https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html
Related items
Testing
By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.