add s3path #291

ryxli · 2024-12-27T00:05:09Z

Description

Provides an implementation of the pathlib ABC with S3 as the backing file store. It builds on the read and write streams that s3connector exposes via the Mountpoint CRT client.

Additional context

This will make it easier to integrate with PyTorch DCP implementations such as Megatron-Core distributed checkpointing:
https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html

I have updated the CHANGELOG or README if appropriate

Related items

Testing

By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.

s3torchconnector/src/s3torchconnector/s3path.py

IsaevIlya

Thank you for your contribution to s3-connector-for-pytorch! Please give us a few days to review.

IsaevIlya

Thank you for your contribution implementing S3Path. To maintain our high quality standards, could you please enhance your changes with:

Integration tests using a real S3 bucket
Error logging for key operations
Docstrings explaining behavior, especially where it differs from standard pathlib
Brief section in README.md with a usage example

Please let me know if you need any clarification or assistance with these additions.

s3torchconnector/src/s3torchconnector/s3path.py

IsaevIlya · 2025-01-07T11:57:04Z

s3torchconnector/src/s3torchconnector/s3path.py

+                mode = stat.S_IFDIR
+            except S3Exception:
+                try:
+                    listobjs = list(self._client.list_objects(self.bucket, self.key))


Could we limit the list_objects output with max_keys=2, since we only need to know that the result is not empty?

Is max_keys=1 enough?

If we know that there are no other objects whose name equals self.key, then yes, it should be enough.
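The trade-off between max_keys=1 and max_keys=2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `FakeClient` and `prefix_has_children` are hypothetical names, and the fake `list_objects` only mimics a paginated listing call that accepts a `max_keys` limit and yields pages of keys in sorted order.

```python
from typing import Iterator, List


class FakeClient:
    """Hypothetical stand-in for an S3 client; not the s3torchconnector API."""

    def __init__(self, keys: List[str]) -> None:
        self._keys = sorted(keys)

    def list_objects(
        self, bucket: str, prefix: str, max_keys: int = 1000
    ) -> Iterator[List[str]]:
        # Yield pages of at most max_keys matching keys, like a paginated listing.
        matching = [k for k in self._keys if k.startswith(prefix)]
        for start in range(0, len(matching), max_keys):
            yield matching[start:start + max_keys]


def prefix_has_children(client: FakeClient, bucket: str, key: str) -> bool:
    """Return True if some object other than `key` itself starts with `key`."""
    pages = client.list_objects(bucket, key, max_keys=2)
    first_page = next(iter(pages), [])  # fetch one small page only
    # With two keys on the page, an object named exactly `key` cannot hide a child.
    return any(k != key for k in first_page)
```

With a limit of 1, the single returned key could be the object named exactly `key`, which sorts before any `key/...` entry, so the check could not tell whether anything else exists under the prefix. That is the case the reply above is concerned with.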

IsaevIlya · 2025-01-07T13:10:17Z

s3torchconnector/src/s3torchconnector/s3path.py

+                        raise FileNotFoundError(error_msg) from e
+                except S3Exception:
+                    raise FileNotFoundError(error_msg) from e
+        return os.stat_result(


Would it be better to cache results? S3 objects are typically immutable, and stat information is frequently reused across operations, so caching would reduce API calls to S3 and improve performance. We could add TTL-based caching (defaulting to 1 second) and allow users to configure or disable it. Caching could be done at the class level, with a cache size limit to prevent memory issues. That should keep the usage experience the same in general, without the need to configure the cache per instance.

Agree, can add.

One issue I don't have an answer for is usage in SPMD, for example. Since class attributes are isolated per process, without inter-process communication there is still some inefficiency if several ranks call stat on the same file.

Could we start by caching stat per instance only and passing the caching configuration through the constructor? I don't know the usage patterns for Path-like primitives; would it be too cumbersome to turn off the cache via the constructor? If we generally create new S3Path instances from other S3Path instances, then we can just pass the cache settings between them.
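A per-instance TTL cache along the lines discussed above might look like the sketch below. Everything here is an assumption for illustration: `CachedStat`, its `fetch` callable, and the `ttl` and `clock` parameters are hypothetical and are not part of the PR. Passing `ttl=None` stands in for "caching disabled via the constructor".

```python
import time
from typing import Callable, Optional, Tuple


class CachedStat:
    """Hypothetical per-instance TTL cache around an expensive stat lookup."""

    def __init__(
        self,
        fetch: Callable[[], object],
        ttl: Optional[float] = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # performs the real (remote) stat call
        self._ttl = ttl      # lifetime in seconds; None disables caching
        self._clock = clock  # injectable so tests do not need to sleep
        self._entry: Optional[Tuple[float, object]] = None

    def get(self) -> object:
        now = self._clock()
        if self._ttl is not None and self._entry is not None:
            stored_at, value = self._entry
            if now - stored_at < self._ttl:
                return value  # still fresh: no remote call
        value = self._fetch()
        self._entry = (now, value)
        return value
```

A path class that derives new instances from existing ones could forward the same `ttl` value to each child, which is one way to read the suggestion about passing cache settings between instances. This sketch does not address the SPMD concern, since each process would still hold its own cache.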

IsaevIlya · 2025-01-07T13:27:26Z

s3torchconnector/src/s3torchconnector/s3path.py

+
+    def rmdir(self):
+        try:
+            next(self.iterdir())


Should we introduce a method that checks whether the path is empty? As mentioned for the stat method, we can limit the output of list_objects with max_keys=2 to improve performance. To confirm that the object is a directory, we can call is_dir directly in that method before checking whether it is empty.
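The order of checks proposed above (first is_dir, then an emptiness check that looks at no more than two keys) can be sketched against an in-memory store. `FakeS3Dir` and its `is_empty` method are hypothetical, the set of keys stands in for a bucket, and the `[:2]` slice stands in for a listing limited with max_keys=2. The "is not an s3 folder" wording is borrowed from the unit test quoted later in this thread.

```python
import errno
from typing import Set


class FakeS3Dir:
    """Hypothetical in-memory model of an S3 'folder' (a key prefix ending in '/')."""

    def __init__(self, store: Set[str], prefix: str) -> None:
        self._store = store
        self._prefix = prefix.rstrip("/") + "/"

    def is_dir(self) -> bool:
        # A folder exists if its marker object or any key under the prefix exists.
        return any(k.startswith(self._prefix) for k in self._store)

    def is_empty(self) -> bool:
        # Look at no more than two keys: the folder marker itself plus one child.
        under = [k for k in sorted(self._store) if k.startswith(self._prefix)][:2]
        return all(k == self._prefix for k in under)

    def rmdir(self) -> None:
        if not self.is_dir():
            raise NotADirectoryError(f"{self._prefix} is not an s3 folder")
        if not self.is_empty():
            raise OSError(errno.ENOTEMPTY, "Directory not empty", self._prefix)
        self._store.discard(self._prefix)  # remove the folder marker object
```

Separating `is_empty` from `rmdir` keeps the listing limit in one place, so the same helper could be reused wherever only "is there anything under this prefix?" matters.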

IsaevIlya · 2025-01-07T14:18:31Z

s3torchconnector/src/s3torchconnector/s3path.py

+        split = self.parser.split
+        if split(name)[0]:
+            raise ValueError(f"Invalid name {name!r}")
+        return self.with_segments(split(self._raw_path)[0], name)


Could we replace split(self._raw_path)[0] with self.parent to improve readability?
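To see how the two spellings compare, the sketch below uses the standard library's `posixpath.split` and `PurePosixPath` as stand-ins. It assumes that S3Path's parser behaves like `posixpath`, which the PR thread does not confirm; the helper names are hypothetical.

```python
import posixpath
from pathlib import PurePosixPath


def sibling_via_split(raw_path: str, name: str) -> PurePosixPath:
    # Mirrors the diff: take the head of the raw string and append the new name.
    return PurePosixPath(posixpath.split(raw_path)[0], name)


def sibling_via_parent(raw_path: str, name: str) -> PurePosixPath:
    # Mirrors the suggestion: let the path object compute its own parent.
    return PurePosixPath(PurePosixPath(raw_path).parent, name)


for raw in ("bucket/dir/file.txt", "file.txt", "/bucket/file.txt"):
    assert sibling_via_split(raw, "new.txt") == sibling_via_parent(raw, "new.txt")
```

With these stand-ins, the two forms agree for ordinary paths. One case where they differ is a raw path with a trailing slash such as "bucket/dir/": `posixpath.split` returns "bucket/dir" as the head, while `PurePosixPath("bucket/dir/").parent` is "bucket". Whether that matters for S3Path depends on how it normalizes folder keys.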

s3torchconnector/src/s3torchconnector/s3path.py

s3torchconnector/tst/unit/test_s3path.py

IsaevIlya · 2025-01-08T15:21:30Z

s3torchconnector/tst/unit/test_s3path.py

+    empty_folder.mkdir(parents=True, exist_ok=True)
+    empty_folder.rmdir()
+    with pytest.raises(NotADirectoryError, match=f"{empty_folder} is not an s3 folder"):
+        time.sleep(1)  # S3 needs some time to register the deletion


In unit tests we are not communicating with S3. Wouldn't it work without sleep in that case?

I also assumed the same, but it seemed like tests wouldn't pass without the sleep.

Maybe some limitation of the MockS3Client?

Interesting. I will take a look into it.

IsaevIlya · 2025-01-22T13:05:58Z

s3torchconnector/tst/unit/test_s3path.py

+    s3_path._client.add_object("test-key/test_file.txt", b"")
+    assert file.exists()
+    file.unlink()
+    time.sleep(1)  # S3 needs some time to register the deletion


Same question here: do we need the sleep?

I also assumed the same, but it seemed like tests wouldn't pass without the sleep.

Maybe some limitation of the MockS3Client?

ryxli requested a review from a team as a code owner December 27, 2024 00:05

ryxli had a problem deploying to integration-tests December 27, 2024 00:05 — with GitHub Actions Failure

ryxli commented Jan 3, 2025

IsaevIlya reviewed Jan 6, 2025

IsaevIlya requested changes Jan 7, 2025

ryxli force-pushed the pr_s3path branch from 2757045 to dd79133 Compare January 7, 2025 21:40

ryxli had a problem deploying to integration-tests January 7, 2025 21:40 — with GitHub Actions Failure

ryxli had a problem deploying to integration-tests January 7, 2025 21:41 — with GitHub Actions Failure

add s3path

9cded51

ryxli force-pushed the pr_s3path branch from dd79133 to 9cded51 Compare January 7, 2025 23:02

ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure

IsaevIlya reviewed Jan 22, 2025
