Skip to content

add s3path #291

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

add s3path #291

wants to merge 1 commit into from

Conversation

ryxli
Copy link
Contributor

@ryxli ryxli commented Dec 27, 2024

Description

Provide implementation of Pathlib abc with S3 as the backing file store. Utilizes how s3connector exposes read and write streams via Mountpoint CRT client.

Additional context

This will be utilized to easily integrate with Pytorch DCP implementations such as Megatron-Core distributed checkpointing
https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html

  • I have updated the CHANGELOG or README if appropriate

Related items

Testing


By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.

@ryxli ryxli requested a review from a team as a code owner December 27, 2024 00:05
Copy link
Contributor

@IsaevIlya IsaevIlya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution to s3-connector-for-pytorch! Please, give us a few days for a review.

Copy link
Contributor

@IsaevIlya IsaevIlya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution implementing S3Path. To maintain our high quality standards, could you please enhance your changes with:

  1. Integration tests using a real S3 bucket
  2. Error logging for key operations
  3. Docstrings explaining behavior, especially where it differs from standard pathlib
  4. Brief section in README.md with a usage example

Please let me know if you need any clarification or assistance with these additions.

mode = stat.S_IFDIR
except S3Exception:
try:
listobjs = list(self._client.list_objects(self.bucket, self.key))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we limit list_objects output with max_keys=2 as we only need to know that result is not empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is max_keys=1 enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know that there are no other objects with the name equals to self.key, then yes it should be enough.

raise FileNotFoundError(error_msg) from e
except S3Exception:
raise FileNotFoundError(error_msg) from e
return os.stat_result(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to cache results since S3 objects are typically immutable and stat information is frequently reused across different operations. This would help reduce API calls to S3 and improve performance. We can add TTL-based caching (default to 1 second) and allow users to configure or disable caching. Caching could be done on class-level with including cache size to prevent memory issues. That should keep usage experience the same in general without need to configure cache per instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, can add.

One issue here that I don't have an answer for is usage for example in SPMD. since class attributes are isolated per process, without communication there is still some ineffiency there if some ranks call stat on the same file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we start with caching stat per instance only and passing caching configuration through constructor? I don't know pattern of using Path like primitives, would it be too cumbersome to turn-off cache via constructor? If we generally create new instances of S3Path from another instances of S3Path, then we can just pass the cache settings between them.


def rmdir(self):
try:
next(self.iterdir())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we introduce method that check if the path is empty? As mentioned for stat method we can limit output of list_objects with max_keys=2 to improve performance. To confirm that object is directory, we can call is_dir directly in that method, before checking if it is empty.

split = self.parser.split
if split(name)[0]:
raise ValueError(f"Invalid name {name!r}")
return self.with_segments(split(self._raw_path)[0], name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we replace split(self._raw_path)[0] wit self.parent for improving readability?

@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 21:40 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 21:41 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 21:41 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 21:41 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 21:41 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 21:41 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 21:41 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
@ryxli ryxli had a problem deploying to integration-tests January 7, 2025 23:02 — with GitHub Actions Failure
empty_folder.mkdir(parents=True, exist_ok=True)
empty_folder.rmdir()
with pytest.raises(NotADirectoryError, match=f"{empty_folder} is not an s3 folder"):
time.sleep(1) # S3 needs some time to register the deletion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In unit tests we are not communicating with S3. Wouldn't it work without sleep in that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also assumed the same, but it seemed like tests wouldn't pass without the sleep.

Maybe some limitation of the MockS3Client?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I will take a look into it.

s3_path._client.add_object("test-key/test_file.txt", b"")
assert file.exists()
file.unlink()
time.sleep(1) # S3 needs some time to register the deletion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same question, if we need sleep here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also assumed the same, but it seemed like tests wouldn't pass without the sleep.

Maybe some limitation of the MockS3Client?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants