-
Notifications
You must be signed in to change notification settings - Fork 46
feat(stream_io): Improve object storage config handling #189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sangstar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Test cases passing on my end.
Some systems have a write size limit of around 2^31 - 1, so this limits the size of any os.pwrite call to be less than that on platforms other than Linux.
|
|
||
| def _is_caios(endpoint: str) -> bool: | ||
| host = urlparse(endpoint if "//" in endpoint else "//" + endpoint).netloc | ||
| return host.lower() in {"cwobject.com", "cwlota.com"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really not a fan of us hardcoding these like this (or the ord1 defaults we have had elsewhere for some time). But can be convinced this is ok.
For reference to those other hardcoded values that should be either A) dropped, or B) changed to our current object storage offering.
default_s3_read_endpoint = "accel-object.ord1.coreweave.com"
default_s3_write_endpoint = "object.ord1.coreweave.com"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure in boto3's source tree there is a large data segment that has these sorts of hardcoded compatibility checks for a huge set of different AWS endpoints, which is how Boto3's "magically pick the right one" feature works. Unless/until we upstream a change to support our object storage endpoints correctly, which I'm not sure they'd accept, this seems to me like a fair place to put a workaround.
The other hardcoded values are necessary for the default s3://tensorized bucket to "just work" since they're in that location. I'd like to change it to the newer endpoint in release 3.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a pattern already used in boto3 then probably fine... although we are not automatically changing the force_http based on if we are using cwlota.com or not which would fall into the same pattern as what we are doing here. Same points apply on both sides there as well as if a user configures it correctly it will work, but by default with a minimal amount of configuring it will not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am nervous about defaulting force_http=True for the cwlota.com endpoint on the basis that it could transmit a presigned URL as part of a request over an unknown network in cleartext if it is used somewhere where cwlota.com doesn't resolve to a local network (so not a CW datacenter, maybe from someone's own machine where they're trying to test something), although I think the presigned URL would only be valid for use by someone who can access cwlota.com endpoints. I'd rather leave that one specifically as something that needs to be declared explicitly in the config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as it works correctly when http://cwlota.com is in the config or environment variables as the endpoint then I don't strongly think it needs to have something explicitly forcing it here.
sangstar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes confirmed as working on my end once again, LGTM. Will defer to Andrew and his review comments.
Endpoints parsed from Boto3-style configuration files or environment variables now use either HTTP or HTTPS depending on the scheme present on the endpoint.
|
This now allows using HTTP instead of HTTPS when using an endpoint parsed from a Boto3 config file or its environment variables. Please take another look. (There have also a couple bug fixes added unrelated to the original scope of this PR; don't mind those too much—they work.) |
sangstar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another review and another approval. :)
LGTM and working on my end. Not opinionated on the HTTP vs HTTPS default handling and the implementation looks good so I will defer that once again to Andrew.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this is behaving like I expect it to, granted not 100% validated e2e.
Object storage config handling improvements & improved CAIOS support
These changes improve the compatibility of the
stream_io.open_stream()object storage feature with different types of object storage configurations. The changes are summarized in the additions toCHANGELOG.md, duplicated below.This also fixes a bug with very large writes on some non-Linux platforms, and resolves an incomplete fix for
open_streamobject storage uploads from #72.Added
stream_io.open_stream()now respects Boto3's configuration files and environment variables when searching for object storage credentials to useFixed
stream_io.open_stream()now uses virtual-hosted-style bucket addressing for thecwobject.comandcwlota.comendpointsstream_io.open_stream()now allows theuse_httpsentry of.s3cfgconfiguration files to fill in itsforce_httpparameter ifforce_httpis not explicitly specified asTrueorFalseTensorSerializerno longer throws an error when attempting to serialize very large tensors on some non-Linux platformsstream_io.open_stream()now finalize correctly on Python 3.12+ even without an explicit call to theirclose()method