-
Notifications
You must be signed in to change notification settings - Fork 83
feat!(clp-package): Add support for providing multiple S3 object keys as input for compression. #1383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
args_parser.add_argument( | ||
"--s3-single-prefix", | ||
action="store_true", | ||
help=( | ||
"Treat the S3 URL as a single prefix. If set, only a single S3 URL should be provided" | ||
" and it must be explicitly given as a positional argument." | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check whether this flag name makes sense @kirkrodrigues @junhaoliao
|
||
class S3InputConfig(S3Config): | ||
type: Literal[InputType.S3.value] = InputType.S3.value | ||
keys: Optional[List[str]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check whether passing keys in this way makes sense @kirkrodrigues @junhaoliao.
In eithere key ingestion or prefix ingestion, the key prefix will be required so S3Config
base is untouched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I haven't taken a very careful look at the implementation. sorry if i misunderstood anything.)
In what case do we expect the keys
to be None
and when not? I'm guessing that
1. Non-None and Non-empty: When the input config is created by an S3 scanner, or (to be implemented, ) when the user provides multiple s3 keys when submitting a compression job.
2. None: when the user only provides a single s3 prefix when submitting a compression job
Since this is a list where a length of 0 is allowed, do you think we can make it non-optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. But actually length of 0 is not allowed. With Pydantic, shall we make a validator?
Description
Before this PR, the CLP package compression only accepted a single S3 URL as the input. The S3 URL is decomposed into a bucket, a region code, and a key prefix. Everything under the key prefix will be compressed.
This PR adds support for providing multiple S3 object keys as input for compression. Now, users can specify a list of S3 URLs, where each URL is treated as a URL to an actual key in the bucket. Key-prefix-based ingestion is still supported, however, users must explicitly specify that the input is a prefix by using
--s3-single-prefix
option.The current implementation has the following requirements for the given input URLs:
To support multi-keys, we update
S3InputConfig
to store an optionalkeys
field. If this field is not set, prefix-based ingestion will be used as before. Otherwise, we will traverse the bucket to collect object metadata of the given keys to create compression jobs.Checklist
breaking change.
Validation performed