feat(clp-package): Add support for clp-json to ingest logs from S3. #651
Conversation
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…ss/fs_compression_task.py
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
`91-102`: ⚠️ Potential issue — Use environment variables for AWS credentials.
Passing AWS credentials as command-line arguments is insecure, as they may be visible in process listings and shell history.
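As an illustration (not part of the PR), the environment-variable approach might look like the sketch below; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard variables the AWS SDKs read, while the surrounding error handling is an assumption:

```python
import os

# Prefer the standard AWS SDK environment variables over CLI flags so the
# secrets never appear in `ps` output or shell history.
aws_access_key_id = os.environ.get("AWS_ACCESS_KEY_ID")
aws_secret_access_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
if aws_access_key_id is None or aws_secret_access_key is None:
    raise ValueError(
        "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set in the environment."
    )
```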
🧹 Nitpick comments (2)
components/clp-package-utils/clp_package_utils/scripts/compress.py (2)
`234-242`: Use the `tempfile` module for generating temporary files.
Instead of using a while loop to find an unused filename, use Python's `tempfile` module, which handles this safely. Apply this diff to use `tempfile`:

```diff
+import tempfile

-# Write compression logs to a file
-while True:
-    # Get unused output path
-    container_logs_list_filename = f"{uuid.uuid4()}.txt"
-    container_logs_list_path = clp_config.logs_directory / container_logs_list_filename
-    logs_list_path_on_container = (
-        container_clp_config.logs_directory / container_logs_list_filename
-    )
-    if not container_logs_list_path.exists():
-        break
+with tempfile.NamedTemporaryFile(suffix='.txt', dir=clp_config.logs_directory, delete=False) as temp_file:
+    container_logs_list_path = pathlib.Path(temp_file.name)
+    container_logs_list_filename = container_logs_list_path.name
+    logs_list_path_on_container = container_clp_config.logs_directory / container_logs_list_filename
```
`191-191`: Improve the S3 URL help message.
The help message should indicate that the URL can be a prefix and provide an example format. Apply this diff to improve the help message:

```diff
-    s3_compressor_parser.add_argument("url", metavar="URL", help="URL of objects to be compressed")
+    s3_compressor_parser.add_argument("url", metavar="URL", help="S3 URL of objects to be compressed (e.g., s3://bucket/prefix)")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- components/clp-package-utils/clp_package_utils/scripts/compress.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
`249-251`: Verify the S3 URL format before compression.
Add validation for the S3 URL format to catch invalid URLs early. Run the following script to check S3 URL usage in the codebase:
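The referenced script was not captured in this thread. As an illustration only, early validation might look like the following hypothetical helper (the pattern and the `validate_s3_url` name are assumptions, not the project's actual code, and the bucket-name rules are simplified relative to AWS's full naming constraints):

```python
import re

# Accepts s3://bucket/prefix URLs only; bucket names are constrained to
# 3-63 characters of lowercase letters, digits, dots, and hyphens.
_S3_URL_PATTERN = re.compile(
    r"^s3://(?P<bucket>[a-z0-9][a-z0-9.\-]{1,61}[a-z0-9])/(?P<key_prefix>.*)$"
)


def validate_s3_url(s3_url: str) -> None:
    """Raise early instead of failing later at job-scheduling time."""
    if _S3_URL_PATTERN.match(s3_url) is None:
        raise ValueError(f"Invalid S3 URL: {s3_url}")
```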
```python
default_credentials_user = "default"
aws_access_key_id, aws_secret_access_key = parse_aws_credentials_file(
    pathlib.Path(parsed_args.aws_credentials_file, default_credentials_user)
)
```
Fix incorrect usage of `pathlib.Path` for the AWS credentials file.
The `pathlib.Path` constructor is being used incorrectly: the second argument, `default_credentials_user`, should not be passed to `Path`, as it's not part of the file path.
Apply this diff to fix the issue:
```diff
-default_credentials_user = "default"
 aws_access_key_id, aws_secret_access_key = parse_aws_credentials_file(
-    pathlib.Path(parsed_args.aws_credentials_file, default_credentials_user)
+    pathlib.Path(parsed_args.aws_credentials_file)
 )
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
aws_access_key_id, aws_secret_access_key = parse_aws_credentials_file(
    pathlib.Path(parsed_args.aws_credentials_file)
)
```
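For background on why the profile name belongs in a separate argument rather than the path: AWS credentials files are INI-formatted, with each profile as a section header. A `parse_aws_credentials_file` implementation would plausibly look like the following sketch (the body is an assumption based on the signature discussed above, not the project's actual code):

```python
import configparser
from pathlib import Path
from typing import Tuple


def parse_aws_credentials_file(
    credentials_file_path: Path, user: str = "default"
) -> Tuple[str, str]:
    """Read a profile's key pair from an INI-style AWS credentials file."""
    if not credentials_file_path.exists():
        raise ValueError(f"Credentials file does not exist: {credentials_file_path}")
    parser = configparser.ConfigParser()
    parser.read(credentials_file_path)
    if user not in parser:
        raise ValueError(f"Profile '{user}' not found in {credentials_file_path}")
    section = parser[user]
    try:
        return section["aws_access_key_id"], section["aws_secret_access_key"]
    except KeyError as e:
        raise ValueError(f"Missing credential key {e} in profile '{user}'")
```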
Thanks for catching this.
```python
for path in parsed_args.paths:
    resolved_path = pathlib.Path(path).resolve()
    mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / pathlib.Path(
        resolved_path
    ).relative_to(resolved_path.anchor)
    container_logs_list_file.write(f"{mounted_path}\n")
```
🛠️ Refactor suggestion
Add similar error handling for command-line paths.
The same error handling should be applied to paths provided via command-line arguments.
Apply this diff to add error handling:
```diff
-resolved_path = pathlib.Path(path).resolve()
-mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / pathlib.Path(
-    resolved_path
-).relative_to(resolved_path.anchor)
+try:
+    resolved_path = pathlib.Path(path).resolve()
+    if not resolved_path.exists():
+        logger.warning(f"Path does not exist: {path}")
+        continue
+    mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / pathlib.Path(
+        resolved_path
+    ).relative_to(resolved_path.anchor)
+except (OSError, RuntimeError) as e:
+    logger.error(f"Failed to resolve path {path}: {e}")
+    continue
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for path in parsed_args.paths:
    try:
        resolved_path = pathlib.Path(path).resolve()
        if not resolved_path.exists():
            logger.warning(f"Path does not exist: {path}")
            continue
        mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / pathlib.Path(
            resolved_path
        ).relative_to(resolved_path.anchor)
    except (OSError, RuntimeError) as e:
        logger.error(f"Failed to resolve path {path}: {e}")
        continue
    container_logs_list_file.write(f"{mounted_path}\n")
```
Minor comments.
For the PR title, how about:
feat(clp-package): Add support for clp-json to ingest logs from S3.
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Actionable comments posted: 0
🧹 Nitpick comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)
`1-14`: Organize imports by standard library, third-party, and local packages.
Consider reorganizing the imports into three groups separated by a blank line:

- Standard library imports (`configparser`, `re`, `pathlib`, `typing`)
- Third-party imports (`boto3`, `botocore`)
- Local imports (`job_orchestration`, `clp_py_utils`)

```diff
 import configparser
 import re
 from pathlib import Path
 from typing import List, Tuple
+
 import boto3
 from botocore.config import Config
+
 from job_orchestration.scheduler.job_config import S3InputConfig
 from clp_py_utils.clp_config import S3Config
 from clp_py_utils.compression import FileMetadata
```
`20-27`: Enhance the docstring with specific return value types.
The docstring should specify the exact types being returned and clarify that the values are strings.

```diff
 """
 Parses the `aws_access_key_id` and `aws_secret_access_key` of `user` from the given credentials_file_path.
 :param credentials_file_path:
 :param user:
-:return: A tuple of (aws_access_key_id, aws_secret_access_key)
+:return: A tuple of (str, str) containing (aws_access_key_id, aws_secret_access_key)
 :raises: ValueError if the file doesn't exist, or doesn't contain valid aws credentials.
 """
```
`74-75`: Improve the error message with URL format examples.
The error message should include examples of valid URL formats to help users understand the expected format.

```diff
 if match is None:
-    raise ValueError(f"Unsupported URL format: {s3_url}")
+    raise ValueError(
+        f"Unsupported URL format: {s3_url}. Expected formats:\n"
+        "- Host-style: https://<bucket>.s3.<region>.amazonaws.com/<key>\n"
+        "- Path-style: https://s3.<region>.amazonaws.com/<bucket>/<key>"
+    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- components/clp-package-utils/clp_package_utils/scripts/compress.py (4 hunks)
- components/clp-package-utils/clp_package_utils/scripts/native/compress.py (4 hunks)
- components/clp-py-utils/clp_py_utils/s3_utils.py (2 hunks)
- components/job-orchestration/job_orchestration/executor/compress/compression_task.py (11 hunks)
👮 Files not reviewed due to content moderation or server errors (3)
- components/clp-package-utils/clp_package_utils/scripts/native/compress.py
- components/clp-package-utils/clp_package_utils/scripts/compress.py
- components/job-orchestration/job_orchestration/executor/compress/compression_task.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)
`88-98`: LGTM! Input validation is thorough.
The function properly validates all required parameters before generating the URL.
`101-137`: LGTM! Pagination and metadata handling are well implemented.
The function correctly handles pagination and object metadata retrieval, with appropriate filtering for directory-like paths.
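For reference, the boto3 pagination pattern this comment alludes to typically looks like the sketch below (the bucket and prefix names are placeholders, and the directory filter mirrors the described behaviour rather than the file's exact code):

```python
import boto3

s3_client = boto3.client("s3")
paginator = s3_client.get_paginator("list_objects_v2")

# Iterate over all result pages; each page holds up to 1000 objects.
for page in paginator.paginate(Bucket="example-bucket", Prefix="logs/"):
    for obj in page.get("Contents", []):
        # Skip zero-byte "directory" placeholders whose keys end with '/'.
        if obj["Key"].endswith("/"):
            continue
        print(obj["Key"], obj["Size"])
```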
Line range hint `140-176`: LGTM! Error handling and retry configuration are well implemented.
The function includes appropriate validation, retry configuration, and clear error messages.
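The retry configuration referred to is botocore's `Config(retries=...)` mechanism; a minimal sketch (the mode and attempt count here are illustrative, not the values used in `s3_utils.py`):

```python
import boto3
from botocore.config import Config

# "adaptive" mode adds client-side rate limiting on top of exponential backoff.
retry_config = Config(retries={"max_attempts": 3, "mode": "adaptive"})
s3_client = boto3.client("s3", config=retry_config)
```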
Description
This PR adds S3 ingestion support for the clp-s package. Currently, we support compressing with one s3_url (with prefix matching) at a time. Users can specify the credentials either through the command line or via a credentials file; if no credentials are provided, the package will attempt to use the associated IAM role, if there is one.
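A minimal sketch of that resolution order (the function and parameter names are hypothetical; `parse_aws_credentials_file` is the helper discussed in the review above, and boto3 falls back to the instance's IAM role automatically when no explicit credentials are supplied):

```python
import boto3


def make_s3_client(access_key_id=None, secret_access_key=None, credentials_file=None):
    # 1. Explicit credentials from the command line take precedence.
    if access_key_id and secret_access_key:
        return boto3.client(
            "s3",
            aws_access_key_id=access_key_id,
            aws_secret_access_key=secret_access_key,
        )
    # 2. Otherwise, read credentials from a file if one was given.
    if credentials_file is not None:
        key_id, secret = parse_aws_credentials_file(credentials_file)
        return boto3.client("s3", aws_access_key_id=key_id, aws_secret_access_key=secret)
    # 3. Fall back to boto3's default chain (env vars, IAM role, ...).
    return boto3.client("s3")
```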
The PR contains the following changes:
The design decisions involved in this PR are documented in this doc
Due to time limitations, we leave the following items as TODOs for the future:
If users provide invalid S3 information, such as incorrect credentials or a non-existent bucket, the issue is only detected when the package attempts to schedule the job. Additionally, if the provided S3 credentials lack sufficient privileges (e.g., list_object is allowed, but get_object is not), the package returns unhelpful error messages to the web UI ("see error log.txt blabla"), making it confusing for users.
Ideally, invalid S3 configurations should be detected at package startup to provide immediate and actionable feedback.
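As a sketch of what such startup validation could look like (not part of this PR; `check_s3_access` is a hypothetical helper — note it covers reachability and list permission, but verifying get_object would additionally require fetching an actual object):

```python
import boto3
from botocore.exceptions import ClientError


def check_s3_access(s3_client, bucket: str, prefix: str) -> None:
    """Fail fast if the bucket is unreachable or credentials lack permissions."""
    try:
        # Verifies the bucket exists and the credentials can reach it.
        s3_client.head_bucket(Bucket=bucket)
        # Verifies list permission on the prefix with a minimal request.
        s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    except ClientError as e:
        raise ValueError(f"S3 validation failed for s3://{bucket}/{prefix}: {e}")
```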
Validation performed
Manually verified that both fs compression and s3 compression work from the command line.
Validations done after code review:
Launched compression successfully with
Launched compression with the following configuration and confirmed that errors are properly reported:
Also sanity-checked the error for writing to S3.
Summary by CodeRabbit
Release Notes
New Features
Improvements
Infrastructure
Bug Fixes