
Conversation

@haiqi96 (Contributor) commented on Jan 3, 2025

Description

This PR adds S3 ingestion support to the clp-s package. Currently, we support compressing from one S3 URL (with prefix matching) at a time. Users can specify credentials either through the command line or via a credentials file. If no credentials are provided, the package attempts to use the associated IAM role, if any.
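
For reference, the credentials-file path expects the standard AWS credentials INI format. Below is a minimal, illustrative sketch of parsing it with configparser under the "default" profile; the helper name read_aws_credentials is hypothetical (the package's actual parser, parse_aws_credentials_file, is reviewed later in this thread):

    import configparser
    from pathlib import Path
    from typing import Tuple

    def read_aws_credentials(credentials_file_path: Path, user: str = "default") -> Tuple[str, str]:
        # Expects the standard AWS credentials INI layout, e.g.:
        #   [default]
        #   aws_access_key_id = AKIA...
        #   aws_secret_access_key = ...
        config = configparser.ConfigParser()
        if not config.read(credentials_file_path):
            raise ValueError(f"Credentials file not found: {credentials_file_path}")
        if user not in config:
            raise ValueError(f"No credentials section found for user '{user}'")
        section = config[user]
        try:
            return section["aws_access_key_id"], section["aws_secret_access_key"]
        except KeyError as e:
            raise ValueError(f"Credentials file is missing {e}") from e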

The PR contains the following changes:

  1. Introduces S3InputConfig and FsInputConfig for the different ingestion paths, and updates the compression execution script to handle each input type (see the sketch after this list).
  2. Updates the compression scheduler to handle scheduling of S3 objects. Note that, to avoid excessive changes, S3 objects reuse the existing metadata class for file paths.
  3. Updates the compress.sh interface to allow specifying whether the input comes from the local filesystem (fs) or from S3 (s3), and simplifies the native compress.sh script's interface to always accept a target list.
  4. Updates the execution container to install the necessary dependencies for the curl library.
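
As a rough sketch of how the two input configs might be routed in the execution path (the field names, defaults, and routing below are illustrative assumptions, not the actual class definitions):

    from dataclasses import dataclass
    from typing import List, Optional, Union

    @dataclass
    class FsInputConfig:
        # Local filesystem paths to compress; the field name is illustrative.
        paths: List[str]

    @dataclass
    class S3InputConfig:
        # A single S3 URL; objects are matched by key prefix. Credentials are optional
        # so an attached IAM role can be used when none are provided.
        s3_url: str
        aws_access_key_id: Optional[str] = None
        aws_secret_access_key: Optional[str] = None

    def generate_logs_list(input_config: Union[FsInputConfig, S3InputConfig]) -> List[str]:
        # Routes to the appropriate log-list generator based on the input type.
        if isinstance(input_config, S3InputConfig):
            # Placeholder: the real code would enumerate the matching S3 objects here.
            return [input_config.s3_url]
        # Placeholder: the real code would resolve and validate each local path here.
        return list(input_config.paths)

    print(generate_logs_list(FsInputConfig(paths=["/var/log/app.log"])))
    print(generate_logs_list(S3InputConfig(s3_url="s3://my-bucket/logs/")))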

The design decisions involved in this PR are documented in this doc.

Due to time limitations, we leave the following items as TODOs for the future:

If users provide invalid S3 information, such as incorrect credentials or a non-existent bucket, the issue is only detected when the package attempts to schedule the job. Additionally, if the provided S3 credentials lack sufficient privileges (e.g., list_object is allowed but get_object is not), the package returns an unhelpful error message to the web UI ("see error log.txt blabla"), making it confusing for users.

Ideally, invalid S3 configurations should be detected at package startup to provide immediate and actionable feedback.
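
A minimal sketch of what such startup validation could look like, assuming boto3 is available; the function name and the exact sequence of checks are illustrative, not part of this PR:

    import boto3
    from botocore.exceptions import ClientError

    def validate_s3_input(bucket, prefix, aws_access_key_id=None, aws_secret_access_key=None):
        # Fails fast if the bucket is unreachable or the credentials can't list/read objects.
        s3 = boto3.client(
            "s3",
            aws_access_key_id=aws_access_key_id,
            aws_secret_access_key=aws_secret_access_key,
        )
        try:
            # Confirms the bucket exists and is reachable with these credentials.
            s3.head_bucket(Bucket=bucket)
            # Confirms list permission is granted and the prefix matches at least one object.
            response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
            if response.get("KeyCount", 0) == 0:
                raise ValueError(f"No objects found under s3://{bucket}/{prefix}")
            # Confirms read (get_object) permission without downloading the object body.
            s3.head_object(Bucket=bucket, Key=response["Contents"][0]["Key"])
        except ClientError as e:
            raise ValueError(f"S3 configuration check failed: {e}") from e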

Validation performed

Manually verified that both fs compression and S3 compression work from the command line.

Validation done after code review:
Launched compression successfully with:

  1. FS input, with a --path-list that contains two files.
  2. FS input, with the PATH positional argument.
  3. A valid S3 URL, with the AWS key ID and secret key specified through command-line arguments.
  4. A valid S3 URL, with the AWS key ID and secret key specified through a credentials file.

Launched compression with the following configurations and confirmed that errors are properly reported:

  1. A valid S3 URL with an invalid credentials file.
  2. An invalid S3 URL that doesn't resolve to any objects.
  3. An invalid S3 URL with a non-existent bucket.
  4. An S3 URL whose bucket can't be accessed.
  5. An S3 URL with no access permission.

Also sanity-checked the errors for writing to S3:

  1. Setting the destination to a path without write permission; verified that the error message makes sense.
  2. Setting archive_output with invalid S3 credentials; verified that the error message makes sense.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for compressing files from both filesystem and Amazon S3 sources.
    • Introduced new functions for generating log lists for filesystem and S3 inputs.
    • Enhanced configuration management with new input types for compression tasks.
  • Improvements

    • Refined input validation and error handling for compression processes.
    • Improved task routing and execution for compression jobs.
    • Streamlined command generation for S3 operations and enhanced metadata handling.
  • Infrastructure

    • Updated Docker base images with additional certificate packages.
  • Bug Fixes

    • Resolved issues with path resolution and input type handling.

@coderabbitai (bot) left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

91-102: ⚠️ Potential issue

Use environment variables for AWS credentials.

Passing AWS credentials as command-line arguments is insecure, as they may be visible in process listings and shell history.
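
A minimal sketch of the environment-variable alternative, assuming the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variable names; the helper name is hypothetical:

    import os

    def read_aws_credentials_from_env():
        # Reads the standard AWS environment variables instead of taking secrets on the
        # command line, where they would show up in `ps` output and shell history.
        key_id = os.environ.get("AWS_ACCESS_KEY_ID")
        secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
        if key_id is None or secret_key is None:
            raise ValueError(
                "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set in the environment"
            )
        return key_id, secret_key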

🧹 Nitpick comments (2)
components/clp-package-utils/clp_package_utils/scripts/compress.py (2)

234-242: Use tempfile module for generating temporary files.

Instead of using a while loop to find an unused filename, use Python's tempfile module, which handles this safely.

Apply this diff to use tempfile:

+    import tempfile
-    # Write compression logs to a file
-    while True:
-        # Get unused output path
-        container_logs_list_filename = f"{uuid.uuid4()}.txt"
-        container_logs_list_path = clp_config.logs_directory / container_logs_list_filename
-        logs_list_path_on_container = (
-            container_clp_config.logs_directory / container_logs_list_filename
-        )
-        if not container_logs_list_path.exists():
-            break
+    with tempfile.NamedTemporaryFile(suffix='.txt', dir=clp_config.logs_directory, delete=False) as temp_file:
+        container_logs_list_path = pathlib.Path(temp_file.name)
+        container_logs_list_filename = container_logs_list_path.name
+        logs_list_path_on_container = container_clp_config.logs_directory / container_logs_list_filename

191-191: Improve S3 URL help message.

The help message should indicate that the URL can be a prefix and provide an example format.

Apply this diff to improve the help message:

-    s3_compressor_parser.add_argument("url", metavar="URL", help="URL of objects to be compressed")
+    s3_compressor_parser.add_argument("url", metavar="URL", help="S3 URL of objects to be compressed (e.g., s3://bucket/prefix)")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99ea6dd and 996692e.

📒 Files selected for processing (1)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

249-251: Verify S3 URL format before compression.

Add validation for the S3 URL format to catch invalid URLs early.

Run the following script to check S3 URL usage in the codebase:

Comment on lines 94 to 97
            default_credentials_user = "default"
            aws_access_key_id, aws_secret_access_key = parse_aws_credentials_file(
                pathlib.Path(parsed_args.aws_credentials_file, default_credentials_user)
            )
@coderabbitai (bot) commented on Jan 16, 2025

⚠️ Potential issue

Fix incorrect usage of pathlib.Path for AWS credentials file.

The pathlib.Path constructor is being used incorrectly. The second argument default_credentials_user should not be passed to Path as it's not part of the file path.

Apply this diff to fix the issue:

-            default_credentials_user = "default"
             aws_access_key_id, aws_secret_access_key = parse_aws_credentials_file(
-                pathlib.Path(parsed_args.aws_credentials_file, default_credentials_user)
+                pathlib.Path(parsed_args.aws_credentials_file)
             )

@haiqi96 (Contributor, Author) replied:

Thanks for catching this.

@coderabbitai (bot) replied:

@haiqi96 You're welcome! Let me know if you need anything else.



Comment on lines 51 to 56
            for path in parsed_args.paths:
                resolved_path = pathlib.Path(path).resolve()
                mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / pathlib.Path(
                    resolved_path
                ).relative_to(resolved_path.anchor)
                container_logs_list_file.write(f"{mounted_path}\n")
@coderabbitai (bot) commented:

🛠️ Refactor suggestion

Add similar error handling for command-line paths.

The same error handling should be applied to paths provided via command-line arguments.

Apply this diff to add error handling:

-                resolved_path = pathlib.Path(path).resolve()
-                mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / pathlib.Path(
-                    resolved_path
-                ).relative_to(resolved_path.anchor)
+                try:
+                    resolved_path = pathlib.Path(path).resolve()
+                    if not resolved_path.exists():
+                        logger.warning(f"Path does not exist: {path}")
+                        continue
+                    mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / pathlib.Path(
+                        resolved_path
+                    ).relative_to(resolved_path.anchor)
+                except (OSError, RuntimeError) as e:
+                    logger.error(f"Failed to resolve path {path}: {e}")
+                    continue

@kirkrodrigues (Member) left a comment

Minor comments.

For the PR title, how about:

feat(clp-package): Add support for clp-json to ingest logs from S3.

@coderabbitai (bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)

1-14: Organize imports by standard library, third-party, and local packages.

Consider reorganizing the imports into three groups separated by a blank line:

  1. Standard library imports (configparser, re, pathlib, typing)
  2. Third-party imports (boto3, botocore)
  3. Local imports (job_orchestration, clp_py_utils)
 import configparser
 import re
 from pathlib import Path
 from typing import List, Tuple
+
 import boto3
 from botocore.config import Config
+
 from job_orchestration.scheduler.job_config import S3InputConfig
 from clp_py_utils.clp_config import S3Config
 from clp_py_utils.compression import FileMetadata

20-27: Enhance docstring with specific return value types.

The docstring should specify the exact types being returned and clarify that the values are strings.

     """
     Parses the `aws_access_key_id` and `aws_secret_access_key` of `user` from the given
     credentials_file_path.
     :param credentials_file_path:
     :param user:
-    :return: A tuple of (aws_access_key_id, aws_secret_access_key)
+    :return: A tuple of (str, str) containing (aws_access_key_id, aws_secret_access_key)
     :raises: ValueError if the file doesn't exist, or doesn't contain valid aws credentials.
     """

74-75: Improve error message with URL format examples.

The error message should include examples of valid URL formats to help users understand the expected format.

     if match is None:
-        raise ValueError(f"Unsupported URL format: {s3_url}")
+        raise ValueError(
+            f"Unsupported URL format: {s3_url}. Expected formats:\n"
+            "- Host-style: https://<bucket>.s3.<region>.amazonaws.com/<key>\n"
+            "- Path-style: https://s3.<region>.amazonaws.com/<bucket>/<key>"
+        )
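
For context, a rough sketch of matching both URL styles with Python's re module; the patterns and group names are illustrative, don't cover every case (e.g., bucket names containing dots), and are not the package's actual implementation:

    import re

    _HOST_STYLE = re.compile(
        r"^https://(?P<bucket>[^.]+)\.s3(?:\.(?P<region>[^.]+))?\.amazonaws\.com/(?P<key_prefix>.*)$"
    )
    _PATH_STYLE = re.compile(
        r"^https://s3(?:\.(?P<region>[^.]+))?\.amazonaws\.com/(?P<bucket>[^/]+)/(?P<key_prefix>.*)$"
    )

    def parse_s3_url_example(s3_url):
        # Tries host-style first, then path-style; raises with both expected formats on failure.
        for pattern in (_HOST_STYLE, _PATH_STYLE):
            match = pattern.match(s3_url)
            if match is not None:
                return match.group("region"), match.group("bucket"), match.group("key_prefix")
        raise ValueError(
            f"Unsupported URL format: {s3_url}. Expected formats:\n"
            "- Host-style: https://<bucket>.s3.<region>.amazonaws.com/<key>\n"
            "- Path-style: https://s3.<region>.amazonaws.com/<bucket>/<key>"
        )
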
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 996692e and 351f239.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (4 hunks)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (2 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (11 hunks)
👮 Files not reviewed due to content moderation or server errors (3)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/compress.py
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)

88-98: LGTM! Input validation is thorough.

The function properly validates all required parameters before generating the URL.


101-137: LGTM! Pagination and metadata handling are well implemented.

The function correctly handles pagination and object metadata retrieval, with appropriate filtering for directory-like paths.
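
For readers unfamiliar with the pattern, a generic sketch of paginated listing with boto3 that skips directory-marker keys; the helper name and filtering rule are illustrative, not the reviewed function itself:

    import boto3

    def list_objects_under_prefix(bucket, prefix):
        # Uses boto3's paginator so listings larger than 1000 keys are handled transparently.
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        objects = []
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                # Skip zero-byte "directory" placeholder keys that end with a slash.
                if obj["Key"].endswith("/"):
                    continue
                objects.append((obj["Key"], obj["Size"]))
        return objects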


Line range hint 140-176: LGTM! Error handling and retry configuration are well implemented.

The function includes appropriate validation, retry configuration, and clear error messages.

@haiqi96 changed the title from "feat(clp-package): Add support for clp-s s3 ingestion" to "feat(clp-package): Add support for clp-json to ingest logs from S3." on Jan 16, 2025
@haiqi96 requested a review from kirkrodrigues on January 16, 2025 at 17:43
@haiqi96 merged commit fae7fcc into y-scope:main on Jan 16, 2025 (9 checks passed)
@haiqi96 deleted the s3_scheduler branch on June 16, 2025