Skip to content

Conversation

Eden-D-Zhang
Copy link
Contributor

@Eden-D-Zhang Eden-D-Zhang commented Mar 4, 2025

Description

This PR introduces significant improvements to AWS authentication handling and other improvements to the CLP package. These changes include:

  1. Flexible AWS Authentication
    • Introduced the AwsAuthentication class within the existing S3Config to support multiple authentication methods:
      • EC2 instance metadata (ec2)
      • Environment variables (env_vars)
      • AWS configuration profile (profile)
      • Explicit credentials (credentials)
    • Added the aws_config_directory field to CLPConfig to specify the location of the .aws where profiles are configured (defaults to ~/.aws).
    • Added functionality to provide credential sources to components depending on authentication type:
      • aws_config_dir Docker mount for authentication through configured profiles.
      • Modified the S3Manager class in the log_viewer_webui component to support authentication through profiles.
      • Environmental variable handling for tasks that cannot authenticate dynamically.
    • Refactored existing authentication control flow to eliminate redundant code.
  2. Other Improvements
    • Refactored the S3InputConfig class to inherit S3Config class to allow reusing similar code.
    • Refactored handling of docker mounts and environmental variables in start_clp.py:
      • Created append_docker_mounts and append_docker_env_vars functions to simplify construction of Docker container start commands.
      • Removed hardcoded environmental variables in Docker start commands to make use of aforementioned append_docker_env_vars function and improve maintainabiility.
      • These changes help with method-specific and component-specific AWS authentication.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Manually verified compression, search, and log viewer both locally and on EC2 using all combinations of authentication methods.

Summary by CodeRabbit

  • New Features

    • Added support for multiple AWS authentication methods including EC2, environment variables, profiles, and credentials with session tokens.
    • Enabled mounting of host AWS configuration directories into Docker containers for seamless AWS credential access.
    • Log Viewer Web UI now supports AWS profile specification for S3 access.
  • Improvements

    • Centralized AWS authentication management across components, simplifying credential handling.
    • Streamlined Docker container startup by consolidating environment variable and mount option configuration.
    • Enhanced configuration files and UI to accommodate new AWS authentication options.
  • Documentation

    • Updated configuration templates and settings to incorporate new AWS authentication structures and directory options.
    • Revised user guides to explain expanded AWS authentication methods and updated IAM policy requirements.

Copy link
Contributor

coderabbitai bot commented Mar 4, 2025

Warning

Rate limit exceeded

@Eden-D-Zhang has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 5 minutes and 46 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between b56ac35 and c9f2c07.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (4 hunks)

Walkthrough

This set of changes implements a comprehensive refactor and extension of AWS authentication handling across several components of the codebase. A new AwsAuthentication model is introduced to standardize and validate AWS authentication methods, supporting profiles, credentials, environment variables, and EC2 roles. The S3 configuration classes and logic are updated to use this new model, and support for session tokens is added. Docker container orchestration scripts are refactored to centralize the injection of environment variables and mounts, especially for AWS credentials and configuration directories. The log viewer web UI and its server plugins are updated to optionally accept and use an AWS profile. Documentation and configuration templates are updated to reflect these new options.

Changes

File(s) Change Summary
components/clp-package-utils/clp_package_utils/general.py Added an optional aws_config_dir attribute to CLPDockerMounts for AWS config directory mounting; updated container config generation to mount the host's AWS config directory to /.aws in the container if the directory exists.
components/clp-package-utils/clp_package_utils/scripts/compress.py Removed the unused import of CLPConfig from clp_py_utils.clp_config.
components/clp-package-utils/clp_package_utils/scripts/native/compress.py Changed the keyword argument from credentials to aws_authentication when constructing S3InputConfig.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py Introduced helper function append_docker_options to consolidate appending Docker mount and environment variable options; refactored multiple container start functions to use this helper; improved AWS authentication handling by integrating generate_container_auth_options for scheduler, worker, and log viewer web UI containers; enhanced AWS profile handling in log viewer web UI to set profile in settings or environment variables accordingly; added AwsAuthType enum usage and clp_config.validate_aws_config_dir() call; reordered and cleaned environment variable and mount injections for consistency.
components/clp-py-utils/clp_py_utils/clp_config.py Added AwsAuthType enum and AwsAuthentication Pydantic model with validation for AWS auth types; extended S3Credentials with optional session_token; replaced credentials with aws_authentication in S3Config; added optional aws_config_directory path field to CLPConfig with validation method to enforce presence based on profile usage; updated serialization to include aws_config_directory; standardized string equality checks.
components/clp-py-utils/clp_py_utils/s3_utils.py Added functions _get_session_credentials, get_credential_env_vars, generate_container_auth_options, and _create_s3_client to centralize AWS credential environment variable handling and S3 client creation supporting all AWS auth types; updated s3_get_object_metadata and s3_put to use new client creation logic; removed total_max_attempts from s3_put; added validation and error handling for unsupported or missing AWS auth info.
components/job-orchestration/job_orchestration/executor/compress/compression_task.py Renamed make_clp_command_and_env and make_clp_s_command_and_env to private functions; replaced manual AWS credential environment variable setting with call to get_credential_env_vars for S3 authentication.
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py Updated imports and replaced manual AWS credential environment variable construction with get_credential_env_vars call for S3 storage case.
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py Reformatted imports and replaced manual AWS credential environment variable construction with get_credential_env_vars call.
components/job-orchestration/job_orchestration/scheduler/job_config.py Changed S3InputConfig to inherit from S3Config instead of BaseModel; removed explicit S3 fields now inherited; updated imports to use AwsAuthentication and S3Config.
components/log-viewer-webui/server/settings.json Added new optional configuration key "StreamFilesS3Profile" with default null for AWS profile selection.
components/log-viewer-webui/server/src/app.ts Modified S3Manager plugin registration to include the optional profile parameter from settings.
components/log-viewer-webui/server/src/plugins/S3Manager.ts Updated S3Manager constructor and plugin to accept and handle an optional profile parameter; included profile in S3 client initialization and log messages.
components/package-template/src/etc/clp-config.yml Added a commented-out aws_config_directory configuration option with a placeholder default path.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Script
    participant Docker
    participant Container
    participant AWS

    User->>CLI/Script: Start component (e.g., compression worker)
    CLI/Script->>CLI/Script: Load CLPConfig (incl. aws_authentication, aws_config_directory)
    CLI/Script->>CLI/Script: generate_container_auth_options()
    CLI/Script->>Docker: Build Docker command with env vars & mounts (incl. AWS creds/config)
    Docker->>Container: Start with provided env vars and mounts
    Container->>AWS: Access S3 using provided credentials/profile/config
Loading

Possibly related PRs

Suggested reviewers

  • junhaoliao
  • kirkrodrigues
  • haiqi96
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I think your apporach makes sense. we need to agree on where to put the .aws directory config in the config file.

Please also have a table where each row as a authentication method supported by boto3, with the following columns, so we can see the big picture

  1. Whether supported by clp package or not
  2. How should user specify such credentials in boto3 (and maybe they don't need to do anything)
  3. How should user specify such authentication info in clp package
  4. How CLP supports the method (i.e. via env, or mount)
  5. Any extra note(for example, webui viewer support)

It's also worthwhile to think about:
On failure, which component will run into the error and what user will see

@Eden-D-Zhang
Copy link
Contributor Author

Ignore latest commit doesn't work atm

Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked at start_clp.py carefully,. but let me leave the first batch of comments

Co-authored-by: haiqi96 <14502009+haiqi96@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (7)
docs/src/user-guide/guides-using-object-storage/clp-config.md (7)

22-24: Add context for aws_config_directory with profile authentication
The logs_input.s3_config snippet shows only the aws_authentication block. It would help to include a brief inline reminder (or comment) that when using the profile type, an aws_config_directory top-level key must also be set.


32-33: Rephrase link for readability
Minor nit: consider rewording the bullet for smoother flow, for example:

-* `<type>` and the type-specific settings are described in the [configuring AWS authentication](#configuring-aws-authentication) section.
+* See the [Configuring AWS authentication](#configuring-aws-authentication) section for type-specific settings.

50-52: Add context for aws_config_directory with profile authentication
Similarly, the archive_output.storage.s3_config example shows only aws_authentication. Including a note about setting aws_config_directory at the top level when using profile authentication would aid reader comprehension.


65-66: Rephrase link for consistency
Same nit as above: you might reword the bullet for consistency:

-* `<type>` and the type-specific settings are described in the [configuring AWS authentication](#configuring-aws-authentication) section.
+* See the [Configuring AWS authentication](#configuring-aws-authentication) section for type-specific settings.

83-85: Add context for aws_config_directory with profile authentication
In the stream_output.storage.s3_config example, a reminder to set aws_config_directory at the top level for the profile type would prevent readers from missing that requirement.


98-99: Rephrase link for consistency
A minor style suggestion to keep the phrasing consistent across bullets:

-* `<type>` and the type-specific settings are described in the [configuring AWS authentication](#configuring-aws-authentication) section.
+* See the [Configuring AWS authentication](#configuring-aws-authentication) section for type-specific settings.

106-116: Clarify nesting admonition and aws_config_directory scope
The note correctly warns that aws_authentication is shown at the top level but should be nested under *.s3_config. To avoid confusion with the separate aws_config_directory key (which is always top level), you could explicitly state that only the authentication block is illustrated top level here and that aws_config_directory remains a top-level sibling of logs_input, archive_output, etc.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c36fdb and c3d51f0.

📒 Files selected for processing (1)
  • docs/src/user-guide/guides-using-object-storage/clp-config.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/user-guide/guides-using-object-storage/clp-config.md

[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s the bucket's name. * <key-prefix> is the prefix of all logs you wish to comp...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~152-~152: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ... "" ``` <profile-name> should be the name of an existing [AWS CLI pro...

(MISSING_IT_THERE)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (macos-latest)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (1)
docs/src/user-guide/guides-using-object-storage/clp-config.md (1)

118-118: Settings list is clear and consistent
The introduction "Settings for each type are described below:" and the subsequent list of authentication types is clear and follows the order of the detailed sections.

junhaoliao
junhaoliao previously approved these changes Apr 25, 2025
Co-authored-by: haiqi96 <14502009+haiqi96@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
components/clp-py-utils/clp_py_utils/s3_utils.py (4)

27-46: Function implementation looks good, but could improve error handling

The function correctly retrieves credentials from a boto3 session, but it doesn't handle specific boto3 exceptions that might occur (like ProfileNotFound). Also, the docstring should clarify what "None" return means.

 def _get_session_credentials(aws_profile: Optional[str] = None) -> Optional[S3Credentials]:
     """
     Generates AWS credentials created by boto3 when starting a session with given profile.
     :param aws_profile: Name of profile configured in ~/.aws directory
     :return: An S3Credentials object with access key pair and session token if applicable.
+            Returns None if credentials cannot be retrieved (e.g., missing profile, 
+            insufficient permissions, or expired credentials).
+    :raises: ProfileNotFound if the specified profile does not exist
     """
     aws_session: Optional[boto3.Session] = None
-    if aws_profile is not None:
-        aws_session = boto3.Session(profile_name=aws_profile)
-    else:
-        aws_session = boto3.Session()
+    try:
+        if aws_profile is not None:
+            aws_session = boto3.Session(profile_name=aws_profile)
+        else:
+            aws_session = boto3.Session()
+    except boto3.exceptions.ProfileNotFound:
+        raise ValueError(f"AWS profile '{aws_profile}' not found")
+        
     credentials = aws_session.get_credentials()
     if credentials is None:
         return None

90-153: Well-structured container auth options function

The function correctly handles different component types and authentication methods, with appropriate validations and error handling. This implementation aligns well with the PR objectives for supporting various AWS authentication methods.

Consider adding more inline comments to explain the logic flow, particularly about when AWS config mounting is needed versus when environment variables are used.


156-180: Good implementation of S3 client creation

The function handles all the required authentication types and creates the appropriate boto3 session and client.

For consistency with other functions in the boto3 ecosystem, consider renaming to _get_s3_client instead of _create_s3_client.


230-252: Good refactoring of s3_put with retry configuration

The function now uses the centralized S3 client creation and adds retry configuration with appropriate defaults.

Consider making the retry parameters configurable through additional optional parameters instead of hardcoding "3" attempts.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3d51f0 and 3c7492c.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
  • S3InputConfig (31-34)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
  • AwsAuthType (59-63)
  • S3Config (372-396)
  • S3Credentials (322-337)
  • StorageType (54-56)
components/clp-py-utils/clp_py_utils/compression.py (1)
  • FileMetadata (10-25)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/s3_utils.py (2)

49-87: Comprehensive implementation of credential env vars function

The function handles all authentication types described in the PR objectives (credentials, profile, ec2, env_vars) and correctly manages session tokens. The code is well-structured with appropriate error handling.


209-209: LGTM - Good refactoring to use centralized client creation

Using the new _create_s3_client function improves code reuse and consistency in authentication handling.

junhaoliao
junhaoliao previously approved these changes Apr 25, 2025
@junhaoliao
Copy link
Member

The webui changes lgtm

Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change looks good to me. Added one more suggestion to the doc.

kirkrodrigues and others added 2 commits April 25, 2025 12:27
Co-authored-by: haiqi96 <14502009+haiqi96@users.noreply.github.com>
@kirkrodrigues kirkrodrigues changed the title feat(clp-package): Add support for other AWS authentication methods feat(clp-package): Add support for more AWS authentication methods. Apr 25, 2025
haiqi96
haiqi96 previously approved these changes Apr 25, 2025
Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)

32-52: Simplify session creation logic.

The function works correctly, but session creation can be simplified.

def _get_session_credentials(aws_profile: Optional[str] = None) -> Optional[S3Credentials]:
    """
    Generates AWS credentials created by boto3 when starting a session with given profile.
    :param aws_profile: Name of profile configured in ~/.aws directory
    :return: An S3Credentials object with access key pair and session token if applicable.
    """
-    aws_session: Optional[boto3.Session] = None
-    if aws_profile is not None:
-        aws_session = boto3.Session(profile_name=aws_profile)
-    else:
-        aws_session = boto3.Session()
+    aws_session = boto3.Session(profile_name=aws_profile)
    credentials = aws_session.get_credentials()
    if credentials is None:
        return None

54-92: Clean implementation of credential environment variables generation.

The function appropriately handles different authentication types and properly handles errors with descriptive messages.

There's a small inconsistency in string formatting on line 91:

    if aws_session_token is not None:
-        env_vars[f"{AWS_ENV_VAR_SESSION_TOKEN}"] = aws_session_token
+        env_vars[AWS_ENV_VAR_SESSION_TOKEN] = aws_session_token

95-160: Well-structured container authentication options generation.

The function properly generates the necessary authentication options based on component type and storage configuration.

Consider adding an early return when no S3 storages are found for better readability:

    for storage in storages_by_component_type:
        if StorageType.S3 == storage.type:
            auth = storage.s3_config.aws_authentication
            if AwsAuthType.profile == auth.type:
                config_mount = True
            elif AwsAuthType.env_vars == auth.type:
                add_env_vars = True

+    if not config_mount and not add_env_vars:
+        return (False, [])

    credentials_env_vars = []
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92c6ab6 and b56ac35.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (6)
components/clp-py-utils/clp_py_utils/s3_utils.py (6)

1-4: Imports and type hints look good.

The addition of the typing imports and the switch to using constants for AWS environment variables are solid improvements.


9-22: Clean import structure from clp_config.

The explicit imports of specific classes and constants from clp_config is a good practice, making dependencies clear.


27-29: Good use of constants for AWS environment variables.

Using constants for AWS environment variables helps prevent typos and makes the code more maintainable.


162-186: Well-implemented S3 client creation with proper authentication handling.

The function handles different authentication types correctly and provides clear error messages for unsupported types.


215-215: Good use of centralized client creation.

Replacing direct client creation with the new helper function improves consistency.


236-258: Improved S3 put implementation.

Moving the retry configuration inline and using the centralized client creation function improves code consistency.

kirkrodrigues
kirkrodrigues previously approved these changes Apr 25, 2025
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I edited the PR title slightly. Let me know if y'all disagree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants