feat: Added s3 support for mediaWiki ETL! #44
Conversation
Walkthrough

The changes introduce S3-compatible storage integration using MinIO for the MediaWiki ETL workflow. Configuration templates and Docker Compose are updated to support MinIO, and a new S3 client is added for storing and retrieving intermediate data. The ETL activities and workflow are modified to persist and load data via S3, and new dependencies for AWS SDK libraries are included.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Extractor as extract_mediawiki
participant S3 as S3Client
participant Transformer as transform_mediawiki_data
participant Loader as load_mediawiki_data
Extractor->>S3: store_extracted_data()
Transformer->>S3: store_transformed_data()
Transformer->>Extractor: return transformed_data_key
Loader->>S3: get_data_by_key(transformed_data_key)
S3-->>Loader: Return transformed data
Loader->>Loader: Load data into target
Actionable comments posted: 4
🔭 Outside diff range comments (1)
hivemind_etl/mediawiki/workflows.py (1)
61-82: 💡 Verification agent
🧩 Analysis chain
Verify error handling for S3 operations.
While the code structure looks good, S3 operations can fail for various reasons (network issues, permissions, etc.). The workflow has retry policies, but you might want to consider specific error handling for S3-related failures.
🏁 Script executed:
```bash
#!/bin/bash
# Check if there are specific S3 error handling mechanisms in the activities
echo "Checking S3 error handling in activities.py..."
rg -A 3 -B 3 "except.*boto|S3|AWS" hivemind_etl/mediawiki/activities.py

echo "Checking for S3 exception types in S3 client implementation..."
rg -A 2 "except" hivemind_etl/storage/s3_client.py
```

Length of output: 2692
Add Comprehensive S3 Failure Handling
While we already catch specific `ClientError` codes (404 bucket-not-found, `NoSuchKey`) in `S3Client`, there are other common S3/AWS failure modes that should be handled or surfaced to Temporal so your retry policies can take effect:

• hivemind_etl/storage/s3_client.py
  – In `store_transformed_data` and `get_data_by_key`, add catches for:
    • AccessDenied (403)
    • Throttling (`Throttling` or `ThrottlingException`)
    • Internal errors (500, `ServiceUnavailable`)
    • Endpoint/network errors (`EndpointConnectionError`)
  – For retryable errors, either let the exception bubble up (so the activity fails and Temporal retries) or implement a backoff/retry loop.
  – For non-recoverable errors, log with clear messages and fail fast.

• hivemind_etl/mediawiki/activities.py
  – The broad `except Exception` around your S3 calls can swallow root causes.
  – Replace it with a narrower `except ClientError as e:` where you log the AWS error code/message and then `raise` to allow Temporal to retry per your `RetryPolicy`.
  – Only catch and handle truly recoverable errors locally; let everything else propagate.

This will ensure you don't silently drop permissions/network errors, and that Temporal's retry policies actually trigger on S3 failures.
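As a rough illustration only (not the repository's actual code; the function name and `s3_client`/`transformed_data` arguments are placeholders), the narrowed activity-side handling described above could look roughly like this:

```python
import logging

from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical sketch: the goal is narrowing exception handling and re-raising so
# Temporal's RetryPolicy can take effect instead of a broad `except Exception`.
RETRYABLE_CODES = {"Throttling", "ThrottlingException", "ServiceUnavailable", "InternalError", "SlowDown"}


def store_with_clear_errors(s3_client, community_id: str, transformed_data) -> str:
    try:
        return s3_client.store_transformed_data(community_id, transformed_data)
    except EndpointConnectionError:
        # Network/endpoint failures are retryable: log and re-raise for Temporal.
        logging.exception("Could not reach the S3 endpoint")
        raise
    except ClientError as e:
        code = e.response.get("Error", {}).get("Code", "Unknown")
        if code in RETRYABLE_CODES:
            logging.warning("Retryable S3 error (%s); re-raising for Temporal retry", code)
            raise
        # AccessDenied and other non-recoverable errors: fail fast with a clear message.
        logging.error("Non-recoverable S3 error (%s): %s", code, e)
        raise
```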
🧹 Nitpick comments (3)
hivemind_etl/storage/s3_client.py (1)
46-59: Optional: explicitly set `use_ssl` instead of overloading `verify`

You already derive `self.secure` from `AWS_SECURE`; consider passing `use_ssl=self.secure, verify=self.secure,` to make the intention clear and avoid certificate-validation attempts when using http://.
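For illustration, a minimal sketch of that client construction; the parameter names below are assumptions about how `S3Client` wires its configuration, not the repository's confirmed field names:

```python
import boto3


def build_s3_client(endpoint_url: str, access_key: str, secret_key: str,
                    region: str, secure: bool):
    """Sketch only: mirrors the review's suggestion, not the project's real code."""
    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region,
        use_ssl=secure,   # state the SSL intent explicitly
        verify=secure,    # avoid certificate validation against plain-HTTP MinIO
    )
```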
hivemind_etl/mediawiki/activities.py (1)
112-121: Avoid repeated S3 client instantiation & propagate key downstream

`S3Client()` is instantiated on every activity; that's unnecessary overhead and complicates unit testing. Pass a client instance from the workflow context or make `S3Client` a lightweight singleton.

Example:

`s3_client = s3_client or S3Client()`

Also ensure the returned `transformed_key` is injected into the args for `load_mediawiki_data` (otherwise `transformed_data_key` will be missing).
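One possible shape for the singleton approach, sketched under the assumption that `S3Client` takes no constructor arguments (`get_s3_client` is a hypothetical helper, not existing project code):

```python
from functools import lru_cache

from hivemind_etl.storage.s3_client import S3Client


@lru_cache(maxsize=1)
def get_s3_client() -> S3Client:
    """Return a shared S3Client so activities don't rebuild it on every call.

    Unit tests can call get_s3_client.cache_clear() before patching S3Client.
    """
    return S3Client()
```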
docker-compose.dev.yml (1)
21-27: Do not store credentials in source-controlled compose files

Hard-coding `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` inside the compose file risks accidental exposure. It is safer to reference only `${...}` placeholders and keep defaults in `.env`, e.g.:

```yaml
environment:
  - AWS_ENDPOINT_URL=${AWS_ENDPOINT_URL}
  - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
  - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (7)
- .env.example (1 hunks)
- .gitignore (1 hunks)
- docker-compose.dev.yml (2 hunks)
- hivemind_etl/mediawiki/activities.py (5 hunks)
- hivemind_etl/mediawiki/workflows.py (3 hunks)
- hivemind_etl/storage/s3_client.py (1 hunks)
- requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
hivemind_etl/mediawiki/activities.py (3)
hivemind_etl/storage/s3_client.py (3)
- S3Client (13-145)
- store_transformed_data (94-110)
- get_data_by_key (112-122)

hivemind_etl/website/website_etl.py (1)
- transform (71-99)

hivemind_etl/mediawiki/etl.py (1)
- transform (48-96)
🪛 Ruff (0.11.9)
hivemind_etl/storage/s3_client.py
120-120: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / build-push / Build + Push Image
🔇 Additional comments (8)
requirements.txt (1)
12-13: Good addition of AWS SDK dependencies for S3 integration.

Adding `boto3` and `botocore` with minimum version constraints provides the necessary libraries for S3 integration. This aligns well with the PR's goal of adding S3 support for MediaWiki ETL.

Note that `boto3` already includes `botocore` as a dependency, so specifying both is technically redundant but can be useful for version pinning.

.gitignore (1)
168-171: Appropriate additions to .gitignore for S3 storage.

The additions properly exclude:
- `dump_*` files - temporary or output dump files
- `minio_data/` - local S3-compatible storage data directory
- `dumps/*` - all files in the dumps directory

These exclusions prevent committing large data files to the repository, which is a good practice.
.env.example (1)
33-38: Well-structured AWS S3 configuration environment variables.

The AWS environment variables follow standard naming conventions and include all the necessary parameters for S3 configuration:
- Endpoint URL for custom/MinIO endpoints
- Access credentials
- Bucket name
- Region
- Secure connection flag
This provides a good template for users setting up the application.
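As a sketch of how these settings might be read by the S3 client: only `AWS_ENDPOINT_URL`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SECURE` are named elsewhere in this review; the bucket and region variable names below are assumptions.

```python
import os

# Sketch only: AWS_BUCKET_NAME and AWS_REGION are assumed names, not confirmed by the PR.
endpoint_url = os.getenv("AWS_ENDPOINT_URL")
access_key = os.getenv("AWS_ACCESS_KEY_ID")
secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
bucket_name = os.getenv("AWS_BUCKET_NAME")
region = os.getenv("AWS_REGION", "us-east-1")
secure = os.getenv("AWS_SECURE", "false").lower() == "true"
```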
hivemind_etl/mediawiki/workflows.py (3)
49-51: Clear documentation of S3 storage implementation.

The updated comment correctly indicates that the extraction step now stores data in S3 rather than just extracting it, which helps other developers understand the workflow's data flow.
61-63: Good adaptation of transformation step for S3 integration.

The comment and return value handling are updated to reflect that transformed data is now stored in S3, with the activity returning a key reference instead of the actual data objects.
72-74: Proper data flow implementation for S3 integration.

The workflow now:
- Stores the transformed data key in the mediawiki_platform dictionary
- Updates the comment to clarify that data is loaded from S3
- Passes the necessary reference information to the load activity
This approach properly decouples the data storage from the workflow logic.
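For orientation, the described hand-off could be wired roughly as below in a Temporal Python workflow. The activity names come from the sequence diagram; the class name, timeout, and overall structure are placeholders, not the repository's actual workflow code.

```python
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class MediaWikiETLWorkflowSketch:
    """Sketch only: illustrates the key hand-off described above."""

    @workflow.run
    async def run(self, mediawiki_platform: dict) -> None:
        # The transform activity stores its output in S3 and returns only the object key.
        transformed_data_key = await workflow.execute_activity(
            "transform_mediawiki_data",
            mediawiki_platform,
            start_to_close_timeout=timedelta(hours=1),
        )
        # Pass the key along so the load activity can fetch the data back from S3.
        mediawiki_platform["transformed_data_key"] = transformed_data_key
        await workflow.execute_activity(
            "load_mediawiki_data",
            mediawiki_platform,
            start_to_close_timeout=timedelta(hours=1),
        )
```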
hivemind_etl/storage/s3_client.py (1)
112-123: Preserve original stack-trace when re-raising & correct return typing
`raise` without `from` hides the originating exception context; Ruff flagged this (B904).
Additionally, `get_data_by_key` can return either a `dict` or a `list` (for transformed docs), so the return type should be `Any | None`.

```diff
- raise
+ raise e from None
```

and

```diff
-def get_data_by_key(self, key: str) -> Dict[str, Any]:
+from typing import Any, Optional
+
+def get_data_by_key(self, key: str) -> Optional[Any]:
```

Consider updating the docstring accordingly.
🧰 Tools
🪛 Ruff (0.11.9)
120-120: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
docker-compose.dev.yml (1)
132-140: Future-dated MinIO image tag looks suspicious
`minio/minio:RELEASE.2025-04-22T22-12-26Z` references a future release date. Pinning to a non-existent tag will break `docker compose pull`. Prefer the latest stable or an existing SHA-pinned tag, e.g.:

`image: minio/minio:RELEASE.2024-04-19T02-46-22Z`

or simply `minio/minio:latest` for dev.
Actionable comments posted: 0
♻️ Duplicate comments (1)
hivemind_etl/storage/s3_client.py (1)
66-71: ⚠️ Potential issue
Fix bucket creation for `us-east-1` region

AWS requires omitting `CreateBucketConfiguration` when the region is `us-east-1`; passing it causes an `InvalidLocationConstraint` (400) error. MinIO (used in dev) might ignore this field, but the AWS backend will fail.

```diff
- self.s3_client.create_bucket(
-     Bucket=self.bucket_name,
-     CreateBucketConfiguration={"LocationConstraint": self.region},
- )
+ if self.region == "us-east-1":
+     self.s3_client.create_bucket(Bucket=self.bucket_name)
+ else:
+     self.s3_client.create_bucket(
+         Bucket=self.bucket_name,
+         CreateBucketConfiguration={"LocationConstraint": self.region},
+     )
```
🧹 Nitpick comments (3)
hivemind_etl/storage/s3_client.py (3)
120-120: Improve exception handling with explicit chaining

When re-raising exceptions inside an except block, it's best practice to use explicit exception chaining with `raise ... from` to maintain the exception context.

```diff
- raise ValueError(f"No data found for key: {key}")
+ raise ValueError(f"No data found for key: {key}") from None
```

🧰 Tools
🪛 Ruff (0.11.9)
120-120: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
112-123: Missing method to retrieve latest data by prefix

The class lacks a method to retrieve the latest stored data for a given prefix. This functionality would be valuable for the ETL workflow to retrieve the most recently extracted or transformed data.

Would you like me to provide an implementation for a `get_latest_data` method that correctly retrieves the latest object by timestamp? This method should:
- List all objects with a given prefix
- Sort by last modified date (descending)
- Return the most recent object's data
🧰 Tools
🪛 Ruff (0.11.9)
120-120: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
13-76: Consider adding documentation for the class and logging successful operations

The class would benefit from docstring documentation explaining its purpose and usage. Additionally, consider adding logging for successful store operations to maintain consistent logging throughout the class.

```diff
 class S3Client:
+    """
+    Client for interacting with S3-compatible storage services.
+
+    This class handles operations for storing and retrieving data in an S3 bucket
+    for the ETL pipeline, managing extracted and transformed data by community ID.
+    It requires AWS credentials and connection details provided via environment variables.
+    """
     def __init__(self):
```

And for store operations:

```diff
         )
+        logging.info(f"Successfully stored extracted data at key: {key}")
         return key
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
hivemind_etl/storage/s3_client.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
hivemind_etl/storage/s3_client.py
120-120: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / test / Test
- GitHub Check: ci / lint / Lint
Actionable comments posted: 3
🔭 Outside diff range comments (1)
hivemind_etl/storage/s3_client.py (1)
125-125: 🛠️ Refactor suggestion
Add missing `get_latest_data` method

Based on the previous review comments, a `get_latest_data` method was planned but is missing from this implementation.

Here's an implementation that correctly returns the latest data by using the `LastModified` timestamp:

```python
def get_latest_data(self, community_id: str, activity_type: str) -> Dict[str, Any]:
    """Get the latest data for a community_id and activity_type."""
    prefix = f"{community_id}/{activity_type}/"
    try:
        response = self.s3_client.list_objects_v2(
            Bucket=self.bucket_name,
            Prefix=prefix,
        )
        if "Contents" not in response or len(response["Contents"]) == 0:
            logging.error(f"No data found for prefix: {prefix}")
            return None

        latest_key = max(
            response["Contents"],
            key=lambda obj: obj["LastModified"],
        )["Key"]
        logging.info(f"Found latest key: {latest_key}")
        return self.get_data_by_key(latest_key)
    except ClientError as e:
        logging.error(f"Error listing objects with prefix {prefix}: {str(e)}")
        raise
```
♻️ Duplicate comments (1)
hivemind_etl/storage/s3_client.py (1)
70-73: ⚠️ Potential issue
Fix bucket creation for `us-east-1` region

AWS S3 requires omitting `CreateBucketConfiguration` when the region is `us-east-1`. Although the previous review comment was marked as fixed, the code still doesn't handle this special case.

```diff
- self.s3_client.create_bucket(
-     Bucket=self.bucket_name,
-     CreateBucketConfiguration={"LocationConstraint": self.region},
- )
+ if self.region == "us-east-1":
+     self.s3_client.create_bucket(Bucket=self.bucket_name)
+ else:
+     self.s3_client.create_bucket(
+         Bucket=self.bucket_name,
+         CreateBucketConfiguration={"LocationConstraint": self.region},
+     )
```
🧹 Nitpick comments (1)
hivemind_etl/storage/s3_client.py (1)
122-122: Use exception chaining with `raise ... from e`

To properly maintain the exception chain for debugging, use the `raise ... from e` syntax.

```diff
- raise ValueError(f"No data found for key: {key}")
+ raise ValueError(f"No data found for key: {key}") from e
```

🧰 Tools
🪛 Ruff (0.11.9)
122-122: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
hivemind_etl/storage/s3_client.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
hivemind_etl/storage/s3_client.py
122-122: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / lint / Lint
- GitHub Check: ci / test / Test
Summary by CodeRabbit
New Features
Chores
- Added new dependencies for AWS S3 integration (`boto3`, `botocore`).
- Updated `.gitignore` to exclude MinIO data and dump files.