@amindadgar amindadgar commented May 4, 2025

  • update qdrant collection name to platform id
  • update mediawiki and website activities to include platform id
  • update mediawiki and website module to include platform id
  • update mediawiki and website workflows to include platform id

Summary by CodeRabbit

  • New Features

    • Added support for using a configurable platform ID when processing and storing both MediaWiki and website data.
    • The structure of returned data from platform discovery now includes a platform ID for each community.
  • Documentation

    • Updated example values in documentation to use generic placeholders for platform and community IDs.
  • Refactor

    • Enhanced logging to include platform ID for improved traceability.
    • Updated function and class signatures to require a platform ID parameter for website data processing.

coderabbitai bot commented May 4, 2025

📥 Commits

Reviewing files that changed from the base of the PR and between 653a992 and 64c27a6.

📒 Files selected for processing (1)
  • hivemind_etl/website/website_etl.py (1 hunks)

Walkthrough

This set of changes introduces the platform_id parameter throughout the MediaWiki and website ETL pipelines. The platform_id is now extracted and passed through relevant activity functions, constructors, and workflow calls, replacing previously hardcoded collection names with this dynamic value. The data structures returned by get_learning_platforms methods are updated to include platform_id. Function signatures and class constructors have been modified where needed to accommodate this new parameter, and logging statements now include platform_id for improved traceability.

Changes

| File(s) | Change Summary |
|---|---|
| hivemind_etl/mediawiki/activities.py | Extracts platform_id from input and passes it to MediawikiETL in all ETL activity functions. |
| hivemind_etl/mediawiki/etl.py | Updates MediawikiETL constructor to require platform_id; uses it as the collection name in the load process. |
| hivemind_etl/mediawiki/module.py | Updates get_learning_platforms to include platform_id in returned dictionaries and docstring examples. |
| hivemind_etl/website/activities.py | Adds platform_id as a required parameter to all website ETL activity functions; updates logging to include platform_id. |
| hivemind_etl/website/module.py | Updates docstring example in get_learning_platforms to use generic placeholder IDs. |
| hivemind_etl/website/website_etl.py | Updates WebsiteETL constructor to require platform_id; uses it as the collection name instead of a hardcoded value. |
| hivemind_etl/website/workflows.py | Passes platform_id to activity calls in CommunityWebsiteWorkflow.run. |
| tests/integration/test_mediawiki_modules.py | Adds assertions to verify the presence and correctness of platform_id in returned data structures. |
| tests/unit/test_mediawiki_etl.py | Adds platform_id parameter to all MediawikiETL instantiations in tests. |
| tests/unit/test_website_etl.py | Updates WebsiteETL instantiation in tests to include platform_id. |

Sequence Diagram(s)

sequenceDiagram
    participant Workflow
    participant Activities
    participant ETLClass
    participant IngestionPipeline

    Workflow->>Activities: extract_website(urls, community_id, platform_id)
    Activities->>ETLClass: WebsiteETL(community_id, platform_id)
    Activities->>Activities: transform_website_data(raw_data, community_id, platform_id)
    Activities->>ETLClass: WebsiteETL(community_id, platform_id)
    Activities->>Activities: load_website_data(documents, community_id, platform_id)
    Activities->>ETLClass: WebsiteETL(community_id, platform_id)
    ETLClass->>IngestionPipeline: CustomIngestionPipeline(collection_name=platform_id)
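
The load leg of the diagram can be sketched roughly as follows. This is an illustration only, not the code in this PR: `FakeIngestionPipeline`, `WebsiteETLSketch`, `run_pipeline`, and `load_website_data_sketch` are hypothetical stand-ins, and the real activities are presumably asynchronous workflow activities rather than plain functions.

```python
from dataclasses import dataclass, field


@dataclass
class FakeIngestionPipeline:
    """Hypothetical stand-in for CustomIngestionPipeline; it only records its inputs."""

    community_id: str
    collection_name: str
    stored: list = field(default_factory=list)

    def run_pipeline(self, documents: list) -> None:
        # A real pipeline would write to the vector store; here we just keep the docs.
        self.stored.extend(documents)


class WebsiteETLSketch:
    """Simplified stand-in for WebsiteETL(community_id, platform_id)."""

    def __init__(self, community_id: str, platform_id: str) -> None:
        self.community_id = community_id
        self.platform_id = platform_id
        # The collection name is taken from platform_id instead of a hardcoded value.
        self.pipeline = FakeIngestionPipeline(community_id, collection_name=platform_id)

    def load(self, documents: list) -> None:
        self.pipeline.run_pipeline(documents)


def load_website_data_sketch(documents: list, community_id: str, platform_id: str) -> FakeIngestionPipeline:
    """Synchronous sketch of the load step: build the ETL object and load documents."""
    etl = WebsiteETLSketch(community_id, platform_id)
    etl.load(documents)
    return etl.pipeline
```

Calling `load_website_data_sketch(["doc"], "community-1", "platform-1")` returns a pipeline whose `collection_name` is `"platform-1"`, which is the behavior the last arrow of the diagram describes.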

Possibly related PRs

  • #15: Introduces the initial implementation of the MediawikiETL class and its core ETL methods, which are extended in this PR with the new platform_id parameter.
  • #22: Modifies the MediawikiETL constructor usage by adding a namespaces argument; both PRs update the same activity functions to pass additional parameters to MediawikiETL.
  • #21: Changes Mediawiki ETL activities to accept a single dictionary parameter instead of separate parameters; related to parameter handling in the same functions.

Poem

In the warren of code, a new path we find,
With platform_id hopping through functions entwined.
No more hardcoded fields, just dynamic delight,
Each ETL bunny now names collections right.
🐇 With IDs in tow, we leap and we bound—
Platform-aware pipelines, robust and sound!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
hivemind_etl/website/website_etl.py (1)

21-25: 🛠️ Refactor suggestion

Add validation for platform_id similar to community_id

While community_id is validated as a non-empty string, platform_id has no equivalent check, so an empty or None value would pass through unnoticed and could lead to unexpected behavior downstream.

  if not community_id or not isinstance(community_id, str):
      raise ValueError("community_id must be a non-empty string")

+ if not platform_id or not isinstance(platform_id, str):
+     raise ValueError("platform_id must be a non-empty string")

  self.community_id = community_id
  self.platform_id = platform_id
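
As a self-contained sketch of what the suggested check would do, the snippet below applies the same rule to both identifiers. `ValidatedWebsiteETL` and `_require_non_empty_str` are hypothetical names used for illustration; the actual WebsiteETL constructor does more than this.

```python
def _require_non_empty_str(name: str, value: object) -> str:
    """Return value unchanged if it is a non-empty string, else raise ValueError."""
    if not value or not isinstance(value, str):
        raise ValueError(f"{name} must be a non-empty string")
    return value


class ValidatedWebsiteETL:
    """Hypothetical, stripped-down constructor showing only the validation step."""

    def __init__(self, community_id: str, platform_id: str) -> None:
        # Both identifiers go through the same check, so neither can be
        # empty, None, or a non-string by the time they are stored.
        self.community_id = _require_non_empty_str("community_id", community_id)
        self.platform_id = _require_non_empty_str("platform_id", platform_id)
```

With this in place, `ValidatedWebsiteETL("community-1", "")` and `ValidatedWebsiteETL("community-1", None)` both raise `ValueError` instead of silently producing an unusable collection name.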
🧹 Nitpick comments (3)
hivemind_etl/website/activities.py (3)

59-60: Include platform_id in error logging

While the success logs include platform_id, the error logs don't. For consistency and better error tracing, include platform_id in the error logs as well.

- logging.error(f"Error in extraction for community {community_id}: {str(e)}")
+ logging.error(f"Error in extraction for community {community_id} | platform {platform_id}: {str(e)}")

77-78: Include platform_id in error logging

For consistency with the success logs, include platform_id in the error logs as well.

- logging.error(f"Error in transformation for community {community_id}: {str(e)}")
+ logging.error(f"Error in transformation for community {community_id} | platform {platform_id}: {str(e)}")

94-95: Include platform_id in error logging

For consistency with the success logs, include platform_id in the error logs as well.

- logging.error(f"Error in data load for community {community_id}: {str(e)}")
+ logging.error(f"Error in data load for community {community_id} | platform {platform_id}: {str(e)}")
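
Since the three suggestions share one message shape, they could be served by a single helper. The sketch below is one possible way to do that, not something this PR contains: `format_step_error` and `run_step` are hypothetical names, and re-raising after logging is an assumption about the desired behavior.

```python
import logging


def format_step_error(step: str, community_id: str, platform_id: str, exc: Exception) -> str:
    """Build the error message with both identifiers, matching the suggested format."""
    return f"Error in {step} for community {community_id} | platform {platform_id}: {str(exc)}"


def run_step(step: str, community_id: str, platform_id: str, func):
    """Run func(); on failure, log with community and platform context, then re-raise."""
    try:
        return func()
    except Exception as e:
        logging.error(format_step_error(step, community_id, platform_id, e))
        raise
```

Used as `run_step("extraction", community_id, platform_id, some_callable)`, a failure is logged with the same community and platform context that the success logs already carry.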
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40a8f51 and 37dd1b0.

📒 Files selected for processing (7)
  • hivemind_etl/mediawiki/activities.py (4 hunks)
  • hivemind_etl/mediawiki/etl.py (2 hunks)
  • hivemind_etl/mediawiki/module.py (2 hunks)
  • hivemind_etl/website/activities.py (3 hunks)
  • hivemind_etl/website/module.py (1 hunks)
  • hivemind_etl/website/website_etl.py (2 hunks)
  • hivemind_etl/website/workflows.py (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
hivemind_etl/mediawiki/activities.py (1)
hivemind_etl/mediawiki/etl.py (1)
  • MediawikiETL (11-108)
hivemind_etl/website/activities.py (1)
hivemind_etl/website/website_etl.py (1)
  • WebsiteETL (9-107)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: ci / lint / Lint
  • GitHub Check: ci / test / Test
🔇 Additional comments (22)
hivemind_etl/website/website_etl.py (1)

28-30: Good implementation of dynamic collection naming

The collection name is now properly derived from the platform_id instead of being hardcoded, which aligns with the PR objective of updating Qdrant collection naming to use platform ID.

hivemind_etl/website/workflows.py (3)

32-33: Correctly updated activity execution with platform_id parameter

The platform_id is now properly passed to the extract_website activity, ensuring it uses the correct collection name.


43-44: Correctly updated activity execution with platform_id parameter

The platform_id is now properly passed to the transform_website_data activity, maintaining consistent context throughout the workflow.


54-55: Correctly updated activity execution with platform_id parameter

The platform_id is now properly passed to the load_website_data activity, completing the workflow chain with consistent platform context.

hivemind_etl/website/activities.py (6)

46-48: Function signature correctly updated with platform_id parameter

The extract_website activity function now properly accepts the platform_id parameter, aligning with the changes in the WebsiteETL class and workflows.


52-55: Enhanced logging and properly updated ETL instantiation

The logging now includes platform_id for better traceability, and the WebsiteETL is correctly instantiated with both community_id and platform_id.


64-66: Function signature correctly updated with platform_id parameter

The transform_website_data activity function now properly accepts the platform_id parameter, maintaining consistency across the ETL pipeline.


69-73: Enhanced logging and properly updated ETL instantiation

The logging now includes platform_id for better traceability, and the WebsiteETL is correctly instantiated with both community_id and platform_id.


82-84: Function signature correctly updated with platform_id parameter

The load_website_data activity function now properly accepts the platform_id parameter, completing the consistent updates across all activities.


87-91: Enhanced logging and properly updated ETL instantiation

The logging now includes platform_id for better traceability, and the WebsiteETL is correctly instantiated with both community_id and platform_id.

hivemind_etl/website/module.py (1)

32-34: Good documentation update with generic placeholders

Replacing specific IDs with generic placeholders in the documentation is a good practice to avoid exposing potentially sensitive information.

hivemind_etl/mediawiki/activities.py (6)

61-61: LGTM: platform_id extraction from mediawiki_platform dictionary

The addition of platform_id extraction is consistent with the updated ETL constructor requirements.


66-70: LGTM: platform_id now passed to MediawikiETL constructor

The updated constructor parameters correctly include the platform_id, which is now required by the MediawikiETL class.


86-86: LGTM: platform_id extraction in transform_mediawiki_data function

Consistent extraction of platform_id from the mediawiki_platform dictionary.


91-95: LGTM: MediawikiETL constructor correctly includes platform_id

The platform_id is properly passed to the constructor, consistent with the changes in the extract_mediawiki function.


108-108: LGTM: platform_id extraction in load_mediawiki_data function

Consistent extraction of platform_id from the mediawiki_platform dictionary.


117-121: LGTM: MediawikiETL constructor correctly includes platform_id

The platform_id is properly passed to the constructor, ensuring consistency across all three activity functions.
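
The pattern repeated across the three activities, reading platform_id from the input dictionary and forwarding it to the constructor, might look like the sketch below. The key names `community_id` and `namespaces` and the helper `build_etl_kwargs` are assumptions for illustration; only `platform_id` being read from the mediawiki_platform dictionary is stated in this review.

```python
def build_etl_kwargs(mediawiki_platform: dict) -> dict:
    """Collect the constructor arguments for a MediawikiETL-like class from the input dict.

    A missing "platform_id" raises KeyError here, which surfaces the problem
    before any ETL work starts.
    """
    return {
        "community_id": mediawiki_platform["community_id"],
        "platform_id": mediawiki_platform["platform_id"],
        # Assumed optional key; defaults to an empty list when absent.
        "namespaces": mediawiki_platform.get("namespaces", []),
    }
```

Sharing one helper like this across extract, transform, and load would keep the three activities from drifting apart when another constructor parameter is added later.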

hivemind_etl/mediawiki/etl.py (3)

16-17: LGTM: Added platform_id parameter to constructor

The platform_id parameter is correctly added as a required parameter in the MediawikiETL constructor.


20-20: LGTM: Storing platform_id as instance attribute

The platform_id is properly stored as an instance attribute for later use in the ETL process.


101-101: LGTM: Using platform_id as collection name

The platform_id is now correctly used as the collection name in the CustomIngestionPipeline, replacing the previously hardcoded collection name.

hivemind_etl/mediawiki/module.py (2)

31-33: LGTM: Updated example documentation

The example documentation is properly updated to include the platform_id field, providing clear guidance for users of this method.


91-92: LGTM: Added platform_id to returned dictionary

The platform_id is correctly added to the communities_data dictionary and properly converted to a string, consistent with how other ID fields are handled.
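
A minimal sketch of the string conversion described above, assuming the stored identifiers are objects (for example database IDs) whose `str()` gives the canonical form. `FakeObjectId`, `to_community_entry`, and the document field names `_id` and `community` are hypothetical and not taken from this PR.

```python
class FakeObjectId:
    """Hypothetical stand-in for a database ID type that is not a plain string."""

    def __init__(self, value: str) -> None:
        self._value = value

    def __str__(self) -> str:
        return self._value


def to_community_entry(platform_doc: dict) -> dict:
    """Build one entry of the returned list, converting both IDs to strings."""
    return {
        "community_id": str(platform_doc["community"]),
        "platform_id": str(platform_doc["_id"]),
    }
```

Converting at this boundary means downstream code, such as a constructor that requires a non-empty string, never has to deal with the raw ID type.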

Accepting platform id in test cases!
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37dd1b0 and 33099f7.

📒 Files selected for processing (3)
  • tests/integration/test_mediawiki_modules.py (5 hunks)
  • tests/unit/test_mediawiki_etl.py (8 hunks)
  • tests/unit/test_website_etl.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/unit/test_website_etl.py (1)
hivemind_etl/website/website_etl.py (1)
  • WebsiteETL (9-107)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (4)
tests/unit/test_website_etl.py (1)

17-18: LGTM: Test correctly updated to use platform_id parameter.

The test setup has been properly modified to include the new platform_id parameter when initializing the WebsiteETL class, which aligns with the changes made to the actual implementation. The test value "test_platform" is appropriate for testing purposes.

tests/unit/test_mediawiki_etl.py (1)

15-15: Good addition of platform_id in setUp method.

This ensures all test methods can consistently use the same platform_id parameter.

tests/integration/test_mediawiki_modules.py (2)

70-70: Good addition of platform_id assertion.

The test now correctly verifies that the platform_id is included in the returned data, which aligns with the updated implementation.


150-151: Assertions properly updated to include platform_id.

These changes ensure that all test cases consistently verify the presence and correctness of the platform_id field in the returned data structures, which is essential for validating the platform-specific processing.

Also applies to: 159-160, 243-244, 325-326
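
For readers without the test file open, assertions of this kind could look like the sketch below. It is not the repository's test: `get_learning_platforms_stub` is a hypothetical replacement for the real method, and the placeholder IDs are made up.

```python
import unittest


def get_learning_platforms_stub() -> list:
    """Hypothetical stub returning data shaped like the updated method's output."""
    return [{"community_id": "community-placeholder", "platform_id": "platform-placeholder"}]


class TestPlatformIdPresent(unittest.TestCase):
    def test_platform_id_is_returned(self):
        result = get_learning_platforms_stub()

        self.assertEqual(len(result), 1)
        # Presence check first, so a missing key fails with a clear message.
        self.assertIn("platform_id", result[0])
        self.assertEqual(result[0]["platform_id"], "platform-placeholder")
        self.assertIsInstance(result[0]["platform_id"], str)
```

Checking both presence and value, as the updated assertions are described as doing, catches a missing key as well as a wrong or unconverted ID.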

@amindadgar amindadgar merged commit a2b8526 into main May 6, 2025
3 checks passed
@amindadgar amindadgar linked an issue May 7, 2025 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Rename summarizer workflow and activities

2 participants