-
Notifications
You must be signed in to change notification settings - Fork 0
feat: update qdrant platform data naming! #34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- update qdrant collection name to platform id - update mediawiki and website activities to include platform id - update mediawiki and website module to include platform id - update mediawiki and website workflows to include platform id
|
Warning Rate limit exceeded@amindadgar has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 19 minutes and 1 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughThis set of changes introduces the Changes
Sequence Diagram(s)sequenceDiagram
participant Workflow
participant Activities
participant ETLClass
participant IngestionPipeline
Workflow->>Activities: extract_website(urls, community_id, platform_id)
Activities->>ETLClass: WebsiteETL(community_id, platform_id)
Activities->>Activities: transform_website_data(raw_data, community_id, platform_id)
Activities->>ETLClass: WebsiteETL(community_id, platform_id)
Activities->>Activities: load_website_data(documents, community_id, platform_id)
Activities->>ETLClass: WebsiteETL(community_id, platform_id)
ETLClass->>IngestionPipeline: CustomIngestionPipeline(collection_name=platform_id)
Possibly related PRs
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
hivemind_etl/website/website_etl.py (1)
21-25: 🛠️ Refactor suggestionAdd validation for platform_id similar to community_id
While community_id has validation to ensure it's a non-empty string, the platform_id parameter lacks similar validation, which could lead to unexpected behavior if empty or null values are passed.
if not community_id or not isinstance(community_id, str): raise ValueError("community_id must be a non-empty string") + if not platform_id or not isinstance(platform_id, str): + raise ValueError("platform_id must be a non-empty string") self.community_id = community_id self.platform_id = platform_id
🧹 Nitpick comments (3)
hivemind_etl/website/activities.py (3)
59-60: Include platform_id in error loggingWhile the success logs include platform_id, the error logs don't. For consistency and better error tracing, include platform_id in the error logs as well.
- logging.error(f"Error in extraction for community {community_id}: {str(e)}") + logging.error(f"Error in extraction for community {community_id} | platform {platform_id}: {str(e)}")
77-78: Include platform_id in error loggingFor consistency with the success logs, include platform_id in the error logs as well.
- logging.error(f"Error in transformation for community {community_id}: {str(e)}") + logging.error(f"Error in transformation for community {community_id} | platform {platform_id}: {str(e)}")
94-95: Include platform_id in error loggingFor consistency with the success logs, include platform_id in the error logs as well.
- logging.error(f"Error in data load for community {community_id}: {str(e)}") + logging.error(f"Error in data load for community {community_id} | platform {platform_id}: {str(e)}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
hivemind_etl/mediawiki/activities.py(4 hunks)hivemind_etl/mediawiki/etl.py(2 hunks)hivemind_etl/mediawiki/module.py(2 hunks)hivemind_etl/website/activities.py(3 hunks)hivemind_etl/website/module.py(1 hunks)hivemind_etl/website/website_etl.py(2 hunks)hivemind_etl/website/workflows.py(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
hivemind_etl/mediawiki/activities.py (1)
hivemind_etl/mediawiki/etl.py (1)
MediawikiETL(11-108)
hivemind_etl/website/activities.py (1)
hivemind_etl/website/website_etl.py (1)
WebsiteETL(9-107)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / lint / Lint
- GitHub Check: ci / test / Test
🔇 Additional comments (22)
hivemind_etl/website/website_etl.py (1)
28-30: Good implementation of dynamic collection namingThe collection name is now properly derived from the platform_id instead of being hardcoded, which aligns with the PR objective of updating Qdrant collection naming to use platform ID.
hivemind_etl/website/workflows.py (3)
32-33: Correctly updated activity execution with platform_id parameterThe platform_id is now properly passed to the extract_website activity, ensuring it uses the correct collection name.
43-44: Correctly updated activity execution with platform_id parameterThe platform_id is now properly passed to the transform_website_data activity, maintaining consistent context throughout the workflow.
54-55: Correctly updated activity execution with platform_id parameterThe platform_id is now properly passed to the load_website_data activity, completing the workflow chain with consistent platform context.
hivemind_etl/website/activities.py (6)
46-48: Function signature correctly updated with platform_id parameterThe extract_website activity function now properly accepts the platform_id parameter, aligning with the changes in the WebsiteETL class and workflows.
52-55: Enhanced logging and properly updated ETL instantiationThe logging now includes platform_id for better traceability, and the WebsiteETL is correctly instantiated with both community_id and platform_id.
64-66: Function signature correctly updated with platform_id parameterThe transform_website_data activity function now properly accepts the platform_id parameter, maintaining consistency across the ETL pipeline.
69-73: Enhanced logging and properly updated ETL instantiationThe logging now includes platform_id for better traceability, and the WebsiteETL is correctly instantiated with both community_id and platform_id.
82-84: Function signature correctly updated with platform_id parameterThe load_website_data activity function now properly accepts the platform_id parameter, completing the consistent updates across all activities.
87-91: Enhanced logging and properly updated ETL instantiationThe logging now includes platform_id for better traceability, and the WebsiteETL is correctly instantiated with both community_id and platform_id.
hivemind_etl/website/module.py (1)
32-34: Good documentation update with generic placeholdersReplacing specific IDs with generic placeholders in the documentation is a good practice to avoid exposing potentially sensitive information.
hivemind_etl/mediawiki/activities.py (6)
61-61: LGTM: platform_id extraction from mediawiki_platform dictionaryThe addition of platform_id extraction is consistent with the updated ETL constructor requirements.
66-70: LGTM: platform_id now passed to MediawikiETL constructorThe updated constructor parameters correctly include the platform_id, which is now required by the MediawikiETL class.
86-86: LGTM: platform_id extraction in transform_mediawiki_data functionConsistent extraction of platform_id from the mediawiki_platform dictionary.
91-95: LGTM: MediawikiETL constructor correctly includes platform_idThe platform_id is properly passed to the constructor, consistent with the changes in the extract_mediawiki function.
108-108: LGTM: platform_id extraction in load_mediawiki_data functionConsistent extraction of platform_id from the mediawiki_platform dictionary.
117-121: LGTM: MediawikiETL constructor correctly includes platform_idThe platform_id is properly passed to the constructor, ensuring consistency across all three activity functions.
hivemind_etl/mediawiki/etl.py (3)
16-17: LGTM: Added platform_id parameter to constructorThe platform_id parameter is correctly added as a required parameter in the MediawikiETL constructor.
20-20: LGTM: Storing platform_id as instance attributeThe platform_id is properly stored as an instance attribute for later use in the ETL process.
101-101: LGTM: Using platform_id as collection nameThe platform_id is now correctly used as the collection name in the CustomIngestionPipeline, replacing the previously hardcoded collection name.
hivemind_etl/mediawiki/module.py (2)
31-33: LGTM: Updated example documentationThe example documentation is properly updated to include the platform_id field, providing clear guidance for users of this method.
91-92: LGTM: Added platform_id to returned dictionaryThe platform_id is correctly added to the communities_data dictionary and properly converted to a string, consistent with how other ID fields are handled.
Accepting platform id in test cases!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/integration/test_mediawiki_modules.py(5 hunks)tests/unit/test_mediawiki_etl.py(8 hunks)tests/unit/test_website_etl.py(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/unit/test_website_etl.py (1)
hivemind_etl/website/website_etl.py (1)
WebsiteETL(9-107)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / test / Test
🔇 Additional comments (4)
tests/unit/test_website_etl.py (1)
17-18: LGTM: Test correctly updated to use platform_id parameter.The test setup has been properly modified to include the new
platform_idparameter when initializing theWebsiteETLclass, which aligns with the changes made to the actual implementation. The test value "test_platform" is appropriate for testing purposes.tests/unit/test_mediawiki_etl.py (1)
15-15: Good addition of platform_id in setUp method.This ensures all test methods can consistently use the same platform_id parameter.
tests/integration/test_mediawiki_modules.py (2)
70-70: Good addition of platform_id assertion.The test now correctly verifies that the platform_id is included in the API response, which aligns with the updated implementation.
150-151: Assertions properly updated to include platform_id.These changes ensure that all test cases consistently verify the presence and correctness of the platform_id field in the returned data structures, which is essential for validating the platform-specific processing.
Also applies to: 159-160, 243-244, 325-326
Summary by CodeRabbit
New Features
Documentation
Refactor