-
Notifications
You must be signed in to change notification settings - Fork 0
feat: enhance community fetching with platform filtering and update w… #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis pull request adds an optional parameter for filtering communities and platforms by a specified platform ID. The changes propagate through the ETL layer and website module, updating function signatures, docstrings, and logging. Additionally, the scheduling logic in the workflow and test files is adjusted to incorporate the new parameter, with modified timing and retry configurations. A new integration test confirms that filtering returns the expected single-platform data when multiple platforms exist. Changes
Sequence Diagram(s)sequenceDiagram
participant S as Scheduler
participant W as Workflow
participant A as Activities (get_communities)
participant M as ModulesWebsite
S->>W: run(platform_id)
W->>A: execute_activity("get_communities", platform_id)
A->>M: get_learning_platforms(filter_platform_id=platform_id)
M-->>A: Return filtered community data
A-->>W: Return community data
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
hivemind_etl/website/module.py (2)
11-14: Consider using more specific type hints.The return type hint could be more specific by using TypedDict to define the exact structure of the returned dictionary.
+from typing import TypedDict + +class CommunityData(TypedDict): + community_id: str + platform_id: str + urls: list[str] - ) -> list[dict[str, str | list[str]]]: + ) -> list[CommunityData]:
51-54: Optimize filtering by moving it before expensive operations.The filtering condition should be moved before the metadata fetch operation to avoid unnecessary database calls.
platform_id = platform["platform"] - # if we needed to get specific platforms - if filter_platform_id and filter_platform_id != str(platform_id): - continue - try: + # if we needed to get specific platforms + if filter_platform_id and filter_platform_id != str(platform_id): + continue + website_links = self.get_platform_metadata(tests/integration/test_module_website.py (2)
132-137: Consider renaming test method for clarity and conciseness.The current test name is quite long. Consider a more concise name that still captures the intent, such as
test_get_platforms_with_filterortest_platform_filtering.Additionally, the docstring could be more precise about what is being tested:
- def test_get_website_communities_data_module_multiple_platforms_filtered_one(self): + def test_get_platforms_with_filter(self): """ - Test get_learning_platforms when a community has multiple platforms but with a platform filter applied - Verifies that only website platform data is returned even when - other platform types exist. + Test that get_learning_platforms correctly filters results when: + - Multiple 'website' platforms exist for a community + - A specific platform_id filter is applied + Expected: Only data for the specified platform should be returned """
132-204: Add test coverage for edge cases and error scenarios.The current test covers the happy path but misses important edge cases. Consider adding tests for:
- Invalid platform ID format
- Non-existent platform ID
- Empty resources array
- Null/undefined resources
Would you like me to help generate additional test cases for these scenarios?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
hivemind_etl/activities.py(1 hunks)hivemind_etl/website/module.py(2 hunks)test_run_workflow.py(1 hunks)tests/integration/test_module_website.py(1 hunks)workflows.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / test / Test
- GitHub Check: ci / lint / Lint
🔇 Additional comments (3)
hivemind_etl/activities.py (1)
16-42: LGTM! Well-documented changes with proper error handling.The changes maintain consistency with the module layer and include appropriate logging for filtered ingestion.
workflows.py (1)
77-80:Details
❓ Verification inconclusive
Verify the need for increased timeout and retries.
The significant changes to timeout (20 minutes) and retries (3 attempts) might indicate underlying performance or reliability issues:
- A 20-minute timeout is quite long for fetching community data
- Multiple retries might mask underlying problems
Could you help verify:
- Are there known performance issues causing timeouts?
- What types of failures are being retried?
🏁 Script executed:
#!/bin/bash # Search for timeout-related issues in logs or error handling rg -A 5 "timeout|timed out" . # Look for error patterns in exception handling rg -A 5 "Error fetching communities:" .Length of output: 5177
Action Required: Validate Timeout and Retry Configurations in Workflows
After reviewing the repository’s timeout and retry settings, it appears that the 20‑minute
start_to_close_timeoutandRetryPolicy(maximum_attempts=3)inworkflows.pyare deliberate adjustments to support a specific long‑running workflow—likely distinct from other, shorter timeout configurations (5–10 minutes) used elsewhere. The grep output also indicates that different components (e.g., container healthchecks and asynchronous crawlers) use a variety of timeout values, suggesting that these differences are intentional rather than symptomatic of general performance issues.To ensure these settings are appropriate, please:
- Confirm that the 20‑minute timeout is indeed necessary for the community data fetch tasks and not an overestimation.
- Clarify which failure scenarios specifically require up to three retry attempts, as this contrasts with the single retry attempts seen in other workflows.
Based on current evidence, there’s no direct sign of widespread performance or reliability issues triggering these choices. However, verifying that these values align with the intended operational behavior would help avoid inadvertently masking underlying problems.
tests/integration/test_module_website.py (1)
189-191: LGTM! Clear and explicit test assertion.The test clearly demonstrates the new filtering functionality by passing a specific platform ID and verifying the filtered results.
…orkflow timeout
Summary by CodeRabbit