fix: re-instantiating the crawler per each route! #12
Conversation
It seems the crawler was doing some caching (or limiting the URLs to 20) even when more URLs were requested to be crawled. Now we re-instantiate the crawler client, so it crawls up to 20 URLs for each given route.
Walkthrough: The changes refactor the way the crawler client is instantiated, creating a fresh CrawleeClient for each route instead of reusing a single persistent instance.
Actionable comments posted: 0
🧹 Nitpick comments (2)
hivemind_etl/website/website_etl.py (2)
56-57: Improved logging and client usage. The updated logging statement now includes the specific route being crawled, which enhances clarity and makes debugging easier. The change to use the local crawlee_client instance is consistent with the instantiation change.

However, consider adding error handling for the crawl operation to improve robustness.
```diff
 for url in urls:
     crawlee_client = CrawleeClient()
     logging.info(f"Crawling {url} and its routes!")
-    data = await crawlee_client.crawl(links=[url])
-    logging.info(f"{len(data)} data is extracted for route: {url}")
+    try:
+        data = await crawlee_client.crawl(links=[url])
+        logging.info(f"{len(data)} data is extracted for route: {url}")
+    except Exception as e:
+        logging.error(f"Error crawling {url}: {str(e)}")
+        continue
     extracted_data.extend(data)
```
52-65: Consider performance optimization for multiple URLs. The current implementation processes URLs sequentially. For better performance with multiple URLs, consider implementing concurrent processing using asyncio's gathering capabilities.
```diff
 if not urls:
     raise ValueError("No URLs provided for crawling")

 extracted_data = []
-for url in urls:
-    crawlee_client = CrawleeClient()
-    logging.info(f"Crawling {url} and its routes!")
-    data = await crawlee_client.crawl(links=[url])
-    logging.info(f"{len(data)} data is extracted for route: {url}")
-    extracted_data.extend(data)
+
+async def crawl_url(url: str) -> list[dict[str, Any]]:
+    crawlee_client = CrawleeClient()
+    logging.info(f"Crawling {url} and its routes!")
+    try:
+        data = await crawlee_client.crawl(links=[url])
+        logging.info(f"{len(data)} data is extracted for route: {url}")
+        return data
+    except Exception as e:
+        logging.error(f"Error crawling {url}: {str(e)}")
+        return []
+
+import asyncio
+results = await asyncio.gather(*[crawl_url(url) for url in urls])
+for result in results:
+    extracted_data.extend(result)
```

This approach would crawl all URLs concurrently, potentially reducing the total execution time significantly.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
hivemind_etl/website/website_etl.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / test / Test
🔇 Additional comments (2)
hivemind_etl/website/website_etl.py (2)
27-27: Good removal of the persistent client instance. Removing the persistent CrawleeClient instance aligns with the PR objective of fixing the URL limitation issue. This change prevents the client from maintaining state across multiple crawl operations.
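For reference, a minimal sketch of the kind of constructor this implies; the class name WebsiteETL and its community_id parameter are assumptions for illustration, not taken from the diff:

```python
class WebsiteETL:
    def __init__(self, community_id: str) -> None:
        # Assumed constructor shape, for illustration only.
        self.community_id = community_id
        # Before this PR, a shared client was created here and reused across
        # every extract() call:
        #     self.crawlee_client = CrawleeClient()
        # After this PR, no client is stored on the instance; extract() builds
        # a fresh CrawleeClient for each route instead.
```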
54-54: Effective implementation of creating a fresh client per URL. Creating a new CrawleeClient instance for each URL ensures that any internal caching or state limitations are reset between crawl operations. This directly addresses the issue mentioned in the PR objectives where the crawler was limiting URLs to 20 even when more were requested.
Actionable comments posted: 1
🧹 Nitpick comments (1)
workflows.py (1)
47-54: Consider increasing maximum_attempts for transform_data. While you've increased the timeout for transform_data from 5 to 10 minutes, you've kept maximum_attempts at 1, meaning there are no retries for this activity if it fails.
Consider increasing maximum_attempts to match the other activities for consistency and resilience:
```diff
 retry_policy=RetryPolicy(
     initial_interval=timedelta(seconds=5),
     maximum_interval=timedelta(minutes=2),
-    maximum_attempts=1,
+    maximum_attempts=3,
 ),
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- hivemind_etl/website/website_etl.py (2 hunks)
- tests/unit/test_website_etl.py (2 hunks)
- workflows.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hivemind_etl/website/website_etl.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / test / Test
- GitHub Check: ci / lint / Lint
🔇 Additional comments (3)
workflows.py (3)
59-65: Verify if the extremely long timeout (60 minutes) for load_data is necessary. The timeout for the load_data activity has been significantly increased from 5 to 60 minutes. While the re-instantiation of the crawler per route may result in more data being processed, this is a substantial increase.
Consider profiling the load_data activity with the new implementation to confirm whether such a long timeout is really needed or if a more moderate increase would be sufficient. Excessive timeouts can unnecessarily tie up resources.
37-43: LGTM: Increased timeout and retries for extraction are appropriate. The increased timeout (30 minutes) and retry attempts (3) for the extract_website activity align well with the changes to re-instantiate the crawler client per route. This provides more resilience and accommodates the potentially increased processing time.
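For context, a rough sketch of what such an activity invocation can look like with the Temporal Python SDK. The workflow class, activity name, and argument shape are assumptions for illustration; the timeout and attempt count come from this review, and the retry intervals mirror the ones shown in the transform_data diff above:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy


@workflow.defn
class WebsiteIngestionWorkflow:  # hypothetical workflow class for illustration
    @workflow.run
    async def run(self, community_id: str) -> None:
        # Longer timeout and more attempts absorb the extra work of
        # re-instantiating the crawler client for every route.
        await workflow.execute_activity(
            "extract_website",  # referenced by name; real signature assumed
            community_id,
            start_to_close_timeout=timedelta(minutes=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=5),
                maximum_interval=timedelta(minutes=2),
                maximum_attempts=3,
            ),
        )
```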
89-89: ❓ Verification inconclusive
Ensure workflow ID uniqueness is maintained
The workflow ID format has been changed from a timestamp-based approach to just using the community ID. This could potentially lead to conflicts if multiple workflows for the same community are started in close succession.
Consider adding a timestamp or unique identifier if there's any possibility of concurrent workflows for the same community.
🏁 Script executed:
```bash
#!/bin/bash
# Search for any code that might rely on workflow ID uniqueness
rg "website:ingestor:" -A 3 -B 3
```

Length of output: 436
Workflow ID Uniqueness Concern in workflows.py

At line 89 the workflow ID is set using:
id=f"website:ingestor:{community['community_id']}"This change means that the ID now depends solely on the community ID. If there is any chance that multiple workflows for the same community might be initiated closely in time, the current scheme could lead to conflicts. Please verify whether concurrent workflows for the same community are a possibility. If they are, consider appending an additional unique element such as a timestamp or UUID to the workflow ID to guarantee uniqueness.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unit/test_website_etl.py (1)
4-4: Remove unused pytest import. The pytest module is imported but not used anywhere in the file. Consider removing this import to keep the codebase clean.

```diff
-import pytest
```

🧰 Tools
🪛 Ruff (0.8.2)
4-4: pytest imported but unused

Remove unused import: pytest (F401)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/unit/test_website_etl.py (2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
tests/unit/test_website_etl.py (2)
hivemind_etl/website/crawlee_client.py (1)
crawl (78-116)

hivemind_etl/website/website_etl.py (1)

extract (31-64)
🪛 Ruff (0.8.2)
tests/unit/test_website_etl.py
4-4: pytest imported but unused
Remove unused import: pytest
(F401)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / test / Test
🔇 Additional comments (1)
tests/unit/test_website_etl.py (1)
33-44: LGTM! Excellent test refactoring. The test has been properly updated to reflect the new implementation where CrawleeClient is instantiated per route. The use of patch to mock the class itself rather than the instance is the correct approach, and the additional assertions provide better test coverage by verifying both instantiation and method calls.

This implementation perfectly addresses the previous review comment and aligns with the PR objective of re-instantiating the crawler for each route.
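As a rough illustration of that patching pattern (the patch target, class names, and call signatures below are assumptions, not the repository's actual test code):

```python
from unittest import IsolatedAsyncioTestCase
from unittest.mock import AsyncMock, patch


class TestWebsiteETLExtract(IsolatedAsyncioTestCase):
    # Patch the class where website_etl looks it up, so each CrawleeClient()
    # call inside extract() returns the mock instance.
    @patch("hivemind_etl.website.website_etl.CrawleeClient")
    async def test_extract_instantiates_client_per_url(self, mock_client_cls):
        mock_client_cls.return_value.crawl = AsyncMock(
            return_value=[{"url": "https://a.example", "text": "hello"}]
        )

        from hivemind_etl.website.website_etl import WebsiteETL  # class name assumed

        etl = WebsiteETL(community_id="1234")  # constructor signature assumed
        data = await etl.extract(urls=["https://a.example", "https://b.example"])

        # One fresh client per route, and one crawl call per client.
        self.assertEqual(mock_client_cls.call_count, 2)
        self.assertEqual(mock_client_cls.return_value.crawl.await_count, 2)
        self.assertEqual(len(data), 2)
```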
Summary by CodeRabbit
Removed the unused pytest import and updated the test_extract method to use mocking for improved test isolation and accuracy.