Conversation

@amindadgar
Member

@amindadgar amindadgar commented Apr 2, 2025

It seems the crawler was doing some caching (or limiting the URLs to 20) even when more URLs were requested to be crawled. Now we're re-instantiating the crawler client, so it will crawl at most 20 URLs for each given route.

Summary by CodeRabbit

  • Refactor
    • Modified the extraction process to instantiate processing components on-demand for each URL.
    • Enhanced logging to display specific route details during extraction, improving the clarity of operational feedback.
  • Bug Fixes
    • Adjusted timeout durations and retry policies for various activities in workflows to improve execution flow and error handling.
  • Tests
    • Introduced pytest module and updated the test_extract method to use mocking for improved test isolation and accuracy.

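For reference, a minimal sketch of the per-route instantiation described above, written as a standalone coroutine rather than the actual WebsiteETL.extract method; the import path and the crawl(links=...) signature come from the review below, everything else is illustrative:

import logging
from typing import Any

from hivemind_etl.website.crawlee_client import CrawleeClient


async def extract(urls: list[str]) -> list[dict[str, Any]]:
    if not urls:
        raise ValueError("No URLs provided for crawling")

    extracted_data: list[dict[str, Any]] = []
    for url in urls:
        # a fresh client per route so no caching or URL limit carries over
        crawlee_client = CrawleeClient()
        logging.info(f"Crawling {url} and its routes!")
        data = await crawlee_client.crawl(links=[url])
        logging.info(f"{len(data)} data is extracted for route: {url}")
        extracted_data.extend(data)
    return extracted_data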
@coderabbitai

coderabbitai bot commented Apr 2, 2025

Walkthrough

The changes refactor the way the CrawleeClient is instantiated within the WebsiteETL class. The persistent client in the constructor is removed, and a new local instance is now created within the extract method for each URL processed. Additionally, the logging output in the extract method has been updated to include the specific route being crawled, thereby improving the clarity of the log messages. Modifications are also made to timeout durations and retry policies in the workflow classes.

Changes

  • hivemind_etl/website/website_etl.py: Moved CrawleeClient instantiation from class level (in __init__) to method level (in extract) and enhanced logging to include the specific route.
  • workflows.py: Adjusted timeout durations and retry policies for the extract_website, transform_data, and load_data activities; modified the get_communities timeout and the child workflow ID format.
  • tests/unit/test_website_etl.py: Introduced pytest and modified the test_extract method to mock the CrawleeClient class instead of its instance, improving test isolation.

Possibly related PRs

Poem

In my burrow, codes now gleam,
Each extraction fresh like a dream.
New Crawlee hops in for every route,
With logs so clear, there's no doubt.
I dance through changes, soft and light,
A techy rabbit in the code-filled night!
🐇✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
hivemind_etl/website/website_etl.py (2)

56-57: Improved logging and client usage.

The updated logging statement now includes the specific route being crawled, which enhances clarity and makes debugging easier. The change to use the local crawlee_client instance is consistent with the instantiation change.

However, consider adding error handling for the crawl operation to improve robustness.

 for url in urls:
     crawlee_client = CrawleeClient()
     logging.info(f"Crawling {url} and its routes!")
-    data = await crawlee_client.crawl(links=[url])
-    logging.info(f"{len(data)} data is extracted for route: {url}")
+    try:
+        data = await crawlee_client.crawl(links=[url])
+        logging.info(f"{len(data)} data is extracted for route: {url}")
+    except Exception as e:
+        logging.error(f"Error crawling {url}: {str(e)}")
+        continue
     extracted_data.extend(data)

52-65: Consider performance optimization for multiple URLs.

The current implementation processes URLs sequentially. For better performance with multiple URLs, consider implementing concurrent processing using asyncio's gathering capabilities.

 if not urls:
     raise ValueError("No URLs provided for crawling")

 extracted_data = []
-for url in urls:
-    crawlee_client = CrawleeClient()
-    logging.info(f"Crawling {url} and its routes!")
-    data = await crawlee_client.crawl(links=[url])
-    logging.info(f"{len(data)} data is extracted for route: {url}")
-    extracted_data.extend(data)
+
+async def crawl_url(url: str) -> list[dict[str, Any]]:
+    crawlee_client = CrawleeClient()
+    logging.info(f"Crawling {url} and its routes!")
+    try:
+        data = await crawlee_client.crawl(links=[url])
+        logging.info(f"{len(data)} data is extracted for route: {url}")
+        return data
+    except Exception as e:
+        logging.error(f"Error crawling {url}: {str(e)}")
+        return []
+
+import asyncio
+results = await asyncio.gather(*[crawl_url(url) for url in urls])
+for result in results:
+    extracted_data.extend(result)

This approach would crawl all URLs concurrently, potentially reducing the total execution time significantly.
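If concurrency is adopted, a bounded variant may be safer than an unlimited gather. The sketch below assumes the same CrawleeClient behavior as the diff above and adds an asyncio.Semaphore to cap simultaneous crawls; max_concurrency is an illustrative parameter, not something from the repository:

import asyncio
import logging
from typing import Any

from hivemind_etl.website.crawlee_client import CrawleeClient


async def crawl_all(urls: list[str], max_concurrency: int = 5) -> list[dict[str, Any]]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def crawl_url(url: str) -> list[dict[str, Any]]:
        async with semaphore:  # at most max_concurrency crawls run at once
            crawlee_client = CrawleeClient()
            logging.info(f"Crawling {url} and its routes!")
            try:
                return await crawlee_client.crawl(links=[url])
            except Exception as e:
                logging.error(f"Error crawling {url}: {e}")
                return []

    results = await asyncio.gather(*[crawl_url(url) for url in urls])
    return [item for result in results for item in result]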

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c751cfa and 612e73a.

📒 Files selected for processing (1)
  • hivemind_etl/website/website_etl.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (2)
hivemind_etl/website/website_etl.py (2)

27-27: Good removal of the persistent client instance.

Removing the persistent CrawleeClient instance aligns with the PR objective of fixing the URL limitation issue. This change prevents the client from maintaining state across multiple crawl operations.


54-54: Effective implementation of creating a fresh client per URL.

Creating a new CrawleeClient instance for each URL ensures that any internal caching or state limitations are reset between crawl operations. This directly addresses the issue mentioned in the PR objectives where the crawler was limiting URLs to 20 even when more were requested.

@amindadgar amindadgar linked an issue Apr 2, 2025 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
workflows.py (1)

47-54: Consider increasing maximum_attempts for transform_data

While you've increased the timeout for transform_data from 5 to 10 minutes, you've kept maximum_attempts at 1, meaning there are no retries for this activity if it fails.

Consider increasing maximum_attempts to match the other activities for consistency and resilience:

retry_policy=RetryPolicy(
    initial_interval=timedelta(seconds=5),
    maximum_interval=timedelta(minutes=2),
-   maximum_attempts=1,
+   maximum_attempts=3,
),
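For illustration, a hedged sketch of how the suggested retry settings could look in a Temporal workflow. The activity name transform_data comes from this PR's summary; the workflow class name and argument shape are invented for the example and do not match the repository's workflows.py:

from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy


@workflow.defn
class TransformSketchWorkflow:  # illustrative name, not the repository's workflow
    @workflow.run
    async def run(self, extracted_data: list) -> list:
        # "transform_data" is the activity name from this PR's summary;
        # the argument shape here is assumed.
        return await workflow.execute_activity(
            "transform_data",
            extracted_data,
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=5),
                maximum_interval=timedelta(minutes=2),
                maximum_attempts=3,  # the reviewer's suggested value
            ),
        )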
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 612e73a and eafeb6c.

📒 Files selected for processing (3)
  • hivemind_etl/website/website_etl.py (2 hunks)
  • tests/unit/test_website_etl.py (2 hunks)
  • workflows.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • hivemind_etl/website/website_etl.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: ci / test / Test
  • GitHub Check: ci / lint / Lint
🔇 Additional comments (3)
workflows.py (3)

59-65: Verify if the extremely long timeout (60 minutes) for load_data is necessary

The timeout for the load_data activity has been significantly increased from 5 to 60 minutes. While the re-instantiation of the crawler per route may result in more data being processed, this is a substantial increase.

Consider profiling the load_data activity with the new implementation to confirm whether such a long timeout is really needed or if a more moderate increase would be sufficient. Excessive timeouts can unnecessarily tie up resources.


37-43: LGTM: Increased timeout and retries for extraction are appropriate

The increased timeout (30 minutes) and retry attempts (3) for the extract_website activity align well with the changes to re-instantiate the crawler client per route. This provides more resilience and accommodates the potentially increased processing time.


89-89:

❓ Verification inconclusive

Ensure workflow ID uniqueness is maintained

The workflow ID format has been changed from a timestamp-based approach to just using the community ID. This could potentially lead to conflicts if multiple workflows for the same community are started in close succession.

Consider adding a timestamp or unique identifier if there's any possibility of concurrent workflows for the same community.


🏁 Script executed:

#!/bin/bash
# Search for any code that might rely on workflow ID uniqueness
rg "website:ingestor:" -A 3 -B 3

Length of output: 436


Workflow ID Uniqueness Concern in workflows.py

At line 89 the workflow ID is set using:

id=f"website:ingestor:{community['community_id']}"

This change means that the ID now depends solely on the community ID. If there is any chance that multiple workflows for the same community might be initiated closely in time, the current scheme could lead to conflicts. Please verify whether concurrent workflows for the same community are a possibility. If they are, consider appending an additional unique element such as a timestamp or UUID to the workflow ID to guarantee uniqueness.
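One possible way to keep the ID unique per start, assuming a random suffix is acceptable to the downstream tooling (a timestamp would work equally well); the helper function name is illustrative:

import uuid


def build_workflow_id(community: dict) -> str:
    # a random suffix so two starts for the same community never collide
    return f"website:ingestor:{community['community_id']}:{uuid.uuid4()}"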


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unit/test_website_etl.py (1)

4-4: Remove unused pytest import.

The pytest module is imported but not used anywhere in the file. Consider removing this import to keep the codebase clean.

-import pytest
🧰 Tools
🪛 Ruff (0.8.2)

4-4: pytest imported but unused

Remove unused import: pytest

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eafeb6c and 06f0ebb.

📒 Files selected for processing (1)
  • tests/unit/test_website_etl.py (2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
tests/unit/test_website_etl.py (2)
hivemind_etl/website/crawlee_client.py (1)
  • crawl (78-116)
hivemind_etl/website/website_etl.py (1)
  • extract (31-64)
🪛 Ruff (0.8.2)
tests/unit/test_website_etl.py

4-4: pytest imported but unused

Remove unused import: pytest

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (1)
tests/unit/test_website_etl.py (1)

33-44: LGTM! Excellent test refactoring.

The test has been properly updated to reflect the new implementation where CrawleeClient is instantiated per route. The use of patch to mock the class itself rather than the instance is the correct approach, and the additional assertions provide better test coverage by verifying both instantiation and method calls.

This implementation perfectly addresses the previous review comment and aligns with the PR objective of re-instantiating the crawler for each route.
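For reference, a minimal sketch of the patching pattern described above. The patch target, the WebsiteETL constructor arguments, the extract keyword argument, and the return shape are assumptions and may differ from the repository's actual test:

from unittest import IsolatedAsyncioTestCase
from unittest.mock import AsyncMock, MagicMock, patch

from hivemind_etl.website.website_etl import WebsiteETL


class TestWebsiteETLExtract(IsolatedAsyncioTestCase):
    @patch("hivemind_etl.website.website_etl.CrawleeClient")
    async def test_extract(self, mock_client_cls: MagicMock) -> None:
        # every CrawleeClient() call returns the same mock with an awaitable crawl()
        mock_client = mock_client_cls.return_value
        mock_client.crawl = AsyncMock(return_value=[{"url": "https://example.com"}])

        etl = WebsiteETL("community-id")  # constructor arguments are assumed
        result = await etl.extract(urls=["https://example.com"])

        mock_client_cls.assert_called_once()  # a fresh client per route
        mock_client.crawl.assert_awaited_once_with(links=["https://example.com"])
        self.assertEqual(result, [{"url": "https://example.com"}])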

@amindadgar amindadgar merged commit 83ad0ac into main Apr 2, 2025
3 checks passed


Development

Successfully merging this pull request may close these issues.

BUG: limited to extract websites

2 participants