Conversation

@amindadgar
Member

@amindadgar amindadgar commented Apr 2, 2025

It seems the crawler was doing some caching (or limiting the URLs to 20) even when more URLs were requested to be crawled. Now we're re-instantiating the crawler client, so it will crawl at most 20 URLs for each given route.

Summary by CodeRabbit

  • Refactor
    • Modified the extraction process to instantiate processing components on-demand for each URL.
    • Enhanced logging to display specific route details during extraction, improving the clarity of operational feedback.
  • Bug Fixes
    • Adjusted timeout durations and retry policies for various activities in workflows to improve execution flow and error handling.
  • Tests
    • Introduced pytest module and updated the test_extract method to use mocking for improved test isolation and accuracy.

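For reference, a minimal sketch of the per-route instantiation described above, written as a standalone coroutine rather than the actual WebsiteETL.extract method; the import path and the crawl(links=...) signature come from the review below, everything else is illustrative:

import logging
from typing import Any

from hivemind_etl.website.crawlee_client import CrawleeClient


async def extract(urls: list[str]) -> list[dict[str, Any]]:
    if not urls:
        raise ValueError("No URLs provided for crawling")

    extracted_data: list[dict[str, Any]] = []
    for url in urls:
        # a fresh client per route so no caching or URL limit carries over
        crawlee_client = CrawleeClient()
        logging.info(f"Crawling {url} and its routes!")
        data = await crawlee_client.crawl(links=[url])
        logging.info(f"{len(data)} data is extracted for route: {url}")
        extracted_data.extend(data)
    return extracted_data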
@coderabbitai

coderabbitai bot commented Apr 2, 2025

Walkthrough

The changes refactor the way the CrawleeClient is instantiated within the WebsiteETL class. The persistent client in the constructor is removed, and a new local instance is now created within the extract method for each URL processed. Additionally, the logging output in the extract method has been updated to include the specific route being crawled, thereby improving the clarity of the log messages. Modifications are also made to timeout durations and retry policies in the workflow classes.

Changes

  • hivemind_etl/website/website_etl.py: Moved CrawleeClient instantiation from class level (in __init__) to method level (in extract) and enhanced logging to include the specific route.
  • workflows.py: Adjusted timeout durations and retry policies for the extract_website, transform_data, and load_data activities; modified the get_communities timeout and the child workflow ID format.
  • tests/unit/test_website_etl.py: Introduced pytest and modified the test_extract method to mock the CrawleeClient class instead of its instance, improving test isolation.

Possibly related PRs

Poem

In my burrow, codes now gleam,
Each extraction fresh like a dream.
New Crawlee hops in for every route,
With logs so clear, there's no doubt.
I dance through changes, soft and light,
A techy rabbit in the code-filled night!
🐇✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
hivemind_etl/website/website_etl.py (2)

56-57: Improved logging and client usage.

The updated logging statement now includes the specific route being crawled, which enhances clarity and makes debugging easier. The change to use the local crawlee_client instance is consistent with the instantiation change.

However, consider adding error handling for the crawl operation to improve robustness.

 for url in urls:
     crawlee_client = CrawleeClient()
     logging.info(f"Crawling {url} and its routes!")
-    data = await crawlee_client.crawl(links=[url])
-    logging.info(f"{len(data)} data is extracted for route: {url}")
+    try:
+        data = await crawlee_client.crawl(links=[url])
+        logging.info(f"{len(data)} data is extracted for route: {url}")
+    except Exception as e:
+        logging.error(f"Error crawling {url}: {str(e)}")
+        continue
     extracted_data.extend(data)

52-65: Consider performance optimization for multiple URLs.

The current implementation processes URLs sequentially. For better performance with multiple URLs, consider implementing concurrent processing using asyncio's gathering capabilities.

 if not urls:
     raise ValueError("No URLs provided for crawling")

 extracted_data = []
-for url in urls:
-    crawlee_client = CrawleeClient()
-    logging.info(f"Crawling {url} and its routes!")
-    data = await crawlee_client.crawl(links=[url])
-    logging.info(f"{len(data)} data is extracted for route: {url}")
-    extracted_data.extend(data)
+
+async def crawl_url(url: str) -> list[dict[str, Any]]:
+    crawlee_client = CrawleeClient()
+    logging.info(f"Crawling {url} and its routes!")
+    try:
+        data = await crawlee_client.crawl(links=[url])
+        logging.info(f"{len(data)} data is extracted for route: {url}")
+        return data
+    except Exception as e:
+        logging.error(f"Error crawling {url}: {str(e)}")
+        return []
+
+import asyncio
+results = await asyncio.gather(*[crawl_url(url) for url in urls])
+for result in results:
+    extracted_data.extend(result)

This approach would crawl all URLs concurrently, potentially reducing the total execution time significantly.
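If concurrency is adopted, a bounded variant may be safer than an unlimited gather. The sketch below assumes the same CrawleeClient behavior as the diff above and adds an asyncio.Semaphore to cap simultaneous crawls; max_concurrency is an illustrative parameter, not something from the repository:

import asyncio
import logging
from typing import Any

from hivemind_etl.website.crawlee_client import CrawleeClient


async def crawl_all(urls: list[str], max_concurrency: int = 5) -> list[dict[str, Any]]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def crawl_url(url: str) -> list[dict[str, Any]]:
        async with semaphore:  # at most max_concurrency crawls run at once
            crawlee_client = CrawleeClient()
            logging.info(f"Crawling {url} and its routes!")
            try:
                return await crawlee_client.crawl(links=[url])
            except Exception as e:
                logging.error(f"Error crawling {url}: {e}")
                return []

    results = await asyncio.gather(*[crawl_url(url) for url in urls])
    return [item for result in results for item in result]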

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c751cfa and 612e73a.

📒 Files selected for processing (1)
  • hivemind_etl/website/website_etl.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (2)
hivemind_etl/website/website_etl.py (2)

27-27: Good removal of the persistent client instance.

Removing the persistent CrawleeClient instance aligns with the PR objective of fixing the URL limitation issue. This change prevents the client from maintaining state across multiple crawl operations.


54-54: Effective implementation of creating a fresh client per URL.

Creating a new CrawleeClient instance for each URL ensures that any internal caching or state limitations are reset between crawl operations. This directly addresses the issue mentioned in the PR objectives where the crawler was limiting URLs to 20 even when more were requested.

@amindadgar amindadgar linked an issue Apr 2, 2025 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
workflows.py (1)

47-54: Consider increasing maximum_attempts for transform_data

While you've increased the timeout for transform_data from 5 to 10 minutes, you've kept maximum_attempts at 1, meaning there are no retries for this activity if it fails.

Consider increasing maximum_attempts to match the other activities for consistency and resilience:

retry_policy=RetryPolicy(
    initial_interval=timedelta(seconds=5),
    maximum_interval=timedelta(minutes=2),
-   maximum_attempts=1,
+   maximum_attempts=3,
),
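For illustration, a hedged sketch of how the suggested retry settings could look in a Temporal workflow. The activity name transform_data comes from this PR's summary; the workflow class name and argument shape are invented for the example and do not match the repository's workflows.py:

from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy


@workflow.defn
class TransformSketchWorkflow:  # illustrative name, not the repository's workflow
    @workflow.run
    async def run(self, extracted_data: list) -> list:
        # "transform_data" is the activity name from this PR's summary;
        # the argument shape here is assumed.
        return await workflow.execute_activity(
            "transform_data",
            extracted_data,
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=5),
                maximum_interval=timedelta(minutes=2),
                maximum_attempts=3,  # the reviewer's suggested value
            ),
        )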
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 612e73a and eafeb6c.

📒 Files selected for processing (3)
  • hivemind_etl/website/website_etl.py (2 hunks)
  • tests/unit/test_website_etl.py (2 hunks)
  • workflows.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • hivemind_etl/website/website_etl.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: ci / test / Test
  • GitHub Check: ci / lint / Lint
🔇 Additional comments (3)
workflows.py (3)

59-65: Verify if the extremely long timeout (60 minutes) for load_data is necessary

The timeout for the load_data activity has been significantly increased from 5 to 60 minutes. While the re-instantiation of the crawler per route may result in more data being processed, this is a substantial increase.

Consider profiling the load_data activity with the new implementation to confirm whether such a long timeout is really needed or if a more moderate increase would be sufficient. Excessive timeouts can unnecessarily tie up resources.


37-43: LGTM: Increased timeout and retries for extraction are appropriate

The increased timeout (30 minutes) and retry attempts (3) for the extract_website activity align well with the changes to re-instantiate the crawler client per route. This provides more resilience and accommodates the potentially increased processing time.


89-89:

❓ Verification inconclusive

Ensure workflow ID uniqueness is maintained

The workflow ID format has been changed from a timestamp-based approach to just using the community ID. This could potentially lead to conflicts if multiple workflows for the same community are started in close succession.

Consider adding a timestamp or unique identifier if there's any possibility of concurrent workflows for the same community.


🏁 Script executed:

#!/bin/bash
# Search for any code that might rely on workflow ID uniqueness
rg "website:ingestor:" -A 3 -B 3

Length of output: 436


Workflow ID Uniqueness Concern in workflows.py

At line 89 the workflow ID is set using:

id=f"website:ingestor:{community['community_id']}"

This change means that the ID now depends solely on the community ID. If there is any chance that multiple workflows for the same community might be initiated closely in time, the current scheme could lead to conflicts. Please verify whether concurrent workflows for the same community are a possibility. If they are, consider appending an additional unique element such as a timestamp or UUID to the workflow ID to guarantee uniqueness.
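One possible way to keep the ID unique per start, assuming a random suffix is acceptable to the downstream tooling (a timestamp would work equally well); the helper function name is illustrative:

import uuid


def build_workflow_id(community: dict) -> str:
    # a random suffix so two starts for the same community never collide
    return f"website:ingestor:{community['community_id']}:{uuid.uuid4()}"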


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unit/test_website_etl.py (1)

4-4: Remove unused pytest import.

The pytest module is imported but not used anywhere in the file. Consider removing this import to keep the codebase clean.

-import pytest
🧰 Tools
🪛 Ruff (0.8.2)

4-4: pytest imported but unused

Remove unused import: pytest

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eafeb6c and 06f0ebb.

📒 Files selected for processing (1)
  • tests/unit/test_website_etl.py (2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
tests/unit/test_website_etl.py (2)
hivemind_etl/website/crawlee_client.py (1)
  • crawl (78-116)
hivemind_etl/website/website_etl.py (1)
  • extract (31-64)
🪛 Ruff (0.8.2)
tests/unit/test_website_etl.py

4-4: pytest imported but unused

Remove unused import: pytest

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (1)
tests/unit/test_website_etl.py (1)

33-44: LGTM! Excellent test refactoring.

The test has been properly updated to reflect the new implementation where CrawleeClient is instantiated per route. The use of patch to mock the class itself rather than the instance is the correct approach, and the additional assertions provide better test coverage by verifying both instantiation and method calls.

This implementation perfectly addresses the previous review comment and aligns with the PR objective of re-instantiating the crawler for each route.
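For reference, a minimal sketch of the patching pattern described above. The patch target, the WebsiteETL constructor arguments, the extract keyword argument, and the return shape are assumptions and may differ from the repository's actual test:

from unittest import IsolatedAsyncioTestCase
from unittest.mock import AsyncMock, MagicMock, patch

from hivemind_etl.website.website_etl import WebsiteETL


class TestWebsiteETLExtract(IsolatedAsyncioTestCase):
    @patch("hivemind_etl.website.website_etl.CrawleeClient")
    async def test_extract(self, mock_client_cls: MagicMock) -> None:
        # every CrawleeClient() call returns the same mock with an awaitable crawl()
        mock_client = mock_client_cls.return_value
        mock_client.crawl = AsyncMock(return_value=[{"url": "https://example.com"}])

        etl = WebsiteETL("community-id")  # constructor arguments are assumed
        result = await etl.extract(urls=["https://example.com"])

        mock_client_cls.assert_called_once()  # a fresh client per route
        mock_client.crawl.assert_awaited_once_with(links=["https://example.com"])
        self.assertEqual(result, [{"url": "https://example.com"}])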

@amindadgar amindadgar merged commit 83ad0ac into main Apr 2, 2025
3 checks passed


Development

Successfully merging this pull request may close these issues.

BUG: limited to extract websites

2 participants