feat: enhance crawling process with improved data extraction! #9
Conversation
Extracting one URL and its routes at a time, and then merging the results.
Walkthrough

The update modifies WebsiteETL.extract to crawl each URL and its routes individually, logging progress per URL and appending the results to a single extracted_data list before reporting the total count.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant ETL as WebsiteETL
    participant Logger as L
    participant Crawlee as C
    loop For each URL in list
        ETL->>Logger: Log "Crawling <URL> and routes"
        ETL->>Crawlee: crawl(<URL>)
        Crawlee-->>ETL: extracted data
        ETL->>ETL: Append data to extracted_data
    end
    ETL->>Logger: Log "Total documents extracted: count"
```
Actionable comments posted: 0
🧹 Nitpick comments (2)
hivemind_etl/website/website_etl.py (2)
52-57: Consider adding error handling for individual URLs.

While the implementation is good, consider adding try/except blocks to handle failures for individual URLs. This would make the system more resilient by continuing to process remaining URLs even if one fails.
```diff
 extracted_data = []
 for url in urls:
     logging.info(f"Crawling {url} and its routes!")
-    extracted_data.extend(await self.crawlee_client.crawl(links=[url]))
+    try:
+        url_data = await self.crawlee_client.crawl(links=[url])
+        extracted_data.extend(url_data)
+        logging.info(f"Successfully crawled {url} and extracted {len(url_data)} documents")
+    except Exception as e:
+        logging.error(f"Error crawling {url}: {str(e)}")
 logging.info(f"Extracted {len(extracted_data)} documents!")
```
52-57: Consider adding progress tracking for large URL sets.

For improved visibility when processing many URLs, consider adding progress indicators in the logs.
```diff
 extracted_data = []
+total_urls = len(urls)
-for url in urls:
+for i, url in enumerate(urls, 1):
-    logging.info(f"Crawling {url} and its routes!")
+    logging.info(f"Crawling {url} and its routes! ({i}/{total_urls})")
     extracted_data.extend(await self.crawlee_client.crawl(links=[url]))
 logging.info(f"Extracted {len(extracted_data)} documents!")
```
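If both nitpicks were adopted, the resulting WebsiteETL.extract body could look roughly like this (a sketch only; the url_data name and log wording follow the suggestions above, and the return value is assumed):

```python
import logging


# Sketch of WebsiteETL.extract combining both suggestions; shown standalone
# for brevity, but intended as a method on the class.
async def extract(self, urls: list[str]) -> list:
    extracted_data = []
    total_urls = len(urls)
    for i, url in enumerate(urls, 1):
        logging.info(f"Crawling {url} and its routes! ({i}/{total_urls})")
        try:
            url_data = await self.crawlee_client.crawl(links=[url])
            extracted_data.extend(url_data)
            logging.info(
                f"Successfully crawled {url} and extracted {len(url_data)} documents"
            )
        except Exception as e:
            logging.error(f"Error crawling {url}: {e}")
    logging.info(f"Extracted {len(extracted_data)} documents!")
    return extracted_data
```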
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
hivemind_etl/website/website_etl.py (2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
hivemind_etl/website/website_etl.py (1)
hivemind_etl/website/crawlee_client.py (1)
crawl(78-116)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / test / Test
🔇 Additional comments (3)
hivemind_etl/website/website_etl.py (3)
1-1: Good addition of logging module.

The addition of the logging module is appropriate for the new logging statements added to the extract method.
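Note that logging.info only produces visible output if the application configures logging somewhere; a minimal, assumed setup (not part of this PR) would be:

```python
import logging

# Assumed application-level setup; the PR itself only adds `import logging`
# and the logging calls inside WebsiteETL.extract.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
```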
52-56: Good refactoring to improve crawling process.

Changing from a single batch crawl call to processing URLs individually is a strategic improvement. This approach:
- Provides better visibility into the crawling process
- Makes debugging easier by isolating failures to specific URLs
- Allows for more granular progress tracking
The loop implementation is clean and effectively uses the existing crawler API.
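A hypothetical unit-test sketch for this loop (the real coverage lives in tests/unit/test_website_etl.py; the constructor is bypassed here because its arguments are not shown in this diff, and extract is assumed to return the merged list):

```python
from unittest import IsolatedAsyncioTestCase
from unittest.mock import AsyncMock

from hivemind_etl.website.website_etl import WebsiteETL


class TestExtractPerUrl(IsolatedAsyncioTestCase):
    async def test_crawl_called_once_per_url(self):
        # Bypass __init__ and attach a mocked crawlee client directly.
        etl = WebsiteETL.__new__(WebsiteETL)
        etl.crawlee_client = AsyncMock()
        etl.crawlee_client.crawl.return_value = [{"url": "stub", "text": "doc"}]

        result = await etl.extract(["https://a.example", "https://b.example"])

        # One crawl call per URL, each with a single-element links list.
        self.assertEqual(etl.crawlee_client.crawl.await_count, 2)
        etl.crawlee_client.crawl.assert_any_await(links=["https://a.example"])
        etl.crawlee_client.crawl.assert_any_await(links=["https://b.example"])
        self.assertEqual(len(result), 2)
```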
57-57: Useful logging for extraction summary.

Adding a summary log of the total number of documents extracted provides valuable information for monitoring the crawling process and verifying expected results.