Open
Labels: enhancement (New feature or request)
Description:
Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.
Objectives:
- Enable the crawler to fetch and parse content from JavaScript-heavy sites.
- Use a headless browser to render JavaScript content. (explore playwright-python)
- Ensure compatibility with the existing crawler structure and options.
- Maintain the ability to switch between the default fetching method and the headless browser.
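
To make the last objective concrete, here is a minimal sketch of how the switch between the two fetching modes could look. All names (`Fetcher`, `fetch_default`, `fetch_headless`, `select_fetcher`) are hypothetical and not taken from the existing crawler, and the plain `urllib` fetch only stands in for whatever the crawler does today. The headless path assumes playwright-python and its Chromium build are installed.

```python
import asyncio
import urllib.request
from typing import Awaitable, Callable

# Hypothetical common signature: both modes take a URL and return HTML text.
Fetcher = Callable[[str], Awaitable[str]]


async def fetch_default(url: str) -> str:
    """Plain HTTP fetch (stand-in for the current method); no JavaScript runs."""
    def _get() -> str:
        with urllib.request.urlopen(url, timeout=30) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(_get)


async def fetch_headless(url: str) -> str:
    """Render the page in headless Chromium and return the resulting DOM."""
    # Imported lazily so the default mode works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # Waiting for "networkidle" is an assumption; some sites may need
            # a different wait condition.
            await page.goto(url, wait_until="networkidle")
            return await page.content()
        finally:
            await browser.close()


def select_fetcher(use_headless: bool = False) -> Fetcher:
    """Pick the fetching mode from a (hypothetical) crawler option."""
    return fetch_headless if use_headless else fetch_default
```

Because both functions share one signature, the rest of the crawler would only need to call `await fetcher(url)` and would not care which mode is active. This version launches a browser per call for simplicity, which is not what the design considerations propose for the real implementation.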
Design Considerations:
- Single Headless Browser Instance:
  - Use a single headless browser instance to handle multiple asynchronous requests, reducing resource consumption.
- Concurrency Management:
  - Use asyncio and a semaphore to limit concurrent requests within the same browser context.
  - Integrate the asynchronous fetching logic with our existing web crawler structure.
- Error Handling:
  - Ensure proper error handling and resource cleanup. (no zombie browsers, they are already headless :p)
  - Fall back to the default fetching mode when there is an error with the headless browser. (keep the user informed)
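
A rough sketch of how these considerations could fit together, assuming playwright-python. The class name `HeadlessFetcher`, the `fallback` parameter, and the `max_pages` default are all hypothetical; `fallback` stands for whatever coroutine wraps the current default fetching method.

```python
import asyncio
import logging
from typing import Awaitable, Callable

log = logging.getLogger("crawler.headless")


class HeadlessFetcher:
    """One shared browser and context; a semaphore caps pages open at once."""

    def __init__(self, fallback: Callable[[str], Awaitable[str]], max_pages: int = 5):
        self._fallback = fallback
        self._max_pages = max_pages
        self._sem = None
        self._pw = self._browser = self._context = None

    async def __aenter__(self) -> "HeadlessFetcher":
        # Created here so the semaphore belongs to the running event loop.
        self._sem = asyncio.Semaphore(self._max_pages)
        try:
            # Lazy import: a missing Playwright install also triggers fallback.
            from playwright.async_api import async_playwright
            self._pw = await async_playwright().start()
            self._browser = await self._pw.chromium.launch(headless=True)
            self._context = await self._browser.new_context()
        except Exception as exc:
            log.warning("Headless browser unavailable (%s); using default fetching", exc)
            await self.close()
        return self

    async def __aexit__(self, *exc_info) -> None:
        await self.close()

    async def close(self) -> None:
        """Shut down the browser and the Playwright driver, even on errors."""
        try:
            if self._browser is not None:
                await self._browser.close()
        finally:
            if self._pw is not None:
                await self._pw.stop()
            self._pw = self._browser = self._context = None

    async def fetch(self, url: str) -> str:
        if self._context is not None:
            async with self._sem:  # at most max_pages pages rendering at a time
                page = None
                try:
                    page = await self._context.new_page()
                    await page.goto(url, wait_until="networkidle")
                    return await page.content()
                except Exception as exc:
                    # Keep the user informed, then fall through to the default mode.
                    log.warning("Headless fetch failed for %s (%s); falling back", url, exc)
                finally:
                    if page is not None:
                        await page.close()
        return await self._fallback(url)
```

Intended usage would be something like `async with HeadlessFetcher(fallback=default_fetch) as f:` followed by `await asyncio.gather(*(f.fetch(u) for u in urls))`, where `default_fetch` is a placeholder for the existing fetch coroutine. The fallback call sits outside the semaphore on purpose, so a failed render does not hold a browser slot while the plain request runs.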