Skip to content

Feature: Support for crawling dynamic javascript heavy site #10

@indrajithi

Description

@indrajithi

Description:

Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.

Objectives:

  • Enable the crawler to fetch and parse content from JavaScript-heavy sites.
  • Use a headless browser to render JavaScript content. (explore playwright-python)
  • Ensure compatibility with the existing crawler structure and options.
  • Maintain the ability to switch between the default fetching method and the headless browser.

Design Considerations:

  • Single Headless Browser Instance:
    • Use a single instance of a headless browser to handle multiple asynchronous requests, reducing resource consumption.
  • Concurrency Management:
    • Utilize asyncio and a semaphore to manage concurrent requests within the same browser context.
    • Integrate the asynchronous fetching logic with our existing web crawler structure.
  • Error Handling:
    • Ensure proper error handling and resource cleanup. (no zombie browsers, they are already headless :p)
    • Fall back to default fetching mode when there is a error with the headless browser. (keep the user informed)

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions