[Bug]: Docker server is not decoding or applying filter_chain #1419

@Sjoeborg

Description


crawl4ai version

0.7.4

Expected Behavior

The crawl4ai server decodes the filter_chain and executes the deep crawl accordingly.

Current Behavior

  • The Docker server returns a 500 if a filter_chain is specified in the deep_crawl_strategy.

  • Using the REST API, the server performs the crawl but does not take the filter_chain into account (the crawler visits pages even though they match the filter); see the sketch below for the expected exclusion semantics.
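
For reference, a minimal sketch of the expected exclusion semantics of URLPatternFilter(patterns=["*about-us*"], reverse=True), using fnmatch as a stand-in for crawl4ai's own matching (an illustration, not the library's implementation):

from fnmatch import fnmatch

def should_crawl(url: str, patterns: list[str], reverse: bool) -> bool:
    # reverse=True: a URL that matches any pattern should be excluded from the deep crawl
    matched = any(fnmatch(url, p) for p in patterns)
    return not matched if reverse else matched

print(should_crawl("https://example.com/products", ["*about-us*"], reverse=True))       # True  -> crawl
print(should_crawl("https://example.com/about-us/team", ["*about-us*"], reverse=True))  # False -> skip

With the bug, pages like the second example are still crawled.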

Is this reproducible?

Yes

Inputs Causing the Bug

filter_chain = [
    URLPatternFilter(
        patterns=["*about-us*"],
        reverse=True,
    ),
]
crawler_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2, filter_chain=FilterChain(filter_chain)
    ),
    cache_mode=CacheMode.BYPASS,
)

Steps to Reproduce

1. Set up the Docker server (see the reachability check below)
2. Use the Crawl4aiDockerClient and specify a BFSDeepCrawlStrategy with a filter_chain
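
A quick way to confirm the server is up before running the snippets below (the /health path is an assumption; substitute any endpoint known to exist on the server):

import httpx

# should print 200 if the container is reachable on port 11235
print(httpx.get("http://xyz:11235/health", timeout=10).status_code)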

Code snippets

# CODE FOR DOCKER CLIENT ISSUE
import asyncio

from crawl4ai import (  # Assuming you have crawl4ai installed
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
)
from crawl4ai.deep_crawling import (
    BFSDeepCrawlStrategy,
    FilterChain,
    URLPatternFilter,
)
from crawl4ai.docker_client import Crawl4aiDockerClient


async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(
        base_url="http://xyz:11235/",
        verbose=True,
    ) as client:
        # If JWT is enabled on the server, authenticate first:
        await client.authenticate(
            "user@example.com"
        )  # See Server Configuration section

        filter_chain = [
            URLPatternFilter(
                patterns=["*about-us*"],
                reverse=True,
            ),
        ]
        crawler_config = CrawlerRunConfig(
            deep_crawl_strategy=BFSDeepCrawlStrategy(
                max_depth=2, filter_chain=FilterChain(filter_chain)
            ),
            cache_mode=CacheMode.BYPASS,
        )

        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(
                headless=True
            ),  # Use library classes for config aid
            crawler_config=crawler_config,
        )
        if results:  # client.crawl returns None on failure
            print(type(results))
            print(f"Non-streaming results success: {results.success}")
            if results.success:
                for result in results:  # Iterate through the CrawlResultContainer
                    print(result)
        else:
            print("Non-streaming crawl failed.")


if __name__ == "__main__":
    asyncio.run(main())


# CODE FOR REST API ISSUE
import httpx

URL_FILTERS: list[str] = [
    "*privacy*",
    "*terms*",
    "*cookie*",
    "*contact*",
    "*support*",
    "*board*",
    "*blog*",
    "*press*",
    "*career*",
    "*about*",
    "*sustainability*",
    "*governance*",
    "*youtube*",
]
browser_config_payload = {"type": "BrowserConfig", "params": {"headless": True}}
deep_crawl_strategy_payload = {
    "type": "BFSDeepCrawlStrategy",
    "params": {
        "max_depth": 2,
        "max_pages": 10,
        "include_external": True,
        "filter_chain": {
            "type": "FilterChain",
            "params": {
                "filters": [
                    {
                        "type": "URLPatternFilter",
                        "params": {"patterns": URL_FILTERS},
                    }
                ],
            },
        },
    },
}
crawler_config_payload = {
    "type": "CrawlerRunConfig",
    "params": {
        "stream": False,
        "cache_mode": "bypass",
        "deep_crawl_strategy": deep_crawl_strategy_payload,
    },  # Use string value of enum
}


crawl_payload = {
    "urls": ["https://www.drdgold.com/investors/sens-news/2023"],
    "browser_config": browser_config_payload,
    "crawler_config": crawler_config_payload,
}
response = httpx.post(
    "http://xyz:11235/crawl",  # Updated port
    json=crawl_payload,
    timeout=90,
)
print(f"Status Code: {response.status_code}")

OS

Linux

Python version

3.12

Browser

chromium

Browser version

No response

Error logs & Screenshots (if applicable)

Traceback (most recent call last):
  File "/app/api.py", line 433, in handle_crawl_request
    crawler_config = CrawlerRunConfig.load(crawler_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 1553, in load
    config = from_serializable_dict(data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
    k: from_serializable_dict(v) for k, v in data["params"].items()
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
    k: from_serializable_dict(v) for k, v in data["params"].items()
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
    k: from_serializable_dict(v) for k, v in data["params"].items()
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 154, in from_serializable_dict
    return [from_serializable_dict(item) for item in data]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 150, in from_serializable_dict
    return cls(**constructor_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: URLPatternFilter.__init__() got an unexpected keyword argument 'simple_suffixes'
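
The traceback suggests that the serialized URLPatternFilter carries internal attributes (such as simple_suffixes, presumably set inside __init__) and that from_serializable_dict feeds all of them back into the constructor, which does not accept them. A minimal stand-in reproduction of that failure mode, not crawl4ai's actual code:

class StandInFilter:
    def __init__(self, patterns, reverse=False):
        self.patterns = patterns
        self.reverse = reverse
        self.simple_suffixes = set()  # internal attribute, not a constructor argument

dumped = {"type": "StandInFilter", "params": vars(StandInFilter(["*about-us*"], reverse=True))}
try:
    StandInFilter(**dumped["params"])  # mirrors cls(**constructor_args) in from_serializable_dict
except TypeError as exc:
    print(exc)  # ... got an unexpected keyword argument 'simple_suffixes'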

Labels

⚙ Done: Bug fix, enhancement, FR that's completed pending release
🐞 Bug: Something isn't working
📌 Root caused: identified the root cause of bug
