Closed
Labels
⚙ Done (bug fix / enhancement / FR that's completed, pending release) · 🐞 Bug (something isn't working) · 📌 Root caused (root cause of the bug identified)
Description
crawl4ai version
0.7.4
Expected Behavior
The crawl4ai server decodes the filter_chain and executes the deep crawl accordingly.
Current Behavior
- The Docker server returns 500 if a filter_chain is specified in the deep_crawl_strategy.
- Using the REST API, the server performs the crawl but does not take the filter_chain into account (the crawler crawls pages even though they are matched by the filter).
Is this reproducible?
Yes
Inputs Causing the Bug
filter_chain = [
    URLPatternFilter(
        patterns=["*about-us*"],
        reverse=True,
    ),
]
crawler_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2, filter_chain=FilterChain(filter_chain)
    ),
    cache_mode=CacheMode.BYPASS,
)
Steps to Reproduce
1. Set up the Docker server
2. Use the Crawl4aiDockerClient and specify a BFSDeepCrawlStrategy with a filter_chain
Code snippets
# CODE FOR DOCKER CLIENT ISSUE
import asyncio

from crawl4ai import (  # Assuming you have crawl4ai installed
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
)
from crawl4ai.deep_crawling import (
    BFSDeepCrawlStrategy,
    FilterChain,
    URLPatternFilter,
)
from crawl4ai.docker_client import Crawl4aiDockerClient


async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(
        base_url="http://xyz:11235/",
        verbose=True,
    ) as client:
        # If JWT is enabled on the server, authenticate first:
        await client.authenticate(
            "user@example.com"
        )  # See Server Configuration section
        filter_chain = [
            URLPatternFilter(
                patterns=["*about-us*"],
                reverse=True,
            ),
        ]
        crawler_config = CrawlerRunConfig(
            deep_crawl_strategy=BFSDeepCrawlStrategy(
                max_depth=2, filter_chain=FilterChain(filter_chain)
            ),
            cache_mode=CacheMode.BYPASS,
        )
        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(
                headless=True
            ),  # Use library classes for config aid
            crawler_config=crawler_config,
        )
        if results:  # client.crawl returns None on failure
            print(type(results))
            print(f"Non-streaming results success: {results.success}")
            if results.success:
                for result in results:  # Iterate through the CrawlResultContainer
                    print(result)
            else:
                print("Non-streaming crawl failed.")
        # Example Streaming crawl
        print("\n--- Running Streaming Crawl ---")
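        # The streaming example was left unfinished above; a hedged sketch to
        # complete it: per the docker client examples, a config with stream=True
        # makes client.crawl usable as an async generator (clone() and the
        # async-for pattern are taken from the library docs, not verified here).
        stream_config = crawler_config.clone(stream=True)
        async for result in await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=stream_config,
        ):
            print(f"Streamed result: {result.url}")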
if __name__ == "__main__":
    asyncio.run(main())
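For comparison, the same configuration run in-process (no serialization round-trip through the server) should honour the filter. A minimal control sketch, assuming the library-local AsyncWebCrawler entry point and that a non-streaming deep crawl returns a list of results:

# CONTROL: same config run in-process, bypassing the docker serialization path
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, FilterChain, URLPatternFilter


async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            filter_chain=FilterChain(
                [URLPatternFilter(patterns=["*about-us*"], reverse=True)]
            ),
        ),
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # With a deep crawl strategy, arun() returns the crawled pages
        results = await crawler.arun("https://httpbin.org/html", config=config)
        for result in results:
            print(result.url)  # no *about-us* URLs should appear here


asyncio.run(main())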
# CODE FOR REST API ISSUE
import httpx

URL_FILTERS: list = [
    "*privacy*",
    "*terms*",
    "*cookie*",
    "*contact*",
    "*support*",
    "*board*",
    "*blog*",
    "*press*",
    "*career*",
    "*about*",
    "*sustainability*",
    "*governance*",
    "*youtube*",
]
browser_config_payload = {"type": "BrowserConfig", "params": {"headless": True}}
deep_crawl_strategy_payload = {
    "type": "BFSDeepCrawlStrategy",
    "params": {
        "max_depth": 2,
        "max_pages": 10,
        "include_external": True,
        "filter_chain": {
            "type": "FilterChain",
            "params": {
                "filters": [
                    {
                        "type": "URLPatternFilter",
                        "params": {"patterns": URL_FILTERS},
                    }
                ],
            },
        },
    },
}
crawler_config_payload = {
    "type": "CrawlerRunConfig",
    "params": {
        "stream": False,
        "cache_mode": "bypass",  # Use string value of enum
        "deep_crawl_strategy": deep_crawl_strategy_payload,
    },
}
crawl_payload = {
    "urls": ["https://www.drdgold.com/investors/sens-news/2023"],
    "browser_config": browser_config_payload,
    "crawler_config": crawler_config_payload,
}
response = httpx.post(
    "http://xyz:11235/crawl",  # Updated port
    json=crawl_payload,
    timeout=90,
)
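# Hedged follow-up check: assuming the /crawl endpoint returns a JSON body
# shaped like {"success": ..., "results": [{"url": ...}, ...]} (field names
# assumed from the docker API examples, not verified here), this flags URLs
# the filter chain should have excluded but that were crawled anyway:
if response.status_code == 200:
    for item in response.json().get("results", []):
        url = item.get("url", "")
        if any(pattern.strip("*") in url for pattern in URL_FILTERS):
            print(f"Filter ignored; crawled anyway: {url}")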
print(f"Status Code: {response.status_code}")OS
Linux
Python version
3.12
Browser
chromium
Browser version
No response
Error logs & Screenshots (if applicable)
Traceback (most recent call last):
File "/app/api.py", line 433, in handle_crawl_request
crawler_config = CrawlerRunConfig.load(crawler_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 1553, in load
config = from_serializable_dict(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
k: from_serializable_dict(v) for k, v in data["params"].items()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
k: from_serializable_dict(v) for k, v in data["params"].items()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
k: from_serializable_dict(v) for k, v in data["params"].items()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 154, in from_serializable_dict
return [from_serializable_dict(item) for item in data]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 150, in from_serializable_dict
return cls(**constructor_args)
^^^^^^^^^^^^^^^^^^^^^^^
TypeError: URLPatternFilter.__init__() got an unexpected keyword argument 'simple_suffixes'
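The final frame points at the likely root cause: the serialized form of URLPatternFilter carries a derived attribute (simple_suffixes) that its __init__ does not accept, so the server-side CrawlerRunConfig.load() fails while rebuilding the filter. If dump()/load() are the round-trip pair the docker client and server use (an assumption; dump() is taken from the library's docker examples), the error should reproduce locally without any server:

# Hedged local reproduction of the server-side failure (no docker needed)
from crawl4ai import CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, FilterChain, URLPatternFilter

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=FilterChain(
            [URLPatternFilter(patterns=["*about-us*"], reverse=True)]
        ),
    ),
    cache_mode=CacheMode.BYPASS,
)
try:
    # Round-trip the config the way the client/server presumably do
    CrawlerRunConfig.load(config.dump())
except TypeError as exc:
    print(exc)  # expected: unexpected keyword argument 'simple_suffixes'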