Description
crawl4ai version
0.5.0.post2
Expected Behavior
AsyncHTTPCrawlerStrategy should crawl every page passed to arun_many, regardless of how many URLs are supplied. Instead, it consistently fails once more than two pages are processed. The failure occurs across many domains, which suggests a bug in the strategy rather than a problem with any specific website.
Current Behavior
arun_many raises errors when more than two pages are passed: the first two pages are crawled successfully, while the remaining requests fail with HTTPCrawlerError: HTTP request failed: 'NoneType' object has no attribute 'connect'.
Is this reproducible?
Yes
Inputs Causing the Bug
- URL(s): https://example.com, https://en.wikipedia.org/wiki/Main_Page, https://news.ycombinator.com/, https://crawl4ai.com/mkdocs/
Steps to Reproduce
Run the code snippet below to reproduce the bug:
Code snippets
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
import asyncio

# Use the lightweight HTTP crawler strategy instead of the default browser strategy
http_crawler_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "MyCustomBot/1.0"},
    follow_redirects=True,
    verify_ssl=True
)

async def main():
    async with AsyncWebCrawler(
        crawler_strategy=AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)
    ) as crawler:
        config = CrawlerRunConfig(stream=False)
        # Crawling the same URL three times is enough to trigger the failure
        results = await crawler.arun_many(["https://example.com"] * 3, config=config)
        for result in results:
            print(result.url, result.success, result.error_message)

asyncio.run(main())
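
As a possible workaround (untested, and assuming the failure is tied to how arun_many drives the strategy's HTTP session rather than to the requests themselves), the same URLs can be fetched one at a time with arun:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
import asyncio

async def main():
    http_crawler_config = HTTPCrawlerConfig(method="GET")
    async with AsyncWebCrawler(
        crawler_strategy=AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)
    ) as crawler:
        config = CrawlerRunConfig(stream=False)
        # One arun() call per URL instead of a single arun_many() call
        for url in ["https://example.com"] * 3:
            result = await crawler.arun(url, config=config)
            print(url, result.success)

asyncio.run(main())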
OS
macOS
Python version
3.12
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
[INIT].... → Crawl4AI 0.5.0.post2
[FETCH]... ↓ https://example.com... | Status: True | Time: 0.50s
[SCRAPE].. ◆ https://example.com... | Time: 0.003s
[COMPLETE] ● https://example.com... | Status: True | Total: 0.50s
[FETCH]... ↓ https://example.com... | Status: True | Time: 0.31s
[SCRAPE].. ◆ https://example.com... | Time: 0.001s
[COMPLETE] ● https://example.com... | Status: True | Total: 0.32s
[ERROR]... × https://example.com... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 1834 in _handle_http (../../../miniconda3/envs/ai- │
│ assistant/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py): │
│ Error: HTTP request failed: 'NoneType' object has no attribute 'connect' │
│ │
│ Code context: │
│ 1829 await self.hooks['on_error'](e) │
│ 1830 raise ConnectionTimeoutError(f"Request timed out: {str(e)}") │
│ 1831 │
│ 1832 except Exception as e: │
│ 1833 await self.hooks['on_error'](e) │
│ 1834 → raise HTTPCrawlerError(f"HTTP request failed: {str(e)}") │
│ 1835 │
│ 1836 async def crawl( │
│ 1837 self, │
│ 1838 url: str, │
│ 1839 config: Optional[CrawlerRunConfig] = None, │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
[ERROR]... × https://example.com... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 1834 in _handle_http (../../../miniconda3/envs/ai- │
│ assistant/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py): │
│ Error: HTTP request failed: 'NoneType' object has no attribute 'connect' │
│ │
│ Code context: │
│ 1829 await self.hooks['on_error'](e) │
│ 1830 raise ConnectionTimeoutError(f"Request timed out: {str(e)}") │
│ 1831 │
│ 1832 except Exception as e: │
│ 1833 await self.hooks['on_error'](e) │
│ 1834 → raise HTTPCrawlerError(f"HTTP request failed: {str(e)}") │
│ 1835 │
│ 1836 async def crawl( │
│ 1837 self, │
│ 1838 url: str, │
│ 1839 config: Optional[CrawlerRunConfig] = None, │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Status code: 200
Content length: 1256
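
For what it's worth, the traceback shows the strategy awaiting self.hooks['on_error'] before re-raising the exception as HTTPCrawlerError, so the raw underlying exception can be captured by overriding that hook. A minimal sketch, assuming the hooks dict on the strategy instance is writable as the traceback implies:

import traceback
from crawl4ai import HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

strategy = AsyncHTTPCrawlerStrategy(browser_config=HTTPCrawlerConfig(method="GET"))

async def log_raw_error(exc):
    # Print the original exception ("'NoneType' object has no attribute
    # 'connect'") with its full stack before crawl4ai wraps it.
    traceback.print_exception(exc)

strategy.hooks['on_error'] = log_raw_error
# Pass this strategy instance to AsyncWebCrawler as in the snippet above.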