[Bug]: AsyncHTTPCrawlerStrategy fails with arun_many when len(urls)>2 #794

Open
@perretv

Description

crawl4ai version

0.5.0.post2

Expected Behavior

Currently, AsyncHTTPCrawlerStrategy seems to fail when processing more than two web pages.
This is consistent across many domains, which suggests a bug rather than a problem with any specific website.

Current Behavior

The crawler raises errors when more than two pages are passed to arun_many; the first two pages are crawled successfully, and every page after that fails.

Is this reproducible?

Yes

Inputs Causing the Bug

- URL(s): https://example.com, https://en.wikipedia.org/wiki/Main_Page, https://news.ycombinator.com/, https://crawl4ai.com/mkdocs/

Steps to Reproduce

Check the code snippet below to reproduce the bug:

Code snippets

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

# Use the HTTP crawler strategy
http_crawler_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "MyCustomBot/1.0"},
    follow_redirects=True,
    verify_ssl=True,
)

async def main():
    strategy = AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)
    async with AsyncWebCrawler(crawler_strategy=strategy) as crawler:
        config = CrawlerRunConfig(stream=False)
        results = await crawler.arun_many(["https://example.com"] * 3, config=config)

asyncio.run(main())
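For what it's worth, the traceback ("'NoneType' object has no attribute 'connect'") is consistent with the strategy's shared session losing its connection pool after the first two requests, so the third request dereferences None. Below is a minimal, library-free sketch of that suspected failure mode — `FakeSession` and `FakeConnector` are hypothetical stand-ins for illustration only, not crawl4ai internals:

```python
import asyncio


class FakeConnector:
    """Hypothetical stand-in for an HTTP connection pool."""

    async def connect(self):
        return "conn"


class FakeSession:
    """Hypothetical shared session whose connector is torn down too early."""

    def __init__(self, max_uses=2):
        self.connector = FakeConnector()
        self.uses = 0
        self.max_uses = max_uses

    async def get(self, url):
        self.uses += 1
        if self.uses > self.max_uses:
            # Simulate premature teardown after two successful requests.
            self.connector = None
        # On the third call this dereferences None, mirroring the report.
        return await self.connector.connect()


async def main():
    session = FakeSession()
    results = []
    for _ in range(3):
        try:
            results.append(await session.get("https://example.com"))
        except AttributeError as exc:
            results.append(f"HTTP request failed: {exc}")
    return results


results = asyncio.run(main())
print(results)
```

If this guess is right, the pattern would explain why exactly two pages succeed regardless of the domain crawled.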

OS

macOS

Python version

3.12

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

[INIT].... → Crawl4AI 0.5.0.post2
[FETCH]... ↓ https://example.com... | Status: True | Time: 0.50s
[SCRAPE].. ◆ https://example.com... | Time: 0.003s
[COMPLETE] ● https://example.com... | Status: True | Total: 0.50s
[FETCH]... ↓ https://example.com... | Status: True | Time: 0.31s
[SCRAPE].. ◆ https://example.com... | Time: 0.001s
[COMPLETE] ● https://example.com... | Status: True | Total: 0.32s
[ERROR]... × https://example.com... | Error: 
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 1834 in _handle_http (../../../miniconda3/envs/ai-                           │
│ assistant/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):                                           │
│   Error: HTTP request failed: 'NoneType' object has no attribute 'connect'                                            │
│                                                                                                                       │
│   Code context:                                                                                                       │
│   1829                   await self.hooks['on_error'](e)                                                              │
│   1830                   raise ConnectionTimeoutError(f"Request timed out: {str(e)}")                                 │
│   1831                                                                                                                │
│   1832               except Exception as e:                                                                           │
│   1833                   await self.hooks['on_error'](e)                                                              │
│   1834 →                 raise HTTPCrawlerError(f"HTTP request failed: {str(e)}")                                     │
│   1835                                                                                                                │
│   1836       async def crawl(                                                                                         │
│   1837           self,                                                                                                │
│   1838           url: str,                                                                                            │
│   1839           config: Optional[CrawlerRunConfig] = None,                                                           │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Status code: 200
Content length: 1256

Labels

🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)