[Bug]: Proxy Not Working with proxy_config option #604

Open
@SashaGordin

Description

crawl4ai version

Version: 0.4.248

Expected Behavior

The scraper should return the following data, which is publicly available on Airbnb:
[
{
"listingUrl": "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7",
"listingTitle": "Private King Room-Shared Bath",
"listingLocation": "San Diego, California, United States",
"hostNameOnPropertyPage": "Stay with Sam",
"hostProfileLinkOnPropertyPage": "/users/show/7597786",
"hostWork": "Lives in San Diego, CA",
"hostAbout": null,
"hostLocation": null
}
]

Current Behavior

The scraper works without a proxy. When I add a proxy, however, I get timeout errors instead of the correct output.

Is this reproducible?

Yes

Inputs Causing the Bug

'proxy_config': {
    'server': 'residential-proxy.scrapeops.io:8181?',
    'username': 'scrapeops',
    'password': 'SCRAPE_OPS_PASSWORD',
    'auth_type': 'basic'
},


This is defined within my config variable.

config = {
    'initialUrl': 'https://www.airbnb.com/s/San-Diego/homes',
    'selectors': {[SELECTORS_DEFINED_HERE]},
    'browserConfig': {
        'headless': False,
        'verbose': True,
        'proxy_config': {[shown above]},
        'extra_args': ['--disable-blink-features=AutomationControlled', '--disable-images', '--disable-dev-shm-usage']
    },
    'maxListingsToScrape': 1,
    'cityToSearch': 'San Diego'
}
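One thing worth ruling out is the shape of the `server` string itself: the value above has no scheme and ends in a stray `?`. Whether that is what causes the timeout is an assumption, not something the logs confirm. The sketch below uses a hypothetical helper, `normalize_proxy_server`, built only on the standard library, to reduce such a string to a plain `scheme://host:port` form before it is placed in `proxy_config`. The default `http` scheme is also an assumption.

```python
from urllib.parse import urlsplit


def normalize_proxy_server(server: str, default_scheme: str = "http") -> str:
    """Return scheme://host:port, dropping any path, query, or fragment.

    Hypothetical helper for illustration; it is not part of crawl4ai.
    """
    # urlsplit only recognises the host when a scheme is present,
    # so prepend the assumed default if the string has none.
    candidate = server if "://" in server else f"{default_scheme}://{server}"
    parts = urlsplit(candidate)
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy server needs a host and a port: {server!r}")
    return f"{parts.scheme}://{parts.hostname}:{parts.port}"


# The value from the report, including its trailing "?".
print(normalize_proxy_server("residential-proxy.scrapeops.io:8181?"))
```

If the cleaned value still times out, the server string is probably not the culprit and the proxy credentials or the `networkidle` wait are the next things to check.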

Steps to Reproduce

Code snippets

How I define the crawler with browser_config:

from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_config = BrowserConfig(**config['browserConfig'])
async with AsyncWebCrawler(config=browser_config) as crawler:

OS

macOS

Python version

Python 3.9.6

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

This is the log from a successful run without the proxy:

INFO:main:Starting Airbnb scraper
[INIT].... → Crawl4AI 0.4.247
INFO:main:Delay after search results page load: 8.16s
[FETCH]... ↓ Raw HTML... | Status: True | Time: 0.00s
[SCRAPE].. ◆ Processed Raw HTML... | Time: 18353ms
[EXTRACT]. ■ Completed for Raw HTML... | Time: 0.15768066699999395s
[COMPLETE] ● Raw HTML... | Status: True | Total: 18.52s
INFO:main:Found 24 listings
INFO:main:Found 24 listing URLs
INFO:main:Initial delay before listing https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7: 1.72s
DOM content loaded after script execution in 0.00937199592590332
[FETCH]... ↓ https://www.airbnb.com/rooms/5862910?adults=1&cate... | Status: True | Time: 33.88s
[SCRAPE].. ◆ Processed https://www.airbnb.com/rooms/5862910?adults=1&cate... | Time: 165ms
[EXTRACT]. ■ Completed for https://www.airbnb.com/rooms/5862910?adults=1&cate... | Time: 0.12295133399999258s
[COMPLETE] ● https://www.airbnb.com/rooms/5862910?adults=1&cate... | Status: True | Total: 34.17s
INFO:main:Extracted Content for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7 BEFORE JSON LOAD: [
{
"listingTitle": "Private King Room-Shared Bath",
"listingLocation": "San Diego, California, United States",
"hostNameOnPropertyPage": "Stay with Sam",
"hostProfileLinkOnPropertyPage": "/users/show/7597786"
}
]
INFO:main:Delay before host profile page: 11.00s
[FETCH]... ↓ https://www.airbnb.com/users/show/7597786... | Status: True | Time: 6.89s
[SCRAPE].. ◆ Processed https://www.airbnb.com/users/show/7597786... | Time: 75ms
[EXTRACT]. ■ Completed for https://www.airbnb.com/users/show/7597786... | Time: 0.030629540999996152s
[COMPLETE] ● https://www.airbnb.com/users/show/7597786... | Status: True | Total: 7.00s
INFO:main:Successfully processed 1 listings
INFO:main:Final data:
[
{
"listingUrl": "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7",
"listingTitle": "Private King Room-Shared Bath",
"listingLocation": "San Diego, California, United States",
"hostNameOnPropertyPage": "Stay with Sam",
"hostProfileLinkOnPropertyPage": "/users/show/7597786",
"hostWork": "Lives in San Diego, CA",
"hostAbout": null,
"hostLocation": null
}
]
INFO:main:Scraping completed

These are the logs from the failed run, after I add the proxy server:
INFO:main:Starting Airbnb scraper
[INIT].... → Crawl4AI 0.4.247
INFO:main:Delay after search results page load: 10.42s
[FETCH]... ↓ Raw HTML... | Status: True | Time: 0.01s
[SCRAPE].. ◆ Processed Raw HTML... | Time: 17973ms
[EXTRACT]. ■ Completed for Raw HTML... | Time: 0.1515022909999999s
[COMPLETE] ● Raw HTML... | Status: True | Total: 18.14s
INFO:main:Found 24 listings
INFO:main:Found 24 listing URLs
INFO:main:Initial delay before listing https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4: 5.37s
[ERROR]... × https://www.airbnb.com/rooms/5862910?adults=1&cate... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 1205 in _crawl_web (crawl4ai/async_crawler_strategy.py): │
│ Error: Failed on navigating ACS-GOTO: │
│ Page.goto: Timeout 240000ms exceeded. │
│ Call log: │
│ - navigating to "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true
│ &photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_173 │
│ 8549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4", │
│ waiting until "networkidle" │
│ │
│ │
│ Code context: │
│ 1200 │
│ 1201 response = await page.goto( │
│ 1202 url, wait_until=config.wait_until, timeout=config.page_timeout │
│ 1203 ) │
│ 1204 except Error as e: │
│ 1205 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}") │
│ 1206 │
│ 1207 await self.execute_hook("after_goto", page, context=context, url=url, response=response) │
│ 1208 │
│ 1209 if response is None: │
│ 1210 status_code = 200 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

INFO:main:Extracted Content for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4 BEFORE JSON LOAD: None
WARNING:main:Failed to extract property data for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4
ERROR:main:Main process failed: list index out of range
INFO:main:Scraping completed
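The last lines of the failed log show two separate problems: the navigation timeout, and then `list index out of range` once the extracted content comes back as `None`. The second looks like the script indexing into an empty result, though the script itself is not shown here, so that is an assumption. A minimal sketch of a guard, using a hypothetical helper named `first_record`, would let the run report the failed crawl without crashing:

```python
import json
from typing import Any, Optional


def first_record(extracted_content: Optional[str]) -> Optional[dict[str, Any]]:
    """Return the first extracted record, or None if there is nothing usable.

    Hypothetical helper; assumes the extracted content is a JSON array of
    objects, as in the "BEFORE JSON LOAD" log lines above.
    """
    if not extracted_content:
        # Covers None (failed crawl) and an empty string.
        return None
    try:
        records = json.loads(extracted_content)
    except json.JSONDecodeError:
        return None
    if isinstance(records, list) and records and isinstance(records[0], dict):
        return records[0]
    return None


# A failed crawl yields None instead of raising IndexError.
print(first_record(None))
print(first_record('[{"listingTitle": "Private King Room-Shared Bath"}]'))
```

This does not address the proxy timeout; it only keeps the secondary error from masking it.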

Labels

Bug (Something isn't working), Needs Triage (Needs attention of maintainers)
