Description
crawl4ai version
Version: 0.4.248
Expected Behavior
The scraper should grab
[
{
"listingUrl": "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7",
"listingTitle": "Private King Room-Shared Bath",
"listingLocation": "San Diego, California, United States",
"hostNameOnPropertyPage": "Stay with Sam",
"hostProfileLinkOnPropertyPage": "/users/show/7597786",
"hostWork": "Lives in San Diego, CA",
"hostAbout": null,
"hostLocation": null
}
]
from Airbnb's publicly available data.
Current Behavior
The scraper works without a proxy. However, when I add a proxy, I get timeout errors instead of the correct output.
Is this reproducible?
Yes
Inputs Causing the Bug
'proxy_config': {
    'server': 'residential-proxy.scrapeops.io:8181?',
    'username': 'scrapeops',
    'password': 'SCRAPE_OPS_PASSWORD',
    'auth_type': 'basic'
},
This is defined within browserConfig in my config variable:
config = {
    'initialUrl': 'https://www.airbnb.com/s/San-Diego/homes',
    'selectors': {[SELECTORS_DEFINED_HERE]},
    'browserConfig': {
        'headless': False,
        'verbose': True,
        'proxy_config': {[shown above]},
        'extra_args': ['--disable-blink-features=AutomationControlled', '--disable-images', '--disable-dev-shm-usage']
    },
    'maxListingsToScrape': 1,
    'cityToSearch': 'San Diego'
}
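Before involving the browser, it may help to check the proxy settings on their own. The sketch below is only an illustration built on the standard library: `normalize_proxy_server`, `build_proxy_url`, and `build_proxy_opener` are hypothetical helpers, not part of crawl4ai, and treating a bare host:port as an `http` proxy is an assumption. It drops the trailing `?` from the `server` value and builds a urllib opener that routes a request through the proxy with the same credentials. I have not confirmed that the trailing `?` is what causes the timeout.

```python
import urllib.request
from urllib.parse import quote, urlsplit


def normalize_proxy_server(server: str, default_scheme: str = "http") -> str:
    """Return 'scheme://host:port', dropping any stray path, query, or '?'."""
    if "://" not in server:
        # Assumption: a bare host:port is meant as an HTTP proxy.
        server = f"{default_scheme}://{server}"
    parts = urlsplit(server)
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy server needs a host and a port: {server!r}")
    return f"{parts.scheme}://{parts.hostname}:{parts.port}"


def build_proxy_url(proxy_config: dict) -> str:
    """Embed percent-encoded credentials into the normalized proxy URL."""
    server = normalize_proxy_server(proxy_config["server"])
    scheme, rest = server.split("://", 1)
    user = quote(proxy_config["username"], safe="")
    password = quote(proxy_config["password"], safe="")
    return f"{scheme}://{user}:{password}@{rest}"


def build_proxy_opener(proxy_config: dict) -> urllib.request.OpenerDirector:
    """Create an opener that sends HTTP and HTTPS requests through the proxy."""
    proxy_url = build_proxy_url(proxy_config)
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Calling `build_proxy_opener(config['browserConfig']['proxy_config']).open('https://www.airbnb.com', timeout=30)` would then show whether the proxy answers at all outside the browser. If that request also hangs or is rejected, the problem is more likely in the proxy settings or credentials than in crawl4ai.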
Steps to Reproduce
Code snippets
How I define the crawler with browser_config:
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(**config['browserConfig'])
async with AsyncWebCrawler(config=browser_config) as crawler:
    ...
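The failing log further down shows the navigation waiting until "networkidle" before timing out. One possible explanation is that some requests never settle when routed through the proxy, so the page never becomes idle. A way to test that idea is to retry with a less strict wait condition. The sketch below is a generic helper under stated assumptions: `crawl_with_fallback` is hypothetical, the `crawl` callable is supplied by the caller (for example a thin wrapper that passes the wait mode on to the crawler), and the returned object is assumed to expose a boolean `success` attribute.

```python
from typing import Any, Awaitable, Callable, Sequence


async def crawl_with_fallback(
    crawl: Callable[[str, str], Awaitable[Any]],
    url: str,
    wait_modes: Sequence[str] = ("networkidle", "domcontentloaded"),
) -> Any:
    """Try each wait mode in order and return the first successful result.

    `crawl(url, wait_until)` is provided by the caller and is assumed to
    return an object with a boolean `success` attribute. If every mode
    fails, the last result is returned so the caller can inspect the error.
    """
    result = None
    for mode in wait_modes:
        result = await crawl(url, mode)
        if getattr(result, "success", False):
            break  # Stop at the first wait mode that produced a usable page.
    return result
```

If the page loads with "domcontentloaded" but not with "networkidle", that would point at the wait condition rather than at the proxy connection itself.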
OS
macOS
Python version
Python 3.9.6
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
This is the log of the expected behavior (without the proxy):
INFO:main:Starting Airbnb scraper
[INIT].... → Crawl4AI 0.4.247
INFO:main:Delay after search results page load: 8.16s
[FETCH]... ↓ Raw HTML... | Status: True | Time: 0.00s
[SCRAPE].. ◆ Processed Raw HTML... | Time: 18353ms
[EXTRACT]. ■ Completed for Raw HTML... | Time: 0.15768066699999395s
[COMPLETE] ● Raw HTML... | Status: True | Total: 18.52s
INFO:main:Found 24 listings
INFO:main:Found 24 listing URLs
INFO:main:Initial delay before listing https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7: 1.72s
DOM content loaded after script execution in 0.00937199592590332
[FETCH]... ↓ https://www.airbnb.com/rooms/5862910?adults=1&cate... | Status: True | Time: 33.88s
[SCRAPE].. ◆ Processed https://www.airbnb.com/rooms/5862910?adults=1&cate... | Time: 165ms
[EXTRACT]. ■ Completed for https://www.airbnb.com/rooms/5862910?adults=1&cate... | Time: 0.12295133399999258s
[COMPLETE] ● https://www.airbnb.com/rooms/5862910?adults=1&cate... | Status: True | Total: 34.17s
INFO:main:Extracted Content for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7 BEFORE JSON LOAD: [
{
"listingTitle": "Private King Room-Shared Bath",
"listingLocation": "San Diego, California, United States",
"hostNameOnPropertyPage": "Stay with Sam",
"hostProfileLinkOnPropertyPage": "/users/show/7597786"
}
]
INFO:main:Delay before host profile page: 11.00s
[FETCH]... ↓ https://www.airbnb.com/users/show/7597786... | Status: True | Time: 6.89s
[SCRAPE].. ◆ Processed https://www.airbnb.com/users/show/7597786... | Time: 75ms
[EXTRACT]. ■ Completed for https://www.airbnb.com/users/show/7597786... | Time: 0.030629540999996152s
[COMPLETE] ● https://www.airbnb.com/users/show/7597786... | Status: True | Total: 7.00s
INFO:main:Successfully processed 1 listings
INFO:main:Final data:
[
{
"listingUrl": "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738548815_P3f4RWmpcjMAY_yu&previous_page_section_name=1000&federated_search_id=e010a89c-aefa-4ef8-8979-c86350c281d7",
"listingTitle": "Private King Room-Shared Bath",
"listingLocation": "San Diego, California, United States",
"hostNameOnPropertyPage": "Stay with Sam",
"hostProfileLinkOnPropertyPage": "/users/show/7597786",
"hostWork": "Lives in San Diego, CA",
"hostAbout": null,
"hostLocation": null
}
]
INFO:main:Scraping completed
These are the logs of the failed run after I add the proxy server:
INFO:main:Starting Airbnb scraper
[INIT].... → Crawl4AI 0.4.247
INFO:main:Delay after search results page load: 10.42s
[FETCH]... ↓ Raw HTML... | Status: True | Time: 0.01s
[SCRAPE].. ◆ Processed Raw HTML... | Time: 17973ms
[EXTRACT]. ■ Completed for Raw HTML... | Time: 0.1515022909999999s
[COMPLETE] ● Raw HTML... | Status: True | Total: 18.14s
INFO:main:Found 24 listings
INFO:main:Found 24 listing URLs
INFO:main:Initial delay before listing https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4: 5.37s
[ERROR]... × https://www.airbnb.com/rooms/5862910?adults=1&cate... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 1205 in _crawl_web (crawl4ai/async_crawler_strategy.py): │
│ Error: Failed on navigating ACS-GOTO: │
│ Page.goto: Timeout 240000ms exceeded. │
│ Call log: │
│ - navigating to "https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true │
│ &photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_173 │
│ 8549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4", │
│ waiting until "networkidle" │
│ │
│ │
│ Code context: │
│ 1200 │
│ 1201 response = await page.goto( │
│ 1202 url, wait_until=config.wait_until, timeout=config.page_timeout │
│ 1203 ) │
│ 1204 except Error as e: │
│ 1205 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}") │
│ 1206 │
│ 1207 await self.execute_hook("after_goto", page, context=context, url=url, response=response) │
│ 1208 │
│ 1209 if response is None: │
│ 1210 status_code = 200 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
INFO:main:Extracted Content for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4 BEFORE JSON LOAD: None
WARNING:main:Failed to extract property data for https://www.airbnb.com/rooms/5862910?adults=1&category_tag=Tag%3A8678&enable_m3_private_room=true&photo_id=1713733560&search_mode=regular_search&check_in=2025-02-09&check_out=2025-02-12&source_impression_id=p3_1738549529_P3mfSH1EV8Q_lbPE&previous_page_section_name=1000&federated_search_id=94cf8485-6d72-448d-be25-56c008367fe4
ERROR:main:Main process failed: list index out of range
INFO:main:Scraping completed
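The final "list index out of range" looks like a secondary failure: the extracted content is None after the timeout, and the script presumably still indexes into an empty result. A defensive parse would keep that from masking the real error. The sketch below is a minimal illustration; `first_record` is a hypothetical helper, and the assumption that the extracted content is a JSON string holding a list of objects is taken from the successful log above.

```python
import json
from typing import Optional


def first_record(extracted_content: Optional[str]) -> Optional[dict]:
    """Return the first object of a JSON list, or None if nothing usable exists."""
    if not extracted_content:
        # Covers None (failed crawl) and an empty string.
        return None
    try:
        data = json.loads(extracted_content)
    except json.JSONDecodeError:
        return None
    if isinstance(data, list) and data and isinstance(data[0], dict):
        return data[0]
    return None
```

With a check like this, a failed listing can be skipped or retried, and the timeout stays the only error in the log.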