Crawler raising exception and dying when respecting robots.txt #1563

@ericvg97

I was crawling techcrunch.com while respecting robots.txt and got two errors (with two different tracebacks) at different moments, each of which killed the crawler. It looks like a transient error: I can't reproduce it on demand, but it happens very often, after crawling just a few pages. I am using a proxy, in case that is relevant.
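To illustrate the behavior I would expect, here is a minimal sketch using only the Python standard library. It is not the library's actual code: `is_allowed`, `fetch_text`, and `allow_on_error` are hypothetical names made up for this example. The idea is that a failed robots.txt fetch (for example a TLS or connection error) is handled per request instead of propagating and stopping the whole crawl.

```python
import logging
from typing import Awaitable, Callable
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

logger = logging.getLogger(__name__)


async def is_allowed(
    url: str,
    fetch_text: Callable[[str], Awaitable[str]],  # hypothetical async fetcher
    user_agent: str = "*",
    allow_on_error: bool = True,  # hypothetical fallback policy
) -> bool:
    """Check robots.txt for `url`; a failed fetch yields the fallback policy."""
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        body = await fetch_text(robots_url)
    except Exception as exc:  # e.g. a certificate or connection error
        logger.warning("robots.txt fetch failed for %s: %r", robots_url, exc)
        return allow_on_error
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch(user_agent, url)
```

Whether a failed fetch should mean "allow" or "skip this URL" is a policy choice; the sketch only shows that the error does not have to be fatal.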

  1. Error 1
│ INFO  Processing https://techcrunch.com/2008/03/19/review-rock-band-the-operative-word-is-rock/ ...                                                                                                                                                        │
│ [__main__] ERROR Task crawler failed with exception                                                                                                                                                                                                                          │
│       NoneType: None                                                                                                                                                                                                                                                         │
│ [__main__] ERROR Error in concurrent tasks: Failed to connect to the server.                                                                                                                                                                                                 │
│ Reason: hyper_util::client::legacy::Error(                                                                                                                                                                                                                                   │
│     Connect,                                                                                                                                                                                                                                                                 │
│     Custom {                                                                                                                                                                                                                                                                 │
│         kind: Other,                                                                                                                                                                                                                                                         │
│         error: Custom {                                                                                                                                                                                                                                                      │
│             kind: InvalidData,                                                                                                                                                                                                                                               │
│             error: InvalidCertificate(                                                                                                                                                                                                                                       │
│                 NotValidForNameContext {                                                                                                                                                                                                                                     │
│                     expected: DnsName(                                                                                                                                                                                                                                       │
│                         "apply.techcrunch.com",                                                                                                                                                                                                                              │
│                     ),                                                                                                                                                                                                                                                       │
│                     presented: [                                                                                                                                                                                                                                             │
│                         "DnsName(\"www.makers.com\")",                                                                                                                                                                                                                       │
│                         "DnsName(\"www.intheknow.com\")",                                                                                                                                                                                                                    │
│                         "DnsName(\"www.builtbygirls.com\")",                                                                                                                                                                                                                 │
│                         "DnsName(\"www.aol.jp\")",                                                                                                                                                                                                                           │
│                         "DnsName(\"www.aol.de\")",                                                                                                                                                                                                                           │
│                         "DnsName(\"www.aol.co.uk\")",                                                                                                                                                                                                                        │
│                         "DnsName(\"www.aol.ca\")",                                                                                                                                                                                                                           │
│                         "DnsName(\"welcomescreen.aol.de\")",                                                                                                                                                                                                                 │
│                         "DnsName(\"wave.builtbygirls.com\")",                                                                                                                                                                                                                │
│                         "DnsName(\"wave-stage.builtbygirls.com\")",                                                                                                                                                                                                          │
│                         "DnsName(\"w.sb.welcomescreen.aol.com\")",                                                                                                                                                                                                           │
│                         "DnsName(\"w.main.welcomescreen.aol.com\")",                                                                                                                                                                                                         │
│                         "DnsName(\"venta.automoviles.aol.com\")",                                                                                                                                                                                                            │
│                         "DnsName(\"toshiba.aol.ca\")",                                                                                                                                                                                                                       │
│                         "DnsName(\"talktalk.aol.co.uk\")",                                                                                                                                                                                                                   │
│                         "DnsName(\"support.builtbygirls.com\")",                                                                                                                                                                                                             │
│                         "DnsName(\"shop.intheknow.com\")",                                                                                                                                                                                                                   │
│                         "DnsName(\"premium.yahoofinance.com\")",                                                                                                                                                                                                             │
│                         "DnsName(\"o2.welcomescreen.aol.de\")",                                                                                                                                                                                                              │
│                         "DnsName(\"o2.aol.de\")",                                                                                                                                                                                                                            │
│                         "DnsName(\"news.aol.jp\")",                                                                                                                                                                                                                          │
│                         "DnsName(\"n.sb.welcomescreen.aol.com\")",                                                                                                                                                                                                           │
│                         "DnsName(\"n.main.welcomescreen.aol.com\")",                                                                                                                                                                                                         │
│                         "DnsName(\"fluxible.io.yahoo.net\")",                                                                                                                                                                                                                │
│                         "DnsName(\"engadget.com\")",                                                                                                                                                                                                                         │
│                         "DnsName(\"didomi.makers.com\")",                                                                                                                                                                                                                    │
│                         "DnsName(\"didomi.aol.de\")",                                                                                                                                                                                                                        │
│                         "DnsName(\"brb.yahoo.net\")",                                                                                                                                                                                                                        │
│                         "DnsName(\"aolbroadband.welcomescreen.aol.co.uk\")",                                                                                                                                                                                                 │
│                         "DnsName(\"aol.com\")",                                                                                                                                                                                                                              │
│                         "DnsName(\"acss.io.yahoo.net\")",                                                                                                                                                                                                                    │
│                         "DnsName(\"*.www.aol.com\")",                                                                                                                                                                                                                        │
│                         "DnsName(\"*.shop.intheknow.com\")",                                                                                                                                                                                                                 │
│                         "DnsName(\"*.engadget.com\")",                                                                                                                                                                                                                       │
│                         "DnsName(\"*.cashay.com\")",                                                                                                                                                                                                                         │
│                         "DnsName(\"*.aol.com\")",                                                                                                                                                                                                                            │
│                     ],                                                                                                                                                                                                                                                       │
│                 },                                                                                                                                                                                                                                                           │
│             ),                                                                                                                                                                                                                                                               │
│         },                                                                                                                                                                                                                                                                   │
│     },                                                                                                                                                                                                                                                                       │
│ )                                                                                                                                                                                                                                                                            │
│ Traceback (most recent call last):                                                                                                                                                                                                                                           │
│   File "/usr/local/lib/python3.12/asyncio/runners.py", line 195, in run                                                                                                                                                                                                      │
│     return runner.run(main)                                                                                                                                                                                                                                                  │
│            ^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                  │
│   File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run                                                                                                                                                                                                      │
│     return self._loop.run_until_complete(task)                                                                                                                                                                                                                               │
│            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                               │
│   File "/usr/local/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete                                                                                                                                                                                   │
│     return future.result()                                                                                                                                                                                                                                                   │
│            ^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                   │
│   File "/main.py", line 90, in run_concurrent_tasks                                                                                                                                                                                                              │
│     raise exception                                                                                                                                                                                                                                                          │
│   File "/crawler.py", line 69, in crawl                                                                                                                                                                                                                          │
│     await self.crawler.run()                                                                                                                                                                                                                                                 │
│   File "/crawlers/_basic/_basic_crawler.py", line 697, in run                                                                                                                                                                 │
│     await run_task                                                                                                                                                                                                                                                           │
│   File "/crawlers/_basic/_basic_crawler.py", line 752, in _run_crawler                                                                                                                                                        │
│     await self._autoscaled_pool.run()                                                                                                                                                                                                                                        │
│   File "/_autoscaling/autoscaled_pool.py", line 126, in run                                                                                                                                                                   │
│     await run.result                                                                                                                                                                                                                                                         │
│   File "/_autoscaling/autoscaled_pool.py", line 277, in _worker_task                                                                                                                                                          │
│     await asyncio.wait_for(                                                                                                                                                                                                                                                  │
│   File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for                                                                                                                                                                                                   │
│     return await fut                                                                                                                                                                                                                                                         │
│            ^^^^^^^^^                                                                                                                                                                                                                                                         │
│   File "/crawlers/_basic/_basic_crawler.py", line 1366, in __run_task_function                                                                                                                                                │
│     if not (await self._is_allowed_based_on_robots_txt_file(request.url)):                                                                                                                                                                                                   │
│             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                     │
│   File "/crawlers/_basic/_basic_crawler.py", line 1566, in _is_allowed_based_on_robots_txt_file                                                                                                                               │
│     robots_txt_file = await self._get_robots_txt_file_for_url(url)                                                                                                                                                                                                           │
│                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                           │
│   File "/crawlers/_basic/_basic_crawler.py", line 1589, in _get_robots_txt_file_for_url                                                                                                                                       │
│     robots_txt_file = await self._find_txt_file_for_url(url)                                                                                                                                                                                                                 │
│                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 │
│   File "/crawlers/_basic/_basic_crawler.py", line 1599, in _find_txt_file_for_url                                                                                                                                             │
│     return await RobotsTxtFile.find(url, self._http_client)                                                                                                                                                                                                                  │
│            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                  │
│   File "/_utils/robots.py", line 48, in find                                                                                                                                                                                  │
│     return await cls.load(str(robots_url), http_client, proxy_info)                                                                                                                                                                                                          │
│            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                          │
│   File "/_utils/robots.py", line 59, in load                                                                                                                                                                                  │
│     response = await http_client.send_request(url, proxy_info=proxy_info)                                                                                                                                                                                                    │
│                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                    │
│   File "/http_clients/_impit.py", line 167, in send_request                                                                                                                                                                   │
│     response = await client.request(                                                                                                                                                                                                                                         │
│                ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                         │
│ impit.ConnectError: Failed to connect to the server.                                                                                                                                                                                                                         │
│ Reason: hyper_util::client::legacy::Error(                                                                                                                                                                                                                                   │
│     Connect,                                                                                                                                                                                                                                                                 │
│     Custom {                                                                                                                                                                                                                                                                 │
│         kind: Other,                                                                                                                                                                                                                                                         │
│         error: Custom {                                                                                                                                                                                                                                                      │
│             kind: InvalidData,                                                                                                                                                                                                                                               │
│             error: InvalidCertificate(                                                                                                                                                                                                                                       │
│                 NotValidForNameContext {                                                                                                                                                                                                                                     │
│                     expected: DnsName(                                                                                                                                                                                                                                       │
│                         "apply.techcrunch.com",                                                                                                                                                                                                                              │
│                     ),                                                                                                                                                                                                                                                       │
│                     presented: [                                                                                                                                                                                                                                             │
│                         "DnsName(\"www.makers.com\")",                                                                                                                                                                                                                       │
│                         "DnsName(\"www.intheknow.com\")",                                                                                                                                                                                                                    │
│                         "DnsName(\"www.builtbygirls.com\")",                                                                                                                                                                                                                 │
│                         "DnsName(\"www.aol.jp\")",                                                                                                                                                                                                                           │
│                         "DnsName(\"www.aol.de\")",                                                                                                                                                                                                                           │
│                         "DnsName(\"www.aol.co.uk\")",                                                                                                                                                                                                                        │
│                         "DnsName(\"www.aol.ca\")",                                                                                                                                                                                                                           │
│                         "DnsName(\"welcomescreen.aol.de\")",                                                                                                                                                                                                                 │
│                         "DnsName(\"wave.builtbygirls.com\")",                                                                                                                                                                                                                │
│                         "DnsName(\"wave-stage.builtbygirls.com\")",                                                                                                                                                                                                          │
│                         "DnsName(\"w.sb.welcomescreen.aol.com\")",                                                                                                                                                                                                           │
│                         "DnsName(\"w.main.welcomescreen.aol.com\")",                                                                                                                                                                                                         │
│                         "DnsName(\"venta.automoviles.aol.com\")",                                                                                                                                                                                                            │
│                         "DnsName(\"toshiba.aol.ca\")",                                                                                                                                                                                                                       │
│                         "DnsName(\"talktalk.aol.co.uk\")",                                                                                                                                                                                                                   │
│                         "DnsName(\"support.builtbygirls.com\")",                                                                                                                                                                                                             │
│                         "DnsName(\"shop.intheknow.com\")",                                                                                                                                                                                                                   │
│                         "DnsName(\"premium.yahoofinance.com\")",                                                                                                                                                                                                             │
│                         "DnsName(\"o2.welcomescreen.aol.de\")",                                                                                                                                                                                                              │
│                         "DnsName(\"o2.aol.de\")",                                                                                                                                                                                                                            │
│                         "DnsName(\"news.aol.jp\")",                                                                                                                                                                                                                          │
│                         "DnsName(\"n.sb.welcomescreen.aol.com\")",                                                                                                                                                                                                           │
│                         "DnsName(\"n.main.welcomescreen.aol.com\")",                                                                                                                                                                                                         │
│                         "DnsName(\"fluxible.io.yahoo.net\")",                                                                                                                                                                                                                │
│                         "DnsName(\"engadget.com\")",                                                                                                                                                                                                                         │
│                         "DnsName(\"didomi.makers.com\")",                                                                                                                                                                                                                    │
│                         "DnsName(\"didomi.aol.de\")",                                                                                                                                                                                                                        │
│                         "DnsName(\"brb.yahoo.net\")",                                                                                                                                                                                                                        │
│                         "DnsName(\"aolbroadband.welcomescreen.aol.co.uk\")",                                                                                                                                                                                                 │
│                         "DnsName(\"aol.com\")",                                                                                                                                                                                                                              │
│                         "DnsName(\"acss.io.yahoo.net\")",                                                                                                                                                                                                                    │
│                         "DnsName(\"*.www.aol.com\")",                                                                                                                                                                                                                        │
│                         "DnsName(\"*.shop.intheknow.com\")",                                                                                                                                                                                                                 │
│                         "DnsName(\"*.engadget.com\")",                                                                                                                                                                                                                       │
│                         "DnsName(\"*.cashay.com\")",                                                                                                                                                                                                                         │
│                         "DnsName(\"*.aol.com\")",                                                                                                                                                                                                                            │
│                     ],                                                                                                                                                                                                                                                       │
│                 },                                                                                                                                                                                                                                                           │
│             ),                                                                                                                                                                                                                                                               │
│         },                                                                                                                                                                                                                                                                   │
│     },                                                                                                                                                                                                                                                                       │
│ )      
  2. Error 2
[__main__] ERROR Error in concurrent tasks: Failed to connect to the server.                                                                                                                                                                                                 │
│ Reason: hyper_util::client::legacy::Error(                                                                                                                                                                                                                                   │
│     Connect,                                                                                                                                                                                                                                                                 │
│     ConnectError(                                                                                                                                                                                                                                                            │
│         "dns error",                                                                                                                                                                                                                                                         │
│         Custom {                                                                                                                                                                                                                                                             │
│             kind: Uncategorized,                                                                                                                                                                                                                                             │
│             error: "failed to lookup address information: Name or service not known",                                                                                                                                                                                        │
│         },                                                                                                                                                                                                                                                                   │
│     ),                                                                                                                                                                                                                                                                       │
│ )                                                                                                                                                                                                                                                                            │
│ Traceback (most recent call last):                                                                                                                                                                                                                                           │
│   File "<frozen runpy>", line 198, in _run_module_as_main                                                                                                                                                                                                                    │
│   File "<frozen runpy>", line 88, in _run_code                                                                                                                                                                                                                               │
│   File "main.py", line 117, in <module>                                                                                                                                                                                                                         │
│     asyncio.run(run_concurrent_tasks())                                                                                                                                                                                                                                      │
│   File "/usr/local/lib/python3.12/asyncio/runners.py", line 195, in run                                                                                                                                                                                                      │
│     return runner.run(main)                                                                                                                                                                                                                                                  │
│            ^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                  │
│   File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run                                                                                                                                                                                                      │
│     return self._loop.run_until_complete(task)                                                                                                                                                                                                                               │
│            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                               │
│   File "/usr/local/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete                                                                                                                                                                                   │
│     return future.result()                                                                                                                                                                                                                                                   │
│            ^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                   │
│   File "main.py", line 90, in run_concurrent_tasks                                                                                                                                                                                                              │
│     raise exception                                                                                                                                                                                                                                                          │
│   File "crawler.py", line 69, in crawl                                                                                                                                                                                                                          │
│     await self.crawler.run()                                                                                                                                                                                                                                                 │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 697, in run                                                                                                                                                                 │
│     await run_task                                                                                                                                                                                                                                                           │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 752, in _run_crawler                                                                                                                                                        │
│     await self._autoscaled_pool.run()                                                                                                                                                                                                                                        │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/_autoscaling/autoscaled_pool.py", line 126, in run                                                                                                                                                                   │
│     await run.result                                                                                                                                                                                                                                                         │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/_autoscaling/autoscaled_pool.py", line 277, in _worker_task                                                                                                                                                          │
│     await asyncio.wait_for(                                                                                                                                                                                                                                                  │
│   File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for                                                                                                                                                                                                   │
│     return await fut                                                                                                                                                                                                                                                         │
│            ^^^^^^^^^                                                                                                                                                                                                                                                         │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1366, in __run_task_function                                                                                                                                                │
│     if not (await self._is_allowed_based_on_robots_txt_file(request.url)):                                                                                                                                                                                                   │
│             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                     │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1566, in _is_allowed_based_on_robots_txt_file                                                                                                                               │
│     robots_txt_file = await self._get_robots_txt_file_for_url(url)                                                                                                                                                                                                           │
│                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                           │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1589, in _get_robots_txt_file_for_url                                                                                                                                       │
│     robots_txt_file = await self._find_txt_file_for_url(url)                                                                                                                                                                                                                 │
│                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1599, in _find_txt_file_for_url                                                                                                                                             │
│     return await RobotsTxtFile.find(url, self._http_client)                                                                                                                                                                                                                  │
│            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                  │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/_utils/robots.py", line 48, in find                                                                                                                                                                                  │
│     return await cls.load(str(robots_url), http_client, proxy_info)                                                                                                                                                                                                          │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/_utils/robots.py", line 59, in load                                                                                                                                                                                  │
│     response = await http_client.send_request(url, proxy_info=proxy_info)                                                                                                                                                                                                    │
│   File "/app/.venv/lib/python3.12/site-packages/crawlee/http_clients/_impit.py", line 167, in send_request                                                                                                                                                                   │
│     response = await client.request(                                                                                                                                                                                                                                         │
│                ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                         │
│ impit.ConnectError: Failed to connect to the server.                                                                                                                                                                                                                         │
│ Reason: hyper_util::client::legacy::Error(                                                                                                                                                                                                                                   │
│     Connect,                                                                                                                                                                                                                                                                 │
│     ConnectError(                                                                                                                                                                                                                                                            │
│         "dns error",                                                                                                                                                                                                                                                         │
│         Custom {                                                                                                                                                                                                                                                             │
│             kind: Uncategorized,                                                                                                                                                                                                                                             │
│             error: "failed to lookup address information: Name or service not known",                                                                                                                                                                                        │
│         },                                                                                                                                                                                                                                                                   │
│     ),                            
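
The traceback for Error 2 ends inside the robots.txt lookup (`RobotsTxtFile.load` calling `http_client.send_request`), so a failed fetch of robots.txt propagates all the way up and stops the crawler. One possible way to contain this kind of failure is sketched below. It is a generic illustration only and does not use any Crawlee API; `is_allowed_fail_open`, `failing_check`, and `passing_check` are hypothetical names, and whether to default to "allowed" or "disallowed" when robots.txt cannot be fetched is a policy choice, not something this issue settles.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def is_allowed_fail_open(
    check: Callable[[str], Awaitable[bool]],
    url: str,
    *,
    default: bool = True,
) -> bool:
    """Run a robots.txt check and return `default` if the lookup itself raises."""
    try:
        return await check(url)
    except Exception as exc:  # e.g. a connection, DNS, or TLS error
        print(f"robots.txt check failed for {url}: {exc!r}; using default={default}")
        return default


async def failing_check(url: str) -> bool:
    # Hypothetical stand-in for a robots.txt lookup that hits a DNS error.
    raise ConnectionError("failed to lookup address information")


async def passing_check(url: str) -> bool:
    # Hypothetical stand-in for a lookup that succeeds and disallows the URL.
    return False


# The failing lookup falls back to the default; the successful one is passed through.
print(asyncio.run(is_allowed_fail_open(failing_check, "https://techcrunch.com/")))
print(asyncio.run(is_allowed_fail_open(passing_check, "https://techcrunch.com/")))
```

Catching `Exception` here leaves `asyncio.CancelledError` alone, so task cancellation still works as usual.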

Labels

bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.