Description
I was crawling techcrunch.com with robots.txt compliance enabled and hit the two errors below (two different traces) at different moments; each one kills the crawler. The error looks transient: I cannot reproduce it deterministically, but it occurs very often, usually after crawling just a few pages. I am using a proxy, in case that is relevant.
- Error 1
│
│ INFO Processing https://techcrunch.com/2008/03/19/review-rock-band-the-operative-word-is-rock/ ... │
│ [__main__] ERROR Task crawler failed with exception │
│ NoneType: None │
│ [__main__] ERROR Error in concurrent tasks: Failed to connect to the server. │
│ Reason: hyper_util::client::legacy::Error( │
│ Connect, │
│ Custom { │
│ kind: Other, │
│ error: Custom { │
│ kind: InvalidData, │
│ error: InvalidCertificate( │
│ NotValidForNameContext { │
│ expected: DnsName( │
│ "apply.techcrunch.com", │
│ ), │
│ presented: [ │
│ "DnsName(\"www.makers.com\")", │
│ "DnsName(\"www.intheknow.com\")", │
│ "DnsName(\"www.builtbygirls.com\")", │
│ "DnsName(\"www.aol.jp\")", │
│ "DnsName(\"www.aol.de\")", │
│ "DnsName(\"www.aol.co.uk\")", │
│ "DnsName(\"www.aol.ca\")", │
│ "DnsName(\"welcomescreen.aol.de\")", │
│ "DnsName(\"wave.builtbygirls.com\")", │
│ "DnsName(\"wave-stage.builtbygirls.com\")", │
│ "DnsName(\"w.sb.welcomescreen.aol.com\")", │
│ "DnsName(\"w.main.welcomescreen.aol.com\")", │
│ "DnsName(\"venta.automoviles.aol.com\")", │
│ "DnsName(\"toshiba.aol.ca\")", │
│ "DnsName(\"talktalk.aol.co.uk\")", │
│ "DnsName(\"support.builtbygirls.com\")", │
│ "DnsName(\"shop.intheknow.com\")", │
│ "DnsName(\"premium.yahoofinance.com\")", │
│ "DnsName(\"o2.welcomescreen.aol.de\")", │
│ "DnsName(\"o2.aol.de\")", │
│ "DnsName(\"news.aol.jp\")", │
│ "DnsName(\"n.sb.welcomescreen.aol.com\")", │
│ "DnsName(\"n.main.welcomescreen.aol.com\")", │
│ "DnsName(\"fluxible.io.yahoo.net\")", │
│ "DnsName(\"engadget.com\")", │
│ "DnsName(\"didomi.makers.com\")", │
│ "DnsName(\"didomi.aol.de\")", │
│ "DnsName(\"brb.yahoo.net\")", │
│ "DnsName(\"aolbroadband.welcomescreen.aol.co.uk\")", │
│ "DnsName(\"aol.com\")", │
│ "DnsName(\"acss.io.yahoo.net\")", │
│ "DnsName(\"*.www.aol.com\")", │
│ "DnsName(\"*.shop.intheknow.com\")", │
│ "DnsName(\"*.engadget.com\")", │
│ "DnsName(\"*.cashay.com\")", │
│ "DnsName(\"*.aol.com\")", │
│ ], │
│ }, │
│ ), │
│ }, │
│ }, │
│ ) │
│ Traceback (most recent call last): │
│ File "/usr/local/lib/python3.12/asyncio/runners.py", line 195, in run │
│ return runner.run(main) │
│ ^^^^^^^^^^^^^^^^ │
│ File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run │
│ return self._loop.run_until_complete(task) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/usr/local/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete │
│ return future.result() │
│ ^^^^^^^^^^^^^^^ │
│ File "/main.py", line 90, in run_concurrent_tasks │
│ raise exception │
│ File "/crawler.py", line 69, in crawl │
│ await self.crawler.run() │
│ File "/crawlers/_basic/_basic_crawler.py", line 697, in run │
│ await run_task │
│ File "/crawlers/_basic/_basic_crawler.py", line 752, in _run_crawler │
│ await self._autoscaled_pool.run() │
│ File "/_autoscaling/autoscaled_pool.py", line 126, in run │
│ await run.result │
│ File "/_autoscaling/autoscaled_pool.py", line 277, in _worker_task │
│ await asyncio.wait_for( │
│ File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for │
│ return await fut │
│ ^^^^^^^^^ │
│ File "/crawlers/_basic/_basic_crawler.py", line 1366, in __run_task_function │
│ if not (await self._is_allowed_based_on_robots_txt_file(request.url)): │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/crawlers/_basic/_basic_crawler.py", line 1566, in _is_allowed_based_on_robots_txt_file │
│ robots_txt_file = await self._get_robots_txt_file_for_url(url) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/crawlers/_basic/_basic_crawler.py", line 1589, in _get_robots_txt_file_for_url │
│ robots_txt_file = await self._find_txt_file_for_url(url) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/crawlers/_basic/_basic_crawler.py", line 1599, in _find_txt_file_for_url │
│ return await RobotsTxtFile.find(url, self._http_client) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/_utils/robots.py", line 48, in find │
│ return await cls.load(str(robots_url), http_client, proxy_info) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/_utils/robots.py", line 59, in load │
│ response = await http_client.send_request(url, proxy_info=proxy_info) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/http_clients/_impit.py", line 167, in send_request │
│ response = await client.request( │
│ ^^^^^^^^^^^^^^^^^^^^^ │
│ impit.ConnectError: Failed to connect to the server. │
│ Reason: hyper_util::client::legacy::Error( │
│ Connect, │
│ Custom { │
│ kind: Other, │
│ error: Custom { │
│ kind: InvalidData, │
│ error: InvalidCertificate( │
│ NotValidForNameContext { │
│ expected: DnsName( │
│ "apply.techcrunch.com", │
│ ), │
│ presented: [ │
│ "DnsName(\"www.makers.com\")", │
│ "DnsName(\"www.intheknow.com\")", │
│ "DnsName(\"www.builtbygirls.com\")", │
│ "DnsName(\"www.aol.jp\")", │
│ "DnsName(\"www.aol.de\")", │
│ "DnsName(\"www.aol.co.uk\")", │
│ "DnsName(\"www.aol.ca\")", │
│ "DnsName(\"welcomescreen.aol.de\")", │
│ "DnsName(\"wave.builtbygirls.com\")", │
│ "DnsName(\"wave-stage.builtbygirls.com\")", │
│ "DnsName(\"w.sb.welcomescreen.aol.com\")", │
│ "DnsName(\"w.main.welcomescreen.aol.com\")", │
│ "DnsName(\"venta.automoviles.aol.com\")", │
│ "DnsName(\"toshiba.aol.ca\")", │
│ "DnsName(\"talktalk.aol.co.uk\")", │
│ "DnsName(\"support.builtbygirls.com\")", │
│ "DnsName(\"shop.intheknow.com\")", │
│ "DnsName(\"premium.yahoofinance.com\")", │
│ "DnsName(\"o2.welcomescreen.aol.de\")", │
│ "DnsName(\"o2.aol.de\")", │
│ "DnsName(\"news.aol.jp\")", │
│ "DnsName(\"n.sb.welcomescreen.aol.com\")", │
│ "DnsName(\"n.main.welcomescreen.aol.com\")", │
│ "DnsName(\"fluxible.io.yahoo.net\")", │
│ "DnsName(\"engadget.com\")", │
│ "DnsName(\"didomi.makers.com\")", │
│ "DnsName(\"didomi.aol.de\")", │
│ "DnsName(\"brb.yahoo.net\")", │
│ "DnsName(\"aolbroadband.welcomescreen.aol.co.uk\")", │
│ "DnsName(\"aol.com\")", │
│ "DnsName(\"acss.io.yahoo.net\")", │
│ "DnsName(\"*.www.aol.com\")", │
│ "DnsName(\"*.shop.intheknow.com\")", │
│ "DnsName(\"*.engadget.com\")", │
│ "DnsName(\"*.cashay.com\")", │
│ "DnsName(\"*.aol.com\")", │
│ ], │
│ }, │
│ ), │
│ }, │
│ }, │
│ )
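For what it's worth, the first trace is a server-side certificate problem rather than a crawler bug: the host serving `apply.techcrunch.com` presented a certificate whose SAN list contains only AOL/Yahoo-family names, so the TLS client correctly rejects it with `NotValidForName`. A minimal stdlib sketch of the hostname check the client performs (SAN list abbreviated from the trace above; the helper names are mine, not a Crawlee or impit API):

```python
# SAN entries presented by the certificate, abbreviated from the trace above.
presented_sans = [
    'www.makers.com',
    'engadget.com',
    'aol.com',
    '*.www.aol.com',
    '*.engadget.com',
    '*.aol.com',
]

def matches_san(hostname: str, san: str) -> bool:
    """Simplified RFC 6125 name check: a wildcard covers exactly one leftmost label."""
    host_labels = hostname.split('.')
    san_labels = san.split('.')
    if san_labels[0] == '*':
        # 'foo.bar.aol.com' must NOT match '*.aol.com'; only one label may vary.
        return len(host_labels) == len(san_labels) and host_labels[1:] == san_labels[1:]
    return hostname == san

def cert_valid_for(hostname: str) -> bool:
    return any(matches_san(hostname, san) for san in presented_sans)

print(cert_valid_for('apply.techcrunch.com'))  # False: NotValidForName, as in the trace
print(cert_valid_for('www.aol.com'))           # True: matched by '*.aol.com'
```

Since no SAN covers `apply.techcrunch.com`, any TLS client would fail here; the open question is why the crawler (or the proxy) was routed to a host presenting this certificate at all.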
- Error 2
[__main__] ERROR Error in concurrent tasks: Failed to connect to the server. │
│ Reason: hyper_util::client::legacy::Error( │
│ Connect, │
│ ConnectError( │
│ "dns error", │
│ Custom { │
│ kind: Uncategorized, │
│ error: "failed to lookup address information: Name or service not known", │
│ }, │
│ ), │
│ ) │
│ Traceback (most recent call last): │
│ File "<frozen runpy>", line 198, in _run_module_as_main │
│ File "<frozen runpy>", line 88, in _run_code │
│ File "main.py", line 117, in <module> │
│ asyncio.run(run_concurrent_tasks()) │
│ File "/usr/local/lib/python3.12/asyncio/runners.py", line 195, in run │
│ return runner.run(main) │
│ ^^^^^^^^^^^^^^^^ │
│ File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run │
│ return self._loop.run_until_complete(task) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/usr/local/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete │
│ return future.result() │
│ ^^^^^^^^^^^^^^^ │
│ File "main.py", line 90, in run_concurrent_tasks │
│ raise exception │
│ File "crawler.py", line 69, in crawl │
│ await self.crawler.run() │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 697, in run │
│ await run_task │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 752, in _run_crawler │
│ await self._autoscaled_pool.run() │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/_autoscaling/autoscaled_pool.py", line 126, in run │
│ await run.result │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/_autoscaling/autoscaled_pool.py", line 277, in _worker_task │
│ await asyncio.wait_for( │
│ File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for │
│ return await fut │
│ ^^^^^^^^^ │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1366, in __run_task_function │
│ if not (await self._is_allowed_based_on_robots_txt_file(request.url)): │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1566, in _is_allowed_based_on_robots_txt_file │
│ robots_txt_file = await self._get_robots_txt_file_for_url(url) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1589, in _get_robots_txt_file_for_url │
│ robots_txt_file = await self._find_txt_file_for_url(url) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1599, in _find_txt_file_for_url │
│ return await RobotsTxtFile.find(url, self._http_client) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/_utils/robots.py", line 48, in find │
│ return await cls.load(str(robots_url), http_client, proxy_info) │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/_utils/robots.py", line 59, in load │
│ response = await http_client.send_request(url, proxy_info=proxy_info) │
│ File "/app/.venv/lib/python3.12/site-packages/crawlee/http_clients/_impit.py", line 167, in send_request │
│ response = await client.request( │
│ ^^^^^^^^^^^^^^^^^^^^^ │
│ impit.ConnectError: Failed to connect to the server. │
│ Reason: hyper_util::client::legacy::Error( │
│ Connect, │
│ ConnectError( │
│ "dns error", │
│ Custom { │
│ kind: Uncategorized, │
│ error: "failed to lookup address information: Name or service not known", │
│ }, │
│ ),
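The second trace is a transient DNS resolution failure during the robots.txt fetch, and both traces show the exception propagating out of `RobotsTxtFile.find` and killing the whole run. Until this is retried internally, a workaround sketch is to wrap flaky async operations in a retry helper with exponential backoff (`with_retries` is a hypothetical helper of mine, not a Crawlee API; the flaky coroutine below just simulates the failure):

```python
import asyncio
import random

async def with_retries(coro_factory, *, attempts=3, base_delay=0.1,
                       retry_on=(ConnectionError, OSError)):
    """Retry an async operation on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            # Exponential backoff with a little jitter to avoid thundering herds.
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.05))

# Usage sketch: an operation that fails twice with a DNS-style error, then succeeds.
calls = 0

async def flaky_fetch():
    global calls
    calls += 1
    if calls < 3:
        raise ConnectionError('failed to lookup address information')
    return 'robots.txt contents'

result = asyncio.run(with_retries(flaky_fetch, base_delay=0.01))
print(result)  # -> robots.txt contents
```

In the real crawler this would mean catching `impit.ConnectError` around the robots.txt lookup (or subscribing to the crawler's error handling) rather than letting a single transient failure abort `crawler.run()`.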