No timeout in urllib.robotparser with focused_crawler #566

JER-CE · 2024-04-19T12:51:14Z

Some URLs cause focused_crawler to hang and never return. The code below runs indefinitely:

from trafilatura.spider import focused_crawler
import logging

logging.basicConfig(level=logging.DEBUG)
url = 'https://www.maersk.com/news/category/press-releases'
focused_crawler(url, max_seen_urls=1, max_known_urls=100)

adbar · 2024-04-19T14:53:54Z

Good point. It's not a bug in itself; the feature is simply not implemented yet. Let's put that on the list.

adbar · 2024-05-08T09:16:42Z

@JER-CE It was actually a bug: urllib.robotparser can try to load the URL indefinitely. Thanks for the detailed code snippet, it really helps.

It's still not working (on my computer at least), but debugging now shows why: the download fails (probably a user-agent issue).
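For illustration, here is a minimal sketch of one way to avoid the hang using only the standard library: download robots.txt with an explicit timeout, then hand the text to RobotFileParser.parse() instead of calling RobotFileParser.read(), which does not pass a timeout to urlopen. The helper names fetch_robots and parse_robots and the 10-second default are assumptions made for this example; this is not necessarily how the fix in #590 is implemented.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def parse_robots(robots_url, text):
    """Build a parser from robots.txt text that was already downloaded.

    Hypothetical helper for this sketch, not part of trafilatura.
    """
    parser = RobotFileParser()
    parser.set_url(robots_url)
    # parse() takes an iterable of lines and performs no network access.
    parser.parse(text.splitlines())
    return parser


def fetch_robots(robots_url, timeout=10.0):
    """Download robots.txt with an explicit timeout; return None on failure.

    Hypothetical helper for this sketch. The timeout applies to each
    blocking socket operation (connect, read), not to the whole download.
    """
    try:
        with urllib.request.urlopen(robots_url, timeout=timeout) as response:
            text = response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
    return parse_robots(robots_url, text)


if __name__ == "__main__":
    rules = fetch_robots("https://www.maersk.com/robots.txt")
    if rules is None:
        print("robots.txt could not be downloaded in time")
    else:
        print(rules.can_fetch("*", "https://www.maersk.com/news/category/press-releases"))
```

Returning None on failure leaves it to the caller to decide whether a missing robots.txt means "crawl anyway" or "skip the site".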

adbar added the enhancement label Apr 19, 2024

adbar changed the title ~~No timeout for some URLs when using focused_crawler~~ No timeout in urllib.robotparser with focused_crawler May 8, 2024

adbar added the bug label and removed the enhancement label May 8, 2024

adbar linked a pull request May 8, 2024 that will close this issue

spider fix: use internal download utilities for robots.txt #590

Merged

adbar closed this as completed in #590 May 8, 2024
