No timeout in urllib.robotparser with focused_crawler #566

Closed
JER-CE opened this issue Apr 19, 2024 · 2 comments · Fixed by #590
Labels
bug (Something isn't working)

Comments

JER-CE commented Apr 19, 2024

Some URLs cause focused_crawler to never return. The code below runs indefinitely:

from trafilatura.spider import focused_crawler
import logging

logging.basicConfig(level=logging.DEBUG)
url = 'https://www.maersk.com/news/category/press-releases'
focused_crawler(url, max_seen_urls=1, max_known_urls=100)
adbar added the enhancement label on Apr 19, 2024
adbar (Owner) commented Apr 19, 2024

Good point. It's not a bug in itself; the feature is just not implemented yet. Let's put that on the list.

adbar changed the title from "No timeout for some URLs when using focused_crawler" to "No timeout in urllib.robotparser with focused_crawler" on May 8, 2024
adbar added the bug label and removed the enhancement label on May 8, 2024
adbar (Owner) commented May 8, 2024

@JER-CE It was actually a bug: urllib.robotparser fetches robots.txt without a timeout, so it can try to load the URL indefinitely. Thanks for the detailed code snippet, it really helps.

It's still not working (on my computer at least), but debugging now shows why: the download fails (probably a user-agent issue).
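
For readers hitting the same hang, here is a minimal sketch of one common workaround: download robots.txt yourself with an explicit timeout and User-Agent, then hand the text to `RobotFileParser.parse()` instead of calling `read()`. This is not necessarily what #590 does. The helper name `load_robots`, the `opener` parameter, and the `example-crawler/0.1` user-agent string are all made up for illustration.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def load_robots(robots_url, timeout=10.0, user_agent="example-crawler/0.1",
                opener=urllib.request.urlopen):
    """Fetch robots.txt with a timeout; return a parser, or None on failure.

    Hypothetical helper: `opener` exists only so the download step can be
    swapped out (for example in tests); it defaults to urllib's urlopen.
    """
    try:
        request = urllib.request.Request(
            robots_url, headers={"User-Agent": user_agent}
        )
        # Unlike RobotFileParser.read(), this call passes a timeout.
        with opener(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        return None

    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.parse(body.splitlines())
    return parser
```

A caller would then check `parser.can_fetch(user_agent, url)` when a parser comes back, and decide for itself how to treat a `None` result (skip the site, or crawl without robots rules).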

adbar linked a pull request that will close this issue on May 8, 2024
adbar closed this as completed in #590 on May 8, 2024