[Bug]: Reliable Parsing of Dynamic Websites #644

Open
addie9800 opened this issue Oct 22, 2024 · 0 comments
Labels
bug Something isn't working

addie9800 (Collaborator) commented Oct 22, 2024

Describe the bug

Some websites occasionally return an HTML page that consists mostly of script elements instead of the full HTML. I first noticed this while working with BoersenZeitung.
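To make "mostly containing script elements" checkable, here is a minimal sketch of a heuristic that compares the amount of text inside script elements with the remaining text of a page. It uses only the standard library and is not part of Fundus; the name `looks_script_only` and the 0.9 threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _ScriptRatioParser(HTMLParser):
    """Counts characters inside <script> elements versus all other text."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore pure whitespace so indentation does not count as content.
        stripped = data.strip()
        if self._in_script:
            self.script_chars += len(stripped)
        else:
            self.text_chars += len(stripped)


def looks_script_only(html: str, threshold: float = 0.9) -> bool:
    """Hypothetical helper: True if script text dominates the page.

    The threshold is an assumed value, not one derived from real pages.
    """
    parser = _ScriptRatioParser()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.text_chars
    if total == 0:
        return True  # no text at all: treat the page as incomplete
    return parser.script_chars / total >= threshold
```

A check like this could be run on the raw HTML of an article whose title is missing, to tell an incomplete response apart from a parser that simply failed on a complete page.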

How to reproduce

from fundus import PublisherCollection, Crawler
from fundus.logging import set_log_level
from logging import DEBUG

publisher = PublisherCollection.de.BoersenZeitung
crawler = Crawler(publisher)
set_log_level(DEBUG)
for article in crawler.crawl(max_articles=50, only_complete=False, error_handling="suppress"):
    print(article.html.responded_url)
    print(article.title)
    print("--------------------------------")

Expected behavior.

I would expect a title to be parsed and printed consistently for every article.
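One possible way to get closer to this behavior is to request a page again when the response looks incomplete. The following is only a sketch under the assumption that a repeated request eventually returns the full HTML, which I have not verified. `fetch_until_complete`, `fetch`, and `is_complete` are hypothetical names and do not exist in Fundus.

```python
import time
from typing import Callable, Optional


def fetch_until_complete(
    fetch: Callable[[], str],
    is_complete: Callable[[str], bool],
    max_attempts: int = 3,
    delay: float = 1.0,
) -> Optional[str]:
    """Call `fetch` until `is_complete` accepts the HTML or attempts run out.

    Returns the first complete HTML string, or None if every attempt
    produced an incomplete page.
    """
    for attempt in range(max_attempts):
        html = fetch()
        if is_complete(html):
            return html
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return None
```

Returning `None` instead of raising keeps the decision about incomplete pages with the caller, similar in spirit to crawling with `only_complete=False`.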

Logs and Stack traces

No response

Screenshots

Logs in the first iteration:

image

Logs in the second iteration:

image

Additional Context

Here is an example of an incomplete HTML file: test.zip

Environment

python==3.9

aiohttp==3.8.6
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.2.0
black==23.1.0
Brotli==1.1.0
certifi==2024.2.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
cssselect==1.2.0
decorator==5.1.1
dict2xml==1.7.6
dill==0.3.8
exceptiongroup==1.2.0
FastWARC==0.14.5
feedparser==6.0.11
frozenlist==1.4.1
-e git+https://github.com/flairNLP/fundus.git@05cc97dd8be59ac05d89456ac0db39cddce74e02#egg=fundus
idna==3.6
iniconfig==2.0.0
isort==5.12.0
langdetect==1.0.9
lxml==4.9.4
more-itertools==9.1.0
multidict==6.0.4
mypy==1.9.0
mypy-extensions==1.0.0
numpy==1.26.4
packaging==23.2
pandas==2.2.2
pathspec==0.12.1
platformdirs==4.1.0
pluggy==1.4.0
pytest==7.2.2
python-dateutil==2.8.2
pytz==2024.1
requests==2.31.0
robotspy==0.10.0
sgmllib3k==1.0.0
six==1.16.0
tomli==2.0.1
tqdm==4.66.1
types-colorama==0.4.15.20240106
types-lxml==2023.2.11
types-python-dateutil==2.8.19.20240106
types-requests==2.28.11.17
types-urllib3==1.26.25.14
typing_extensions==4.9.0
tzdata==2024.1
urllib3==2.2.0
validators==0.28.0
xmltodict==0.14.1
yarl==1.9.4
addie9800 added the bug label Oct 22, 2024