Describe the bug
Some websites occasionally don't return the full HTML, but rather an HTML page consisting mostly of script elements. I first noticed this while working with BoersenZeitung.
How to reproduce
Expected behavior
I would expect to consistently see a title being parsed and printed.
Logs and Stack traces
No response
Screenshots
Logs in the first iteration:
Logs in the second iteration:
Additional Context
Here is an example of an incomplete HTML file: test.zip
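For anyone trying to spot these responses programmatically, below is a minimal sketch of a heuristic check. It is not part of fundus; the names `looks_incomplete` and `max_script_ratio` and the 0.5 threshold are assumptions made up for illustration, and it uses only the standard library's `html.parser`. It flags a page when no title text is found or when more than half of the start tags are script elements.

```python
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Counts start tags and collects the text inside <title>."""

    def __init__(self) -> None:
        super().__init__()
        self.total_tags = 0
        self.script_tags = 0
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.total_tags += 1
        if tag == "script":
            self.script_tags += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text between <title> and </title> is kept.
        if self._in_title:
            self.title += data


def looks_incomplete(html: str, max_script_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: True if the page has no title text
    or consists mostly of <script> elements.

    The threshold is an assumption, not a value taken from fundus.
    """
    counter = _TagCounter()
    counter.feed(html)
    counter.close()
    if counter.total_tags == 0:
        return True
    script_ratio = counter.script_tags / counter.total_tags
    return not counter.title.strip() or script_ratio > max_script_ratio
```

A caller could use a check like this to retry the request or skip the article instead of passing the stub page on to the parser. Whether a retry actually returns the full HTML is something I have not verified.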
Environment
python==3.9
aiohttp==3.8.6
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.2.0
black==23.1.0
Brotli==1.1.0
certifi==2024.2.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
cssselect==1.2.0
decorator==5.1.1
dict2xml==1.7.6
dill==0.3.8
exceptiongroup==1.2.0
FastWARC==0.14.5
feedparser==6.0.11
frozenlist==1.4.1
-e git+https://github.com/flairNLP/fundus.git@05cc97dd8be59ac05d89456ac0db39cddce74e02#egg=fundus
idna==3.6
iniconfig==2.0.0
isort==5.12.0
langdetect==1.0.9
lxml==4.9.4
more-itertools==9.1.0
multidict==6.0.4
mypy==1.9.0
mypy-extensions==1.0.0
numpy==1.26.4
packaging==23.2
pandas==2.2.2
pathspec==0.12.1
platformdirs==4.1.0
pluggy==1.4.0
pytest==7.2.2
python-dateutil==2.8.2
pytz==2024.1
requests==2.31.0
robotspy==0.10.0
sgmllib3k==1.0.0
six==1.16.0
tomli==2.0.1
tqdm==4.66.1
types-colorama==0.4.15.20240106
types-lxml==2023.2.11
types-python-dateutil==2.8.19.20240106
types-requests==2.28.11.17
types-urllib3==1.26.25.14
typing_extensions==4.9.0
tzdata==2024.1
urllib3==2.2.0
validators==0.28.0
xmltodict==0.14.1
yarl==1.9.4