Description:
When running the STORM pipeline with the Google retriever, some of the returned URLs respond with 403 Forbidden and yield empty or malformed HTML. For these pages, trafilatura logs errors such as "parsed tree length: 0, wrong data type or not valid HTML" and "empty HTML tree", then crashes with memory corruption errors ("corrupted size vs. prev_size", "double free or corruption (out)") that end in a segmentation fault.
Steps to Reproduce:
- Run the STORM Wiki pipeline using a local Ollama model and the Google retriever.
- Submit a query whose search results include URLs known to return 403 responses (e.g., from openai.com or sciencedirect.com).
- Observe that trafilatura logs parsing errors and eventually crashes with a segmentation fault.
Expected Behavior:
trafilatura should gracefully handle invalid or empty HTML content by logging a warning or error and returning an empty result, without causing memory corruption or a crash.
Actual Behavior:
After logging errors for invalid/empty HTML, trafilatura crashes with a segmentation fault due to memory corruption.
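Because a segmentation fault kills the interpreter and cannot be caught with try/except, one possible stopgap is to run the extraction in a child process so that a crash only loses that one page. The following is a minimal sketch, not something STORM or trafilatura provides: `run_isolated` and `EXTRACT_CHILD` are hypothetical names, and the child script assumes trafilatura is installed in the same environment.

```python
import subprocess
import sys
from typing import Optional


def run_isolated(child_code: str, payload: str, timeout: float = 60.0) -> Optional[str]:
    """Run child_code in a fresh interpreter, feeding payload on stdin.

    Returns the child's stdout, or None if the child exited abnormally
    (non-zero exit status, death by a signal such as SIGSEGV, or timeout).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            input=payload,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    # On POSIX a negative return code means the child was killed by a signal.
    if proc.returncode != 0:
        return None
    return proc.stdout


# Hypothetical child script: reads HTML from stdin and prints the extracted text.
EXTRACT_CHILD = (
    "import sys, trafilatura\n"
    "text = trafilatura.extract(sys.stdin.read())\n"
    "sys.stdout.write(text or '')\n"
)
```

With this in place, `run_isolated(EXTRACT_CHILD, html)` returns `None` for a page that crashes the extractor instead of taking down the whole pipeline. It costs one interpreter start-up per page, so it is a workaround rather than a fix.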
Environment:
- Python Version: 3.11 (from the 3.11-slim Docker image)
- Operating System: Docker container based on Debian slim
Additional Notes:
- No custom user-agent headers or modifications have been applied.
- The issue appears to be triggered when processing HTML content that is empty or invalid, especially in cases where the URL returns a 403 Forbidden response.
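Given that trigger, a caller-side guard that skips non-2xx responses and empty bodies before they reach the extractor might avoid the crash path. This is a rough sketch under that assumption: `safe_extract` is a hypothetical helper, and the extractor is passed in as a callable so the check itself does not depend on any particular library.

```python
from typing import Callable, Optional


def safe_extract(
    html: Optional[str],
    status_code: int,
    extractor: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the extractor's result, or None when the response is unusable."""
    # Skip error responses such as 403 Forbidden.
    if not 200 <= status_code < 300:
        return None
    # Skip missing or whitespace-only bodies, which would parse to an empty tree.
    if html is None or not html.strip():
        return None
    return extractor(html)
```

In the pipeline this could be called as `safe_extract(body, status, trafilatura.extract)`. It does not catch malformed HTML that is non-empty and served with a 200 status, so it only narrows the set of inputs that can trigger the crash.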