Segmentation Fault in trafilatura When Processing Empty/Invalid HTML Content (403 Responses) #340

**Description:**  
When running the STORM pipeline with the Google retriever, some result URLs return 403 Forbidden and yield empty or malformed HTML. For these pages trafilatura first logs errors such as "parsed tree length: 0, wrong data type or not valid HTML" and "empty HTML tree", then aborts with memory corruption errors ("corrupted size vs. prev_size", "double free or corruption (out)") that end in a segmentation fault.

**Steps to Reproduce:**  
1. Run the STORM Wiki pipeline using a local Ollama model and the Google retriever.  
2. Submit a query whose search results include URLs that return 403 responses (e.g., from openai.com or sciencedirect.com).  
3. Observe that trafilatura logs parsing errors and eventually crashes with a segmentation fault.

**Expected Behavior:**  
trafilatura should gracefully handle invalid or empty HTML content by logging a warning or error and returning an empty result, without causing memory corruption or a crash.
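
Until that behavior is in place, a caller-side guard can approximate it. The following is a minimal sketch, not part of STORM or trafilatura: `safe_extract` is a hypothetical helper, and the extractor is passed in as a callable so that `trafilatura.extract` could be supplied in practice. Whether skipping these inputs actually avoids the crash is an assumption that this report does not confirm.

```python
from typing import Callable, Optional


def safe_extract(
    html: Optional[str],
    status_code: int,
    extractor: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Hypothetical guard: only call the extractor on plausible HTML.

    Returns None (an "empty result") for non-200 responses, empty or
    whitespace-only bodies, and extractor exceptions.
    """
    # Skip error responses such as 403 Forbidden entirely.
    if status_code != 200:
        return None
    # Skip missing or blank bodies before they reach the parser.
    if html is None or not html.strip():
        return None
    try:
        # Normalize an empty extraction ("" or None) to None.
        return extractor(html) or None
    except Exception:
        # Python-level errors only; a segfault cannot be caught here.
        return None
```

With trafilatura installed, a call might look like `safe_extract(body, status, trafilatura.extract)`, where `body` and `status` come from whatever HTTP client the retriever uses. Note that `try/except` cannot intercept a segmentation fault, so the guard only helps by keeping bad input away from the parser.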

**Actual Behavior:**  
After logging errors for invalid/empty HTML, trafilatura crashes with a segmentation fault due to memory corruption.
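
To help pin down where the crash occurs, Python's standard `faulthandler` module can dump the Python traceback when the process receives a fatal signal such as SIGSEGV. The sketch below assumes it is called once at pipeline startup; the helper name `enable_crash_tracebacks` and the log path are hypothetical.

```python
import faulthandler


def enable_crash_tracebacks(log_path: str):
    """Write Python tracebacks for fatal signals (e.g. SIGSEGV) to log_path.

    The returned file object must stay open for as long as faulthandler
    is enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    # all_threads=True dumps every thread, which helps when extraction
    # runs inside a thread pool.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

The same effect is available without code changes by setting the `PYTHONFAULTHANDLER=1` environment variable in the container, in which case the traceback goes to stderr.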

**Environment:**  
- **Python Version:** 3.11 (`3.11-slim` Docker image)  
- **Operating System:** Docker container based on Debian slim

**Additional Notes:**  
- No custom user-agent headers or modifications have been applied.  
- The issue appears to be triggered when processing HTML content that is empty or invalid, especially in cases where the URL returns a 403 Forbidden response.
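
Because a segmentation fault kills the whole interpreter, one possible mitigation is to run extraction in a child process so that a crash only loses a single page. The following is a rough sketch under stated assumptions: `extract_isolated` and `CHILD_SCRIPT` are hypothetical names, and the default child script assumes trafilatura is importable in the child environment.

```python
import subprocess
import sys
from typing import Optional

# Child program: reads HTML from stdin, writes extracted text to stdout.
# Assumes trafilatura is installed where the child interpreter runs.
CHILD_SCRIPT = (
    "import sys, trafilatura\n"
    "text = trafilatura.extract(sys.stdin.read())\n"
    "sys.stdout.write(text or '')\n"
)


def extract_isolated(
    html: str,
    child_script: str = CHILD_SCRIPT,
    timeout: float = 30.0,
) -> Optional[str]:
    """Run extraction in a separate interpreter; return None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_script],
            input=html,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    # A non-zero return code covers both Python errors and death by
    # signal (reported as a negative value on POSIX, e.g. SIGSEGV).
    if proc.returncode != 0:
        return None
    return proc.stdout or None
```

Starting an interpreter per page is slow, so this suits debugging or low-volume runs; a long-lived worker pool that is restarted after a crash would be the heavier-weight alternative.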
