Segmentation Fault in trafilatura When Processing Empty/Invalid HTML Content (403 Responses) #340

Open
@milesian01

Description:
When running the STORM pipeline with the Google retriever, some URLs return 403 Forbidden responses whose bodies are empty or malformed HTML. For these responses, trafilatura logs errors such as "parsed tree length: 0, wrong data type or not valid HTML" and "empty HTML tree", then crashes with memory corruption errors ("corrupted size vs. prev_size", "double free or corruption (out)") that end in a segmentation fault.

Steps to Reproduce:

  1. Run the STORM Wiki pipeline using a local Ollama model and the Google retriever.
  2. Submit a query that returns URLs known to produce 403 responses (e.g., from openai.com or sciencedirect.com).
  3. Observe that trafilatura logs parsing errors and eventually crashes with a segmentation fault.

Expected Behavior:
trafilatura should gracefully handle invalid or empty HTML content by logging a warning or error and returning an empty result, without causing memory corruption or a crash.
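
Until the crash itself is resolved, a caller-side guard can approximate the expected behavior by never handing failed or empty responses to the extractor. The following is only a minimal sketch: `safe_extract` and its parameters are hypothetical names, not part of trafilatura or STORM, and the extractor is passed in as a callable so the guard stays independent of any particular library.

```python
from typing import Callable, Optional


def safe_extract(
    status_code: int,
    html: Optional[str],
    extractor: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return extracted text, or None when the response should not be parsed.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Skip non-2xx responses such as 403 Forbidden.
    if not 200 <= status_code < 300:
        return None
    # Skip missing, empty, or whitespace-only bodies.
    if html is None or not html.strip():
        return None
    # Only now hand the content to the real extractor.
    return extractor(html)
```

In practice the extractor argument could be `trafilatura.extract`, for example `safe_extract(resp.status_code, resp.text, trafilatura.extract)`, assuming `resp` is a response object exposing those two attributes. This avoids the failing inputs described here but does not address malformed HTML returned with a 200 status.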

Actual Behavior:
After logging errors for invalid/empty HTML, trafilatura crashes with a segmentation fault due to memory corruption.
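
Because a segmentation fault terminates the interpreter and cannot be caught with `try`/`except`, one possible workaround is to run extraction in a child process so that a crash only kills the child. The sketch below is an assumption-laden illustration, not a fix for the underlying bug: `extract_isolated` and `TRAFILATURA_CHILD` are hypothetical names, and the default child program assumes trafilatura is installed in the same environment.

```python
import subprocess
import sys
from typing import Optional

# Child program: reads HTML from stdin and prints the extracted text.
# Assumes trafilatura is importable in the child's environment.
TRAFILATURA_CHILD = (
    "import sys, trafilatura\n"
    "text = trafilatura.extract(sys.stdin.read())\n"
    "sys.stdout.write(text or '')\n"
)


def extract_isolated(
    html: str,
    child_code: str = TRAFILATURA_CHILD,
    timeout: float = 30.0,
) -> Optional[str]:
    """Run extraction in a separate Python process (hypothetical helper).

    Returns the child's output, or None if the child crashed, exited
    with an error, timed out, or produced no output.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            input=html,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    # On POSIX a negative return code means the child was killed by a
    # signal (for example SIGSEGV); any non-zero code is treated as failure.
    if proc.returncode != 0:
        return None
    return proc.stdout or None
```

Spawning a process per document adds overhead, so this trade-off mainly makes sense when a single bad URL would otherwise take down the whole pipeline run.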

Environment:

  • Python Version: 3.11-slim (Docker container)
  • Operating System: Docker container based on Debian slim

Additional Notes:

  • No custom user-agent headers or modifications have been applied.
  • The issue appears to be triggered when processing HTML content that is empty or invalid, especially in cases where the URL returns a 403 Forbidden response.
