State corruption when hydra-node storage becomes full during write operations #2268

@DeltaDeFiProtocol

Summary

When the disk backing the hydra-node persistence directory becomes full, an incomplete write can corrupt the persisted node state, potentially leading to data loss and requiring manual intervention to recover.

Description

I encountered a critical issue where my hydra-node became unresponsive due to state corruption. The root cause appears to be insufficient disk space during a state write operation, which resulted in an incomplete write that corrupted the persisted state.

When attempting to restart the node from the persisted state, I received the following error:
Error when decoding from file /persistence/persistence-alice/state: "Unexpected end-of-input while parsing string literal"

This error indicates that the state file was truncated or corrupted during the write operation when storage became full.

Attached is a copy of the corrupted state.
state.zip

Steps to Reproduce

  1. Run hydra-node with limited storage space
  2. Allow storage to fill up during normal operation
  3. Node attempts to persist state during low disk space conditions
  4. Write operation is interrupted/incomplete due to storage constraints
  5. Node state becomes corrupted and unrecoverable
  6. Restart attempt fails with JSON parsing error due to truncated state file
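
To illustrate steps 4 to 6 without filling a real disk, here is a minimal Python sketch of what a truncated write does to a JSON record. The record contents (`ExampleEvent`, `payload`) are made up for illustration and are not the actual hydra-node state format; the point is only that cutting a record off inside a string literal makes it undecodable, which matches the kind of error shown above.

```python
import json

# Hypothetical record; the real hydra-node state format is NOT shown here.
record = json.dumps({"tag": "ExampleEvent", "payload": "some long string value"})

# Simulate a write cut short by a full disk: only a prefix reaches the file.
# Dropping the last 10 characters leaves the final string literal unterminated.
truncated = record[: len(record) - 10]


def try_decode(text):
    """Return (True, value) on success or (False, error message) on failure."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        return False, str(err)


ok_full, _ = try_decode(record)
ok_truncated, message = try_decode(truncated)
print("full record decodes:", ok_full)
print("truncated record decodes:", ok_truncated, "-", message)
```

The exact error wording differs between JSON parsers, so the message printed here will not match the hydra-node one verbatim.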

Expected Behavior

  • Hydra-node should handle storage constraints gracefully
  • State writes should be atomic or include rollback mechanisms
  • Recovery mechanisms should be available for corrupted states
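
As a rough sketch of what I mean by atomic writes: one common approach is to write the new state to a temporary file in the same directory, flush it to disk, and only then rename it over the old file. This is written in Python for brevity and is not how hydra-node is implemented; the function name `atomic_write_json` is my own, and I am assuming a whole-file rewrite, which may not match how hydra-node actually persists state (for example, if it appends to the file instead).

```python
import json
import os
import tempfile


def atomic_write_json(path, value):
    """Replace `path` with the JSON encoding of `value`, or leave it untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(value, tmp)
            tmp.flush()
            # Force the data to disk before the rename makes it visible.
            os.fsync(tmp.fileno())
        # On POSIX, os.replace renames atomically: readers see either the
        # old file or the complete new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        # Any failure (for example ENOSPC while writing) leaves the
        # previous file intact; only the temporary file is discarded.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern, running out of disk space fails the write of the temporary file and the previous state file stays readable. It does not by itself free up space or make the node keep running, so the error would still need to be surfaced to the operator.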

Actual Behavior

  • State corruption occurs when write operations are incomplete
  • Node becomes unresponsive/unrecoverable
  • Restart fails with "Unexpected end-of-input while parsing string literal" error
  • No clear recovery path without manual intervention
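
For reference, the kind of manual intervention I have in mind looks roughly like the Python sketch below. It ASSUMES the state file holds one JSON record per line, which I have not confirmed for hydra-node, and the helper name `salvage_valid_prefix` is my own. It keeps the leading records that still decode and stops at the first one that does not, on the theory that only the tail of the file was truncated.

```python
import json


def salvage_valid_prefix(lines):
    """Return the leading lines that each parse as JSON.

    Stops at the first line that fails to decode, so nothing after a
    corrupted record is kept. Blank lines are skipped.
    """
    good = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        try:
            json.loads(stripped)
        except json.JSONDecodeError:
            break
        good.append(stripped)
    return good


# Example with made-up records; the last one is cut off mid-string.
example = ['{"id": 0}', '{"id": 1, "note": "ok"}', '{"id": 2, "note": "tru']
print(salvage_valid_prefix(example))
```

I would only ever write the salvaged records to a new file and keep the original untouched. I also do not know whether a node can safely resume after dropping the last record, since other parties in the head may already have seen what it described, so guidance from the team would be needed before relying on anything like this.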

Environment

  • Hydra Node Version: 0.22.3

Questions for the Team

  1. Are there any existing safeguards against storage-related state corruption?
  2. Is there a planned fix or mitigation strategy for this issue?
  3. Are there recovery procedures for corrupted state files?
