process_file writes tracking record before downstream validation succeeds #12

@Iwan-Dyke

Problem:

When process_file() is called with tracking="delta_table", it writes a tracking record as soon as the file is successfully read. If downstream processing fails (e.g. column count validation, schema mismatch, write failure), the tracking record persists. On the next run, the file is skipped because it appears already processed.

This means any transient or configuration error blocks the file from being reprocessed until someone intervenes manually (deleting the tracking record or using clear_tracking).

Steps to reproduce:

  1. Call process_file() with a valid file path and tracking="delta_table"
  2. The file reads successfully and the tracking record is written
  3. After process_file() returns, apply a column validation that fails (e.g. wrong delimiter configured, column count mismatch)
  4. On the next run, process_file() skips the file because it is already tracked
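
The steps above can be modelled with a toy stand-in. This is only a sketch of the reported behaviour, not the library's code: the in-memory set replaces the Delta tracking table, and the process_file() signature and the validate_columns() helper are assumptions made for illustration.

```python
# Toy model of the reported behaviour. Nothing here is the library's real
# code: `tracked` stands in for the Delta tracking table, and this
# process_file() is a simplified stand-in with an assumed signature.
tracked = set()


def process_file(path, rows):
    """Return the rows read, or None if the path is already tracked."""
    if path in tracked:
        return None      # skipped: the file appears already processed
    tracked.add(path)    # tracking record written right after the read
    return rows


def validate_columns(rows, expected):
    """Hypothetical downstream check that runs after process_file() returns."""
    for row in rows:
        if len(row) != expected:
            raise ValueError(f"expected {expected} columns, got {len(row)}")


# Wrong delimiter configured: each row arrives as a single column.
rows = [["a;b;c"]]

# First run: the read succeeds, downstream validation fails.
first = process_file("data/in.csv", rows)
try:
    validate_columns(first, expected=3)
    first_run_ok = True
except ValueError:
    first_run_ok = False

# Second run: the file is skipped even though it never completed.
second = process_file("data/in.csv", rows)
```

After both runs, first_run_ok is False and second is None: the failed file is never retried.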

Expected behaviour:

A file should only be marked as tracked after the caller confirms the full pipeline succeeded. Either:

  • Defer the tracking write until the caller explicitly commits it
  • Provide a two-phase API: process_file() returns the result without writing a tracking record, and the caller calls confirm_tracking() after a successful write
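
A minimal sketch of what the two-phase option could look like. confirm_tracking() is the name proposed above, but its signature, the ProcessResult type, the run_pipeline() helper and the in-memory tracking set are all assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass

# Stand-in for the Delta tracking table (assumption for this sketch).
tracked = set()


@dataclass
class ProcessResult:
    """Hypothetical return type carrying what the caller needs to commit."""
    path: str
    rows: list


def process_file(path, rows):
    """Phase 1: read only. No tracking record is written here."""
    if path in tracked:
        return None
    return ProcessResult(path=path, rows=rows)


def confirm_tracking(result):
    """Phase 2: the caller commits tracking once the pipeline succeeded."""
    tracked.add(result.path)


def run_pipeline(path, rows, expected_columns):
    """Hypothetical caller: read, validate, then commit tracking."""
    result = process_file(path, rows)
    if result is None:
        return "skipped"
    if any(len(row) != expected_columns for row in result.rows):
        return "failed"          # nothing tracked, so the file stays retryable
    confirm_tracking(result)
    return "committed"


bad = [["a;b;c"]]            # wrong delimiter: one column per row
good = [["a", "b", "c"]]     # configuration fixed

outcomes = [
    run_pipeline("data/in.csv", bad, 3),   # fails downstream
    run_pipeline("data/in.csv", good, 3),  # retried without manual cleanup
    run_pipeline("data/in.csv", good, 3),  # now tracked, so skipped
]
```

Because tracking is written only in confirm_tracking(), a failure between the two phases leaves no record behind, and the next run picks the file up again.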

Workaround:

Callers must either:

  • Manually delete tracking records for failed files
  • Use clear_tracking() with a reprocess flag to force re-read

Acceptance criteria:

  • Tracking records are not persisted for files that fail downstream processing
  • Existing callers using raise_on_error=False are not broken by the change
  • A file that fails after read can be retried on the next run without manual intervention

Labels: bug
