Problem:
When process_file() is called with tracking="delta_table", it writes a tracking record as soon as the file is successfully read. If downstream processing fails (e.g. column count validation, schema mismatch, write failure), the tracking record persists. On the next run, the file is skipped because it appears already processed.
This means any transient or configuration error permanently blocks a file from being reprocessed without manual intervention (deleting the tracking record or using clear_tracking).
Steps to reproduce:
- Call process_file() with a valid file path and tracking="delta_table"
- The file reads successfully, so the tracking record is written
- After process_file() returns, apply column validation that fails (e.g. wrong delimiter configured, column count mismatch)
- On the next run, process_file() skips the file because it is already tracked
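The steps above can be modelled without the real library. The sketch below is a toy stand-in, not the actual implementation: the in-memory `tracked` set, the hard-coded rows, the `validate_columns()` helper, and the path `data/input.csv` are all assumptions made for illustration. Only the name `process_file()` and the `tracking="delta_table"` argument come from this report.

```python
# Hypothetical stand-in: an in-memory set plays the role of the Delta tracking table.
tracked: set[str] = set()


def process_file(path: str, tracking: str = "delta_table"):
    """Toy model of the reported behaviour: track as soon as the read succeeds."""
    if path in tracked:
        return None  # skipped: the file looks already processed
    rows = [["a", "b", "c"]]  # pretend the file was read successfully
    if tracking == "delta_table":
        tracked.add(path)  # record written before any downstream step runs
    return rows


def validate_columns(rows, expected: int) -> None:
    """Hypothetical downstream check that runs after process_file() returns."""
    for row in rows:
        if len(row) != expected:
            raise ValueError(f"expected {expected} columns, got {len(row)}")


# First run: the read succeeds, then validation fails.
first = process_file("data/input.csv")
try:
    validate_columns(first, expected=4)  # rows have 3 columns, so this raises
except ValueError as exc:
    failure = str(exc)

# Next run: the file is skipped even though the pipeline never succeeded.
second = process_file("data/input.csv")
print(second)  # None
```

In this model the only way to get the file read again is to remove its entry from `tracked`, which mirrors the manual intervention described above.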
Expected behaviour:
A file should only be marked as tracked after the caller confirms the full pipeline succeeded. Either:
- Defer tracking write until the caller explicitly commits it
- Provide a two-phase API: process_file() returns the result without tracking, and the caller calls confirm_tracking() after a successful write
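One possible shape for the two-phase option is sketched below. It is an illustration under stated assumptions, not a proposed final API: the `FileTracker` class, its `is_tracked()` method, the `run_pipeline()` helper, and the in-memory storage are hypothetical. Only `process_file()` and `confirm_tracking()` are names taken from this report, and their signatures here are guesses.

```python
class FileTracker:
    """Hypothetical in-memory tracker; a real one would persist to the Delta table."""

    def __init__(self) -> None:
        self._confirmed: set[str] = set()

    def is_tracked(self, path: str) -> bool:
        return path in self._confirmed

    def confirm_tracking(self, path: str) -> None:
        # Phase 2: called by the caller only after the full pipeline succeeded.
        self._confirmed.add(path)


def process_file(path: str, tracker: FileTracker):
    """Phase 1: read only. Nothing is recorded here."""
    if tracker.is_tracked(path):
        return None  # already confirmed on an earlier run
    return [["a", "b", "c"]]  # placeholder for the real file read


def run_pipeline(path: str, tracker: FileTracker, expected_columns: int) -> str:
    rows = process_file(path, tracker)
    if rows is None:
        return "skipped"
    if any(len(row) != expected_columns for row in rows):
        return "failed"  # no confirmation, so the file stays eligible for retry
    # ... the downstream write would happen here ...
    tracker.confirm_tracking(path)
    return "processed"


tracker = FileTracker()
outcomes = [
    run_pipeline("data/input.csv", tracker, expected_columns=4),  # misconfigured run
    run_pipeline("data/input.csv", tracker, expected_columns=3),  # config fixed, file retried
    run_pipeline("data/input.csv", tracker, expected_columns=3),  # already confirmed
]
print(outcomes)  # ['failed', 'processed', 'skipped']
```

Because the record is written only in `confirm_tracking()`, a failed validation or write leaves no trace, and the next run picks the file up again without any manual cleanup.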
Workaround:
Callers must either:
- Manually delete tracking records for failed files
- Use clear_tracking() with a reprocess flag to force re-read
Acceptance criteria:
raise_on_error=False are not broken by the change