Improve large-run resilience: checkpoint timeouts + CSV save fallback #63

Merged
hemanth-asirvatham merged 1 commit into main from update-get-all-responses-error-handling
Feb 25, 2026

@hemanth-asirvatham
Collaborator

Motivation

  • Long runs can hit timeouts and produce final CSVs so large that they fail to save. Resumed runs should reuse prior timing information, and a failed write should not crash the run.
  • The previous large-run warning at 50k rows was too conservative and the message could be friendlier.
  • Provide a simple, reusable fallback for writing very large DataFrames so tasks return results even when disk saves fail.

Description

  • Raise the large-run advisory threshold in get_all_responses from 50k to 100k and soften the message to suggest splitting only if API/timeout issues are encountered, using get_all_responses parameters for context.
  • Add checkpoint-aware timeout bootstrapping by extracting successful Time Taken values from existing checkpoint CSVs and initializing the dynamic timeout via new helpers _collect_successful_time_taken_samples and _compute_dynamic_timeout_from_samples in src/gabriel/utils/openai_utils.py.
  • Introduce save_dataframe_with_fallback in src/gabriel/utils/file_utils.py (exported via gabriel.utils) which attempts a normal DataFrame.to_csv and on failure writes 100k-row split files (<stem>_1.csv, <stem>_2.csv, ...), logging and printing informative messages without raising.
  • Wire the fallback saver into task outputs that commonly emit final/aggregated CSVs (rate, classify, extract) and into other final-output tasks where simple to do (deidentify, paraphrase, codify, seed, debias) to improve robustness on large outputs.
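The save fallback described above can be illustrated with a minimal sketch. This is not the code added in src/gabriel/utils/file_utils.py: the function name save_with_fallback_sketch, its chunk_rows parameter, and its return value (the list of paths written) are assumptions made for illustration. Only the overall behaviour (try DataFrame.to_csv, then write 100k-row files named <stem>_1.csv, <stem>_2.csv, ..., without raising) comes from the description.

```python
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)


def save_with_fallback_sketch(df: pd.DataFrame, path: str, chunk_rows: int = 100_000) -> list[str]:
    """Try a single CSV write; on failure, write numbered split files.

    Hypothetical helper: returns the paths that were written (possibly
    empty) and never raises on a failed write.
    """
    target = Path(path)
    try:
        df.to_csv(target, index=False)
        return [str(target)]
    except Exception as exc:
        logger.warning("Could not save %s (%s); trying split files", target, exc)

    written: list[str] = []
    # Write consecutive row slices to <stem>_1.csv, <stem>_2.csv, ...
    for part, start in enumerate(range(0, len(df), chunk_rows), start=1):
        part_path = target.with_name(f"{target.stem}_{part}{target.suffix}")
        try:
            df.iloc[start:start + chunk_rows].to_csv(part_path, index=False)
            written.append(str(part_path))
        except Exception as exc:
            logger.warning("Could not save %s (%s)", part_path, exc)
    return written
```

Returning the written paths is one possible design choice: it lets a caller report where the data ended up, or detect that nothing was saved, while still handing the in-memory DataFrame back to the user.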

Testing

  • Added unit tests for the timeout helpers (tests/test_timeout_logic.py) and for checkpoint resume behaviour (tests/test_checkpoint_timeout_resume.py) which validate that resumed runs can bootstrap dynamic timeouts from checkpoint Time Taken samples.
  • Added tests for CSV fallback behaviour (tests/test_save_dataframe_fallback.py) covering successful split fallback and all-writes-fail scenarios.
  • Ran compilation and the full test suite: python -m compileall -q src tests and pytest -q both completed successfully (188 passed, 6 skipped).
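The checkpoint-based timeout bootstrapping that these tests exercise can be sketched roughly as follows. This is an illustration under stated assumptions, not the logic of _collect_successful_time_taken_samples or _compute_dynamic_timeout_from_samples: the Successful column name, the percentile rule, and the multiplier, floor, and min_samples parameters are all hypothetical. Only the Time Taken column and the idea of seeding the timeout from successful checkpoint rows come from the description.

```python
import csv
import io
import math


def collect_time_taken_samples(csv_text: str) -> list[float]:
    """Return positive, finite 'Time Taken' values from rows marked successful."""
    samples: list[float] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "Successful" is an assumed column name used only in this sketch.
        if str(row.get("Successful", "")).strip().lower() != "true":
            continue
        try:
            value = float(row.get("Time Taken") or "")
        except ValueError:
            continue  # blank or non-numeric timing, skip the row
        if math.isfinite(value) and value > 0:
            samples.append(value)
    return samples


def timeout_from_samples(samples, multiplier=2.0, floor=30.0, min_samples=5):
    """Hypothetical rule: multiplier times the ~90th percentile, never below floor.

    Returns None when there are too few samples to seed a timeout.
    """
    if len(samples) < min_samples:
        return None
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, math.ceil(0.9 * len(ordered)) - 1)
    return max(floor, multiplier * ordered[idx])
```

A resumed run could call collect_time_taken_samples on the text of each existing checkpoint file and use the result of timeout_from_samples as its starting timeout, falling back to a default when the function returns None.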

Codex Task

@github-actions

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@hemanth-asirvatham hemanth-asirvatham merged commit 38dc87b into main Feb 25, 2026
1 check failed
@github-actions github-actions bot locked and limited conversation to collaborators Feb 25, 2026