Improve large-run resilience: checkpoint timeouts + CSV save fallback #63
Merged
hemanth-asirvatham merged 1 commit into main from ... on Feb 25, 2026
Motivation
Description
- ... `get_all_responses` from 50k to 100k and soften the message to suggest splitting only if API/timeout issues are encountered, using `get_all_responses` parameters for context.
- ... `Time Taken` values from existing checkpoint CSVs and initializing the dynamic timeout via new helpers `_collect_successful_time_taken_samples` and `_compute_dynamic_timeout_from_samples` in `src/gabriel/utils/openai_utils.py`.
- ... `save_dataframe_with_fallback` in `src/gabriel/utils/file_utils.py` (exported via `gabriel.utils`), which attempts a normal `DataFrame.to_csv` and on failure writes 100k-row split files (`<stem>_1.csv`, `<stem>_2.csv`, ...), logging and printing informative messages without raising.
- ... `rate`, `classify`, `extract`) and into other final-output tasks where simple to do (`deidentify`, `paraphrase`, `codify`, `seed`, `debias`) to improve robustness on large outputs.

Testing
- ... `tests/test_timeout_logic.py`) and for checkpoint resume behaviour (`tests/test_checkpoint_timeout_resume.py`), which validate that resumed runs can bootstrap dynamic timeouts from checkpoint `Time Taken` samples.
- ... `tests/test_save_dataframe_fallback.py`) covering successful split fallback and all-writes-fail scenarios.
- ... `python -m compileall -q src tests` and `pytest -q` completed successfully with all tests passing (188 passed, 6 skipped).
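
For readers who want a concrete picture of the checkpoint-timeout bootstrap described above, here is a minimal sketch. It is not the code in `src/gabriel/utils/openai_utils.py`: the function names, the rule that a sample counts as successful when `Time Taken` parses as a positive finite number, and the quantile, multiplier, floor, and minimum-sample values are all assumptions made for illustration. Only the `Time Taken` column name comes from the description.

```python
import csv
import math
from pathlib import Path
from typing import List, Optional


def collect_time_taken_samples(checkpoint_csv: Path) -> List[float]:
    """Read usable `Time Taken` values from a checkpoint CSV.

    Assumption: a row is treated as successful when its `Time Taken`
    cell parses as a positive, finite number. The real helper may use
    a different success criterion.
    """
    samples: List[float] = []
    with open(checkpoint_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            raw = (row.get("Time Taken") or "").strip()
            try:
                value = float(raw)
            except ValueError:
                continue  # blank or non-numeric cell
            if math.isfinite(value) and value > 0:
                samples.append(value)
    return samples


def timeout_from_samples(
    samples: List[float],
    min_samples: int = 5,    # hypothetical: too few samples means no estimate
    quantile: float = 0.95,  # hypothetical: nearest-rank quantile of latencies
    factor: float = 1.5,     # hypothetical: headroom above that quantile
    floor: float = 30.0,     # hypothetical: never go below this many seconds
) -> Optional[float]:
    """Return a timeout in seconds, or None when there is too little data."""
    if len(samples) < min_samples:
        return None
    ordered = sorted(samples)
    rank = max(1, math.ceil(quantile * len(ordered)))  # nearest-rank method
    return max(floor, factor * ordered[rank - 1])
```

On resume, a caller could pass the existing checkpoint path to `collect_time_taken_samples` and fall back to a static default timeout whenever `timeout_from_samples` returns `None`.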
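
The CSV save fallback described above can be sketched roughly as follows. This is an illustration of the stated behaviour (try a normal `DataFrame.to_csv`, on failure write 100k-row split files named `<stem>_1.csv`, `<stem>_2.csv`, ..., and report problems without raising), not the actual `save_dataframe_with_fallback` implementation. The function name, signature, return value, and message wording here are assumptions.

```python
import logging
from pathlib import Path
from typing import TYPE_CHECKING, Any, List, Union

if TYPE_CHECKING:  # pandas is only needed for the type hint
    import pandas as pd

logger = logging.getLogger(__name__)


def save_csv_with_split_fallback(
    df: "pd.DataFrame",
    path: Union[str, Path],
    chunk_rows: int = 100_000,  # split size named in the description
    **to_csv_kwargs: Any,
) -> List[Path]:
    """Write `df` to `path`; on failure, write row-chunked split files.

    Returns the paths actually written (hypothetical return contract).
    Never raises on write errors; it logs and prints instead.
    """
    target = Path(path)
    try:
        df.to_csv(target, **to_csv_kwargs)
        return [target]
    except Exception as exc:
        msg = f"Could not write {target} ({exc!r}); trying {chunk_rows}-row split files."
        logger.warning(msg)
        print(msg)

    written: List[Path] = []
    for part, start in enumerate(range(0, len(df), chunk_rows), start=1):
        # <stem>_1.csv, <stem>_2.csv, ... next to the requested path
        part_path = target.with_name(f"{target.stem}_{part}{target.suffix}")
        try:
            df.iloc[start:start + chunk_rows].to_csv(part_path, **to_csv_kwargs)
            written.append(part_path)
        except Exception as exc:
            msg = f"Could not write split file {part_path} ({exc!r})."
            logger.error(msg)
            print(msg)
    return written
```

Returning the list of written paths is one possible design: the caller can tell a clean save (one path), a split save (several paths), and a total failure (empty list) apart without handling exceptions.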