[Core] Fix ObjectFetchTimedOutError #46562
Merged
Why are these changes needed?
TL;DR: the bug is that for an object that's being reconstructed, we failed to set `pending_creation` to true, so the pull manager fails with `ObjectFetchTimedOutError`.

There are two places where we mark the object as `pending_creation`:
- `ReferenceCounter::UpdateSubmittedTaskReferences`
- `ReferenceCounter::UpdateResubmittedTaskReferences`
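
To make the role of the flag concrete, here is a minimal toy model of the behavior described in the TL;DR. It is only a sketch under stated assumptions: `ToyObjectState`, `ToyPullDecider`, `FetchOutcome`, and the timeout values are hypothetical names invented for illustration and are not Ray's actual classes or the real pull manager logic.

```cpp
#include <cassert>
#include <chrono>

// Toy model only: hypothetical types, not Ray's ReferenceCounter or pull manager.
enum class FetchOutcome { kKeepWaiting, kTimedOut };

struct ToyObjectState {
  bool has_location = false;      // some node currently holds a copy
  bool pending_creation = false;  // the creator task is expected to (re)produce it
};

class ToyPullDecider {
 public:
  explicit ToyPullDecider(std::chrono::seconds fetch_timeout)
      : fetch_timeout_(fetch_timeout) {}

  // Assumed rule: an object with no location and no pending creator
  // times out once the fetch has waited for at least fetch_timeout_.
  FetchOutcome Check(const ToyObjectState &obj, std::chrono::seconds waited) const {
    if (obj.has_location || obj.pending_creation) {
      return FetchOutcome::kKeepWaiting;
    }
    return waited >= fetch_timeout_ ? FetchOutcome::kTimedOut
                                    : FetchOutcome::kKeepWaiting;
  }

 private:
  std::chrono::seconds fetch_timeout_;
};

// Walks through the two cases: a lost object with and without the flag set.
inline void RunToyScenario() {
  ToyPullDecider decider(std::chrono::seconds(10));
  ToyObjectState lost;  // no location, flag not set

  // Without the flag, the fetch gives up after the timeout.
  assert(decider.Check(lost, std::chrono::seconds(30)) == FetchOutcome::kTimedOut);

  // With the flag, the fetch keeps waiting for the creator task.
  lost.pending_creation = true;
  assert(decider.Check(lost, std::chrono::seconds(30)) == FetchOutcome::kKeepWaiting);
}
```

In this toy, the only difference between a fetch that times out and one that keeps waiting for a lost object is whether `pending_creation` was set, which is why a code path that forgets to set it matters.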

However, if the creator task is re-executing due to a failure (e.g. a worker process crash, `TaskManager::RetryTaskIfPossible`), the return object is not marked as `pending_creation`. This is fine for non-streaming-generator tasks: their return objects become available atomically only after the creator task finishes, so they are still `pending_creation` during the retry. For a streaming generator task, however, some return objects may already be available and in use by downstream tasks when the generator task fails.

A repro sequence of events:
- (... `ReferenceCounter::UpdateResubmittedTaskReferences` is not called).

This PR fixes the issue by marking `pending_creation` whenever the object is lost and the creator task is running to re-create it, instead of inside `ReferenceCounter::UpdateResubmittedTaskReferences`, which might not be called.

Related issue number
Checks
- ... `git commit -s`) in this PR.
- ... `scripts/format.sh` to lint the changes in this PR.
- ... method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.