feat(rst): cleanup incomplete downloads #196
base: main
Conversation
Force-pushed from 69e6399 to ae5b525
Force-pushed from 71f4dc6 to c3bf43d
I found one corner case that may lead to accidental data loss. When forcibly cancelling jobs, if there was ever a download in the job history it will reset the file back to a stub:
```
$ cat test
hello world
$ beegfs remote job list test
OK  PATH   TARGET  JOB_UPDATED           REQUEST_TYPE
✅  /test  1       2025-07-08T18:31:24Z  UPLOAD
$ beegfs remote job cancel test --force
OK  PATH   TARGET  JOB_ID                                REQUEST_TYPE
🚫  /test  1       8c18cc00-62da-4c1f-8845-84c04ea4b87f  UPLOAD
🚫  /test  1       7ee0c2fa-1ec5-497c-a88a-76c9010481f6  DOWNLOAD
$ cat test
rst://1:test
```
This happens even if the download was not the most recent job and was already cancelled.
We could set job.StopMtime (via job.SetStopMtime(timestamppb.New(mtime))) with a defer after executing all the completeSyncWorkRequests_Download() functionality, and only attempt to clean up incomplete downloads if there is no job.StopMtime set. However, this would mean abort is no longer idempotent, and there still might be corner cases where we delete data unintentionally.
I think the "right" way to do this is to download to a temporary file and then do an atomic rename when completing the download. Then, in the case of an abort, we just make sure the temporary file is deleted (and ignore the error if it no longer exists). That ensures aborts can be idempotent. One way for this to work is to use the external_id field to hold the temp path (e.g., .myfile-job-id) for downloads.
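A minimal sketch of that flow in Go, assuming the download worker knows the final path and job ID; the function names, the fetch callback, and the temp-file naming below are illustrative only, not the PR's actual code:

```go
package rst

import (
	"errors"
	"io"
	"os"
	"path/filepath"
)

// tempPathFor builds the hidden temp name discussed above, e.g. ".myfile-<job-id>".
func tempPathFor(finalPath, jobID string) string {
	return filepath.Join(filepath.Dir(finalPath), "."+filepath.Base(finalPath)+"-"+jobID)
}

// downloadToTemp writes the download into a temp file and only renames it over
// the final path once the fetch succeeds, so readers never see a partial file.
func downloadToTemp(finalPath, jobID string, fetch func(io.Writer) error) error {
	tmpPath := tempPathFor(finalPath, jobID)
	f, err := os.Create(tmpPath)
	if err != nil {
		return err
	}
	if err := fetch(f); err != nil {
		f.Close()
		os.Remove(tmpPath) // best effort; the abort path also removes it
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmpPath)
		return err
	}
	// Atomic on the same filesystem: the final path only ever holds a complete file.
	return os.Rename(tmpPath, finalPath)
}

// abortDownload removes the temp file and treats "already gone" as success,
// which keeps repeated aborts idempotent.
func abortDownload(finalPath, jobID string) error {
	if err := os.Remove(tempPathFor(finalPath, jobID)); err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	return nil
}
```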
… during the download cleanup
Force-pushed from c3bf43d to a4922cb
We discussed some of the challenges of cleaning up incomplete downloads over Slack and I'm dropping my notes here for the record. This is not an issue when a cancel only affects a single job, but there are scenarios where it can affect multiple jobs (e.g., a push for RST 1 plus a pull from RST 2, or force cancellations) that might result in accidental data loss. These were the possible approaches we came up with and the issues we identified with each:
Viable options:
While leaving around files from failed pulls is arguably not a big deal (the pull can always be rerun with …
What does this PR do / why do we need it?
These changes clean up partially downloaded files so that they are left in a valid state. If the file previously existed for the job, it is reverted to a stub; otherwise, the file is deleted.
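As a hedged illustration of that rule (the helper name, its arguments, and the exact stub handling below are assumptions for this sketch, not the PR's code; the stub content mirrors the rst://1:test example from the review thread):

```go
package rst

import (
	"errors"
	"os"
)

// cleanupIncompleteDownload sketches the rule above: restore a stub if the file
// existed before the download started, otherwise delete the partial file.
func cleanupIncompleteDownload(path, stubContents string, existedBefore bool) error {
	if existedBefore {
		// Revert to a stub, e.g. "rst://1:test" as seen in the review thread.
		return os.WriteFile(path, []byte(stubContents+"\n"), 0o644)
	}
	// The file only exists because of the partial download: remove it and treat
	// "already gone" as success so the cleanup stays idempotent.
	if err := os.Remove(path); err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	return nil
}
```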
Related Issue(s)
Required when applicable.
Where should the reviewer(s) start reviewing this?
Only required for larger PRs when this may not be immediately obvious.
Are there any specific topics we should discuss before merging?
Not required.
What are the next steps after this PR?
Not required.
Checklist before merging:
Required for all PRs.
When creating a PR, these are items to keep in mind that cannot be checked by GitHub Actions:
For more details refer to the Go coding standards and the pull request process.