@swartzn swartzn commented Jun 9, 2025

What does this PR do / why do we need it?

These changes clean up partially downloaded files so that they are left in a valid state. For jobs where the file previously existed, the file is reverted to a stub; otherwise, the partial file is deleted.

Related Issue(s)

Required when applicable.

Where should the reviewer(s) start reviewing this?

Only required for larger PRs when this may not be immediately obvious.

Are there any specific topics we should discuss before merging?

Not required.

What are the next steps after this PR?

Not required.

Checklist before merging:

Required for all PRs.

When creating a PR, these are items to keep in mind that cannot be checked by GitHub Actions:

  • Documentation:
    • Does developer documentation (code comments, readme, etc.) need to be added or updated?
    • Does the user documentation need to be expanded or updated for this change?
  • Testing:
    • Does this functionality require changing or adding new unit tests?
    • Does this functionality require changing or adding new integration tests?
  • Git Hygiene:

For more details refer to the Go coding standards and the pull request process.

@swartzn swartzn requested a review from a team as a code owner June 9, 2025 13:21
@swartzn swartzn force-pushed the swartzn/cleanup-partial-downloads branch from 69e6399 to ae5b525 on June 9, 2025 13:27
@swartzn swartzn requested review from sundereshwar and iamjoemccormick and removed request for sundereshwar June 11, 2025 14:55
sundereshwar previously approved these changes Jun 13, 2025
@swartzn swartzn self-assigned this Jun 13, 2025
@swartzn swartzn force-pushed the swartzn/cleanup-partial-downloads branch 2 times, most recently from 71f4dc6 to c3bf43d on June 13, 2025 17:26
@iamjoemccormick iamjoemccormick left a comment

I found one corner case that may lead to accidental data loss. When forcibly cancelling jobs, if there was ever a download in the job history, cancelling resets the file back to a stub:

$ cat test
hello world

$ beegfs remote job list test
OK  PATH   TARGET  JOB_UPDATED           REQUEST_TYPE    
✅  /test  1       2025-07-08T18:31:24Z  UPLOAD   

$ beegfs remote job cancel test --force
OK  PATH   TARGET  JOB_ID                                REQUEST_TYPE  
🚫  /test  1       8c18cc00-62da-4c1f-8845-84c04ea4b87f  UPLOAD        
🚫  /test  1       7ee0c2fa-1ec5-497c-a88a-76c9010481f6  DOWNLOAD

$ cat test
rst://1:test

This happens even if the download was not the most recent job and was already cancelled.

We could call job.SetStopMtime(timestamppb.New(mtime)) in a defer after executing all of the completeSyncWorkRequests_Download() functionality, and only attempt to clean up incomplete downloads if job.StopMtime is not set. However, this would mean abort is no longer idempotent, and there might still be corner cases where we delete data unintentionally.
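A minimal sketch of that guard, assuming a hypothetical Job type standing in for the real job proto (the actual implementation would use timestamppb on the proto message):

```go
package main

import (
	"fmt"
	"time"
)

// Job is a hypothetical stand-in for the real job record; the actual field
// is a protobuf timestamp set via job.SetStopMtime(timestamppb.New(mtime)).
type Job struct {
	StopMtime *time.Time
}

func (j *Job) SetStopMtime(t time.Time) { j.StopMtime = &t }

// completeSyncWorkRequestsDownload records the stop mtime in a defer once
// all download work requests have executed, marking the job as finished.
func completeSyncWorkRequestsDownload(j *Job) {
	defer func() {
		now := time.Now()
		j.SetStopMtime(now)
	}()
	// ... process the download work requests ...
}

// cleanupIncompleteDownload only reverts the file if the job never recorded
// a StopMtime, i.e. the download did not run to completion.
func cleanupIncompleteDownload(j *Job) bool {
	if j.StopMtime != nil {
		return false // job already completed; leave the file alone
	}
	// ... delete the partial file or reset it to a stub ...
	return true
}

func main() {
	j := &Job{}
	fmt.Println(cleanupIncompleteDownload(j)) // true: never completed
	completeSyncWorkRequestsDownload(j)
	fmt.Println(cleanupIncompleteDownload(j)) // false: StopMtime is set
}
```

This illustrates the downside mentioned above: once StopMtime is set, a repeated abort skips cleanup, so abort is no longer idempotent with respect to partially written data.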

I think the "right" way to do this is to download to a temporary file and then do an atomic rename when completing the download. Then, in the case of an abort, we just make sure the temporary file is deleted (ignoring errors if it no longer exists). That ensures aborts are idempotent. One way to make this work is to use the external_id field to hold the temp path (e.g., .myfile-job-id) for downloads.

@swartzn swartzn force-pushed the swartzn/cleanup-partial-downloads branch from c3bf43d to a4922cb on July 9, 2025 18:56
@iamjoemccormick

We discussed some of the challenges cleaning up incomplete downloads over Slack and I'm dropping my notes here for the record.

This is not an issue when cancel only affects a single job, but there are scenarios where it can affect multiple jobs (e.g., push for RST 1, pull from RST 2, or force cancellations) that might result in accidental data loss. These were the possible approaches we came up with and the issues we identified with each:

  • Make RST provider functions for cancelling jobs aware if they are the most recent or a historical job and adjust cleanup behavior accordingly.

    • Issue: Pushes state and other details into the provider logic that doesn't belong there.
  • Only reset files to a stub if the file data state is still "stub".

    • Issue: There might be multiple historical jobs for different RSTs/paths which could cause the stub file to be reset incorrectly.

Viable options:

  • Modify cancel to only affect the most recent job unless --all or a specific --job-id is specified.

    • Potential concerns: This changes the semantics of job cancel, which might be confusing since the functionality is already released.
  • Download to a temporary per-job file that can be cleaned up. This is the most "correct" approach because it also makes downloads atomic.

    • Potential concerns: Adds performance overhead (create/rename) and when overwriting an existing file consumes additional space temporarily, which may cause issues with quotas or available capacity.

While leaving around files from failed pulls is arguably not a big deal (the pull can always be rerun with --overwrite), when pulling to rehydrate a stub file it would be ideal to reset the file back to a stub when cancelling/aborting jobs.
