@swartzn swartzn commented Jun 9, 2025

What does this PR do / why do we need it?

These changes clean up partially downloaded files so that they are left in a valid state. For jobs where the file previously existed, the file is reverted to a stub; otherwise, the partial file is deleted.

Related Issue(s)

Required when applicable.

Where should the reviewer(s) start reviewing this?

Only required for larger PRs when this may not be immediately obvious.

Are there any specific topics we should discuss before merging?

Not required.

What are the next steps after this PR?

Not required.

Checklist before merging:

Required for all PRs.

When creating a PR, these are items to keep in mind that cannot be checked by GitHub Actions:

  • Documentation:
    • Does developer documentation (code comments, readme, etc.) need to be added or updated?
    • Does the user documentation need to be expanded or updated for this change?
  • Testing:
    • Does this functionality require changing or adding new unit tests?
    • Does this functionality require changing or adding new integration tests?
  • Git Hygiene:

For more details refer to the Go coding standards and the pull request process.

@swartzn swartzn requested a review from a team as a code owner June 9, 2025 13:21
@swartzn swartzn force-pushed the swartzn/cleanup-partial-downloads branch from 69e6399 to ae5b525 on June 9, 2025 13:27
@swartzn swartzn requested review from sundereshwar and iamjoemccormick and removed request for sundereshwar June 11, 2025 14:55
sundereshwar previously approved these changes Jun 13, 2025
@swartzn swartzn self-assigned this Jun 13, 2025
@swartzn swartzn force-pushed the swartzn/cleanup-partial-downloads branch 2 times, most recently from 71f4dc6 to c3bf43d on June 13, 2025 17:26
@iamjoemccormick iamjoemccormick left a comment

I found one corner case that may lead to accidental data loss. When forcibly cancelling jobs, if there was ever a download in the job history, cancelling resets the file back to a stub:

$ cat test
hello world

$ beegfs remote job list test
OK  PATH   TARGET  JOB_UPDATED           REQUEST_TYPE    
✅  /test  1       2025-07-08T18:31:24Z  UPLOAD   

$ beegfs remote job cancel test --force
OK  PATH   TARGET  JOB_ID                                REQUEST_TYPE  
🚫  /test  1       8c18cc00-62da-4c1f-8845-84c04ea4b87f  UPLOAD        
🚫  /test  1       7ee0c2fa-1ec5-497c-a88a-76c9010481f6  DOWNLOAD

$ cat test
rst://1:test

This happens even if the download was not the most recent job and was already cancelled.

We could call job.SetStopMtime(timestamppb.New(mtime)) in a defer after executing all of the completeSyncWorkRequests_Download() functionality, and only attempt to clean up incomplete downloads if job.StopMtime is not set. However, this would mean abort is no longer idempotent, and there might still be corner cases where we delete data unintentionally.
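A minimal sketch of that guard, assuming a hypothetical Job type standing in for the real job proto (the actual implementation would use timestamppb on the proto message):

```go
package main

import (
	"fmt"
	"time"
)

// Job is a hypothetical stand-in for the real job record; the actual field
// is a protobuf timestamp set via job.SetStopMtime(timestamppb.New(mtime)).
type Job struct {
	StopMtime *time.Time
}

func (j *Job) SetStopMtime(t time.Time) { j.StopMtime = &t }

// completeSyncWorkRequestsDownload records the stop mtime in a defer once
// all download work requests have executed, marking the job as finished.
func completeSyncWorkRequestsDownload(j *Job) {
	defer func() {
		now := time.Now()
		j.SetStopMtime(now)
	}()
	// ... process the download work requests ...
}

// cleanupIncompleteDownload only reverts the file if the job never recorded
// a StopMtime, i.e. the download did not run to completion.
func cleanupIncompleteDownload(j *Job) bool {
	if j.StopMtime != nil {
		return false // job already completed; leave the file alone
	}
	// ... delete the partial file or reset it to a stub ...
	return true
}

func main() {
	j := &Job{}
	fmt.Println(cleanupIncompleteDownload(j)) // true: never completed
	completeSyncWorkRequestsDownload(j)
	fmt.Println(cleanupIncompleteDownload(j)) // false: StopMtime is set
}
```

This illustrates the downside mentioned above: once StopMtime is set, a repeated abort skips cleanup, so abort is no longer idempotent with respect to partially written data.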

I think the "right" way to do this is to download to a temporary file and then do an atomic rename when completing the download. Then, in the case of an abort, we just make sure the temporary file is deleted (ignoring errors if it no longer exists). That ensures aborts are idempotent. One way to make this work is to use the external_id field to hold the temp path (e.g., .myfile-job-id) for downloads.

@swartzn swartzn force-pushed the swartzn/cleanup-partial-downloads branch from c3bf43d to a4922cb on July 9, 2025 18:56
@iamjoemccormick

We discussed some of the challenges cleaning up incomplete downloads over Slack and I'm dropping my notes here for the record.

This is not an issue when cancel only affects a single job, but there are scenarios where it can affect multiple jobs (e.g., push for RST 1, pull from RST 2, or force cancellations) that might result in accidental data loss. These were the possible approaches we came up with and the issues we identified with each:

  • Make RST provider functions for cancelling jobs aware if they are the most recent or a historical job and adjust cleanup behavior accordingly.

    • Issue: Pushes state and other details into the provider logic that doesn't belong there.
  • Only reset files to a stub if the file data state is still "stub".

    • Issue: There might be multiple historical jobs for different RSTs/paths which could cause the stub file to be reset incorrectly.

Viable options:

  • Modify cancel to only affect the most recent job unless --all or a specific --job-id is specified.

    • Potential concerns: This changes the semantics of job cancel, which might be confusing since the functionality is already released.
  • Download to a temporary per-job file that can be cleaned up. This is the most "correct" approach because it also makes downloads atomic.

    • Potential concerns: Adds performance overhead (create/rename) and when overwriting an existing file consumes additional space temporarily, which may cause issues with quotas or available capacity.

While leaving around files from failed pulls is arguably not a big deal (the pull can always be rerun with --overwrite), when pulling to rehydrate a stub file it would be ideal to reset the file back to a stub when cancelling/aborting jobs.
