Skip to content

[ENG-8175] Fixed potential race condition #11179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ihorsokhanexoft
Copy link
Contributor

@ihorsokhanexoft ihorsokhanexoft commented Jun 11, 2025

Purpose

User received an email of archive failure regardless of it was successful. It may happen because archive_success is run asynchronously

@project_signals.archive_callback.connect
def archive_callback(dst):
"""Blinker listener for updates to the archive task. When the tree of ArchiveJob
instances is complete, proceed to send success or failure mails
:param dst: registration Node
"""
root = dst.root
root_job = root.archive_job
if not root_job.archive_tree_finished():
return
if root_job.sent:
return
if root_job.success:
# Prevent circular import with app.py
from website.archiver import tasks
tasks.archive_success.delay(dst_pk=root._id, job_pk=root_job._id)

and may be finished before the main thread finishes archiving process. So at first archive_success is processed and no files found, thus email is sent, then the main thread finishes files processing and this archiving is successful actually.
Also it's possible that the celery queue had too many tasks to process and when the main thread finishes archiving, user sees his registration and when celery processes archive_success tasks that fails, user receives this email.

Changes

Added a one-time retry.

Ticket

https://openscience.atlassian.net/browse/ENG-8175?atlOrigin=eyJpIjoiMjg4MWM1YWI1ZTE3NDMyZmEyODk2Y2QxZjlhNjFlOGQiLCJwIjoiaiJ9

@ihorsokhanexoft ihorsokhanexoft changed the base branch from develop to feature/pbs-25-10 June 11, 2025 10:40
Copy link
Collaborator

@brianjgeiger brianjgeiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question.

'missing_files': err.missing_files,
},
)
self.retry(exc=err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of the retry? Are you expecting a situation where the files will be found on the retry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it looks like this celery task is run faster than attaching files to a node, this is why they cannot be found. Why I think so because Mark mentioned that despite a user got email with "some files were altered" message, these files were actually included in the node registration, so maybe there was some retry and this celery task was re-run or Matt ran it manually. So if we don't find some files, it may be related to race condition and we make one more attempt in 5 minutes. If it doesn't work -> raise error
In order to avoid this race condition I think we can make this task synchronous, but in this case user'll need to wait a bit more time on registration page to not abort a request

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I described this issue to Matt and it's a known bug:

Issue appears to be a known bug wherein those file_urls don't get updated after registration is created, leaving those links pointing to the same file on the node. Unsure about the exact cause, but usually just calling migrate_file_metadata works. Would you like me to run that against it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And migrate_file_metadata is a problem place that checks if files are found

@brianjgeiger brianjgeiger changed the base branch from feature/pbs-25-10 to feature/pbs-25-13 June 30, 2025 17:04
@brianjgeiger brianjgeiger merged commit 6c23802 into CenterForOpenScience:feature/pbs-25-13 Jun 30, 2025
6 checks passed
Johnetordoff added a commit to Johnetordoff/osf.io that referenced this pull request Jul 16, 2025
…cience/osf.io into add-brand-to-collection-provider

* 'feature/pbs-25-13' of https://github.com/CenterForOpenScience/osf.io: (25 commits)
  [ENG-8224] Fixed force archive template with registration addons (CenterForOpenScience#11210)
  API: Allow /v2/users/me/preprints list view to filter by title
  [ENG-8325] Public column does not display the visibility status of child nodes on the Nodes page in the Admin App
  [ENG-8246] Fixed deletion of maintenance alerts in admin (CenterForOpenScience#11226)
  add exception handling to /review_actions/ endpoint
  added a route to download node metadata (CenterForOpenScience#11215)
  [ENG-7929] Ability to move registrations to draft state (CenterForOpenScience#11153)
  switch to new UI when user views draft registration file (CenterForOpenScience#11144)
  [ENG-5862] SPAM - Fix Wiki Spamming (CenterForOpenScience#11171)
  improved displaying of stashed urls and approval state in admin (CenterForOpenScience#11193)
  [ENG-7962] Fix User Setting Response Payload async mailchimp perference change issues (CenterForOpenScience#11136)
  fixed children/parent fields in admin templates (CenterForOpenScience#11156)
  don't add multiple group perms for preprint provider (CenterForOpenScience#11159)
  fix content overflow for node page (CenterForOpenScience#11182)
  [ENG-8096] Admins on projects are unable to reject user access requests (CenterForOpenScience#11163)
  added retry to avoid race condition (CenterForOpenScience#11179)
  upgrade django to 4.2.17 (CenterForOpenScience#11173)
  add additional information to user admin (CenterForOpenScience#11184)
  [ENG-8192] Ability to force archive registrations when OSFS Folders have changed (CenterForOpenScience#11194)
  [ENG-8193] Fix issues with Preprint submission via API (CenterForOpenScience#11185)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants