@jvwong
Member

@jvwong jvwong commented Mar 7, 2022

Two changes to the cron job for document updates.

  1. Skip updates to relatedPapers: Drop this rather heavy, long-running task (see "Examining the database data / Root of large data dump size", #1024 (comment)). It's not clear that updating these makes any tangible difference (e.g. surfacing the most recent papers of interest).

  2. Filter docs to update: Strip out documents without any paperId, that is, documents that were auto-created to notify referenced authors that their paper was cited by a Biofactoid document (viral email).
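The filtering step in item 2 could look roughly like the sketch below. Only the `paperId` field name comes from the description above; the `Doc` shape and the `selectDocsToUpdate` helper are hypothetical names used for illustration, and the actual cron job may store or check `paperId` differently.

```typescript
// Hypothetical document shape; only `paperId` is named in the PR description.
interface Doc {
  id: string;
  paperId?: string | null;
}

// Keep only documents that carry a non-empty paperId.
// Auto-created documents (viral email) have none and are skipped.
function selectDocsToUpdate(docs: Doc[]): Doc[] {
  return docs.filter(
    (doc) => typeof doc.paperId === 'string' && doc.paperId.trim() !== ''
  );
}
```

With this kind of filter, the update loop only visits documents that correspond to a submitted article.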

In this way, documents will get fresh data for the submitted article, and those submitting a paper not in PubMed at the time will be checked as well.

Some empirical analysis to follow up on (#1024 (comment)):

| Condition | Size (MB) | Size (bytes) | Delta | Elapsed Time |
| --- | --- | --- | --- | --- |
| Base | 104.76 | 104759158 | | |
| CRON job (unstable) | 246.22 | 246224854 | 135% | ~24 hours |
| CRON job (this branch) | 105.27 | 105268838 | 0.4% | ~1 minute |
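For reference, the Delta column reads as the relative size increase over the base dump. A minimal sketch of that arithmetic, assuming this interpretation and using the byte counts reported above (`percentIncrease` is a hypothetical helper name):

```typescript
// Relative increase of `size` over `base`, as a percentage.
function percentIncrease(base: number, size: number): number {
  return ((size - base) / base) * 100;
}

// Byte counts taken from the measurements above.
const baseBytes = 104759158;
const unstableBytes = 246224854;
const branchBytes = 105268838;

const unstableDelta = percentIncrease(baseBytes, unstableBytes);
const branchDelta = percentIncrease(baseBytes, branchBytes);

console.log(`unstable: ${unstableDelta.toFixed(2)}%`);
console.log(`this branch: ${branchDelta.toFixed(2)}%`);
```

Under this reading, the unstable job more than doubles the dump size, while this branch adds well under one percent.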

Refs #1024

@jvwong jvwong requested a review from maxkfranz March 7, 2022 21:38
@maxkfranz
Member

> 1. Skip updates to relatedPapers: Drop this rather heavy, long-running task (see "Examining the database data / Root of large data dump size", #1024 (comment)). It's not completely clear that updates to these make any tangible difference (e.g. the most recent papers of interest).

It will be worth considering how that process could be reworked more generally with respect to future efforts on recency. This is good for now, though we'll have to map this out further in the future.

@jvwong jvwong merged commit 7313c84 into unstable Mar 8, 2022
@jvwong jvwong deleted the iss1024_cron-update-tasks branch March 11, 2022 19:18