
Reduce ena submission induced db load #2875

Merged
merged 8 commits into from
Sep 25, 2024
Conversation

anna-parker
Contributor

@anna-parker anna-parker commented Sep 25, 2024

resolves #

preview URL: https://reduce-ena-db-load.loculus.org/

Summary

  1. Reduce the size of the connection pools (for connections to the postgres db) used by each snakemake rule from a maximum of 4 to a maximum of 2.
  2. Increase the sleep period after each iteration of the snakemake rules from 2 to 10 seconds.
  3. Increase the period between checks of github for new data from 1 to 2 minutes.
  4. Improve error handling when requests to ENA fail (I had tests for this but they were wrong - incorrect error handling currently causes errors on main):
During handling of the above exception, another exception occurred:
...
File "/package/scripts/ena_submission_helper.py", line 392, in check_ena
    f"ENA check failed with status:{response.status_code}. "
                                    ^^^^^^^^
UnboundLocalError: cannot access local variable 'response' where it is not associated with a value
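The `UnboundLocalError` above occurs when the request itself raises before `response` is ever assigned, and the exception handler then tries to read `response.status_code`. A minimal sketch of the fix, assuming a simplified `check_ena` (the real function in `ena_submission_helper.py` differs; names and return shape here are illustrative only):

```python
import requests


def check_ena(url: str, timeout: int = 10) -> dict:
    """Query an ENA endpoint, handling the case where no response exists.

    Illustrative sketch only: initializing `response` before the try block
    means the error path can always inspect it safely.
    """
    response = None  # defined up front so the except block can reference it
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return {"status": "ok", "data": response.json()}
    except requests.RequestException as err:
        # If the request itself failed, `response` may still be None.
        status = response.status_code if response is not None else "no response"
        return {
            "status": "error",
            "message": f"ENA check failed with status: {status}. ({err})",
        }
```

With this pattern, a connection error before any response arrives reports "no response" instead of crashing with `UnboundLocalError`.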

Screenshot

❯ kubectl top pod --sum --containers --all-namespaces | egrep "backend|database|ena-submission" | grep reduce-ena
prev-reduce-ena-db-load         loculus-backend-77c6678886-l9vlv                                  backend                              11m          442Mi
prev-reduce-ena-db-load         loculus-database-668fd89855-vl6xs                                 database                             8m           296Mi
prev-reduce-ena-db-load         loculus-ena-submission-54dfc85d8f-tdqnh                           ena-submission                       1m           158Mi
prev-reduce-ena-db-load         loculus-keycloak-database-684964cd87-jjhvq                        loculus-keycloak-database            1m           34Mi

❯ kubectl top pod --sum --containers --all-namespaces | egrep "backend|database|ena-submission" | grep prev-main
prev-main                       loculus-backend-699874799b-vmjmj                                  backend                              62m          329Mi
prev-main                       loculus-database-b9f6bbd64-gnh7v                                  database                             9m           307Mi
prev-main                       loculus-ena-submission-b7d796cf7-d24p4                            ena-submission                       2m           240Mi
prev-main                       loculus-keycloak-database-78bdf95fb-9zhw4                         loculus-keycloak-database            1m           35Mi

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Sep 25, 2024
@anna-parker anna-parker marked this pull request as ready for review September 25, 2024 10:06
@anna-parker anna-parker changed the title Reduce db load Reduce ena submission induced db load Sep 25, 2024
@corneliusroemer
Contributor

IIUC, we check every 2 minutes for new data? That seems a bit excessive still, though maybe good for testing.

Should we make this configurable? So we can change it easily through values yaml depending on whether one is testing or not?

We should use 304 caching here as well if this is something that's constantly running.

None of this is blocking but might be worth doing before we enable in prod/staging etc

@corneliusroemer
Contributor

Maybe I misunderstood. Can you quickly outline which parts poll constantly to which endpoints? There seem to be at least 3 different hosts we talk to:

  1. Our db for a) finding new submissions, b) keeping track of state
  2. GitHub
  3. ENA

Which of these endpoints are talked to every minute or so?

@anna-parker
Contributor Author

I poll the postgres db for entries in a specific state (every 10 seconds now), then I poll github for data added to https://github.com/pathoplexus/ena-submission (every 2 minutes now), and only after assemblies have been submitted (i.e. there are entries in the assembly_table in state WAITING) do I poll ENA for accessions every 5 minutes (not changed by this PR).
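The three cadences described above can be sketched as a single scheduling loop. This is a hypothetical illustration only: the actual snakemake rules run as separate processes, and the ENA poller additionally depends on WAITING entries existing.

```python
import time

# Illustrative polling intervals matching the description above.
DB_POLL_SECONDS = 10        # postgres: entries in a specific state
GITHUB_POLL_SECONDS = 120   # github: new data in pathoplexus/ena-submission
ENA_POLL_SECONDS = 300      # ENA: accessions for WAITING assemblies


def run(poll_db, poll_github, poll_ena, iterations=1, clock=time.monotonic):
    """Call each poller whenever its interval has elapsed (sketch only)."""
    last = {"db": 0.0, "github": 0.0, "ena": 0.0}
    intervals = {
        "db": DB_POLL_SECONDS,
        "github": GITHUB_POLL_SECONDS,
        "ena": ENA_POLL_SECONDS,
    }
    pollers = {"db": poll_db, "github": poll_github, "ena": poll_ena}
    for _ in range(iterations):
        now = clock()
        for name, fn in pollers.items():
            if now - last[name] >= intervals[name]:
                fn()  # in the real pipeline, each poller is its own rule
                last[name] = now
```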

@anna-parker
Contributor Author

IIUC, we check every 2 minutes for new data? That seems a bit excessive still, though maybe good for testing.
Should we make this configurable? So we can change it easily through values yaml depending on whether one is testing or not?

This is for checking if new data has been uploaded to github - I can make this modifiable :-)
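Making the interval configurable could look roughly like the following values.yaml fragment. The key names are hypothetical and would have to match whatever the Helm chart's templates actually read:

```yaml
# Hypothetical values.yaml entry; key names are illustrative only.
enaSubmission:
  githubPollIntervalMinutes: 2   # lower (e.g. 1) for testing, higher for prod
```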

We should use 304 caching here as well if this is something that's constantly running.

I'm not immediately sure how to do this for requests to github but I will look into it!

@corneliusroemer
Contributor

I'm not immediately sure how to do this for requests to github but I will look into it!

Ah no - we can't do this with Github, I just meant to implement 304 for repeated loculus db requests we're making.

@corneliusroemer
Contributor

I poll the postgres db for entries in a specific state (every 10seconds now), then I poll github for data added to https://github.com/pathoplexus/ena-submission (every 2min now), then only after assemblies have been submitted (i.e. there are entries in the assembly_table in state WAITING) I poll ENA for accessions every 5min (not changed by this PR).

Cool so it's the first 10s poll to the loculus db that we should use the 304 on.

@anna-parker
Contributor Author

Cool so it's the first 10s poll to the loculus db that we should use the 304 on.

Ah ok - so this is an actual SQL query, as I talk directly to the db. But I could add a trigger table (similar to the table we added for the backend) and check whether there have been any changes there before performing the SQL query for entries in a specific state? Does that make sense?

@corneliusroemer
Contributor

I guess those specific queries here are cheap as they are only on submittable sequences of which we have very few at this point - so we can improve efficiency later. I was primarily worried about /get-released-data like expensive queries but they are only run once right now. So all good.

@anna-parker anna-parker merged commit a6a0c9f into main Sep 25, 2024
16 checks passed
@anna-parker anna-parker deleted the reduce_ena_db_load branch September 25, 2024 12:04