
Reduce ena submission induced db load #2875

Merged
merged 8 commits into from
Sep 25, 2024
Conversation

anna-parker
Contributor

@anna-parker anna-parker commented Sep 25, 2024

resolves #

preview URL: https://reduce-ena-db-load.loculus.org/

Summary

  1. Reduce the size of the connection pools (for connections to the postgres db) used by each snakemake rule from a maximum of 4 to a maximum of 2.
  2. Increase the sleep period after each iteration of the snakemake rules from 2 to 10 seconds.
  3. Increase the period between checks of github for new data from 1 to 2 minutes.
  4. Improve error handling when requests to ENA fail (I had tests for this but they were wrong - incorrect error handling currently causes errors on main):
During handling of the above exception, another exception occurred:
...
File "/package/scripts/ena_submission_helper.py", line 392, in check_ena
    f"ENA check failed with status:{response.status_code}. "
                                    ^^^^^^^^
UnboundLocalError: cannot access local variable 'response' where it is not associated with a value
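The `UnboundLocalError` above occurs when the request itself raises before `response` is ever assigned, and the exception handler then tries to read `response.status_code`. A minimal sketch of the fix, assuming a simplified `check_ena` (the real function in `ena_submission_helper.py` differs; names and return shape here are illustrative only):

```python
import requests


def check_ena(url: str, timeout: int = 10) -> dict:
    """Query an ENA endpoint, handling the case where no response exists.

    Illustrative sketch only: initializing `response` before the try block
    means the error path can always inspect it safely.
    """
    response = None  # defined up front so the except block can reference it
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return {"status": "ok", "data": response.json()}
    except requests.RequestException as err:
        # If the request itself failed, `response` may still be None.
        status = response.status_code if response is not None else "no response"
        return {
            "status": "error",
            "message": f"ENA check failed with status: {status}. ({err})",
        }
```

With this pattern, a connection error before any response arrives reports "no response" instead of crashing with `UnboundLocalError`.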

Screenshot

❯ kubectl top pod --sum --containers --all-namespaces | egrep "backend|database|ena-submission" | grep reduce-ena
prev-reduce-ena-db-load         loculus-backend-77c6678886-l9vlv                                  backend                              11m          442Mi
prev-reduce-ena-db-load         loculus-database-668fd89855-vl6xs                                 database                             8m           296Mi
prev-reduce-ena-db-load         loculus-ena-submission-54dfc85d8f-tdqnh                           ena-submission                       1m           158Mi
prev-reduce-ena-db-load         loculus-keycloak-database-684964cd87-jjhvq                        loculus-keycloak-database            1m           34Mi

❯ kubectl top pod --sum --containers --all-namespaces | egrep "backend|database|ena-submission" | grep prev-main
prev-main                       loculus-backend-699874799b-vmjmj                                  backend                              62m          329Mi
prev-main                       loculus-database-b9f6bbd64-gnh7v                                  database                             9m           307Mi
prev-main                       loculus-ena-submission-b7d796cf7-d24p4                            ena-submission                       2m           240Mi
prev-main                       loculus-keycloak-database-78bdf95fb-9zhw4                         loculus-keycloak-database            1m           35Mi

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Sep 25, 2024
@anna-parker anna-parker marked this pull request as ready for review September 25, 2024 10:06
@anna-parker anna-parker changed the title Reduce db load Reduce ena submission induced db load Sep 25, 2024
@corneliusroemer
Contributor

IIUC, we check every 2 minutes for new data? That seems a bit excessive still, though maybe good for testing.

Should we make this configurable? So we can change it easily through values yaml depending on whether one is testing or not?

We should use 304 caching here as well if this is something that's constantly running.

None of this is blocking but might be worth doing before we enable in prod/staging etc

@corneliusroemer
Contributor

Maybe I misunderstood. Can you quickly outline which parts poll constantly to which endpoints? There seem to be at least 3 different hosts we talk to:

  1. Our db for a) finding new submissions, b) keeping track of state
  2. GitHub
  3. ENA

Which of these endpoints are talked to every minute or so?

@anna-parker
Contributor Author

I poll the postgres db for entries in a specific state (every 10 seconds now), then I poll github for data added to https://github.com/pathoplexus/ena-submission (every 2 minutes now), and only after assemblies have been submitted (i.e. there are entries in the assembly_table in state WAITING) do I poll ENA for accessions every 5 minutes (not changed by this PR).
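The three cadences described above can be sketched as a single scheduling loop. This is a hypothetical illustration only: the actual snakemake rules run as separate processes, and the ENA poller additionally depends on WAITING entries existing.

```python
import time

# Illustrative polling intervals matching the description above.
DB_POLL_SECONDS = 10        # postgres: entries in a specific state
GITHUB_POLL_SECONDS = 120   # github: new data in pathoplexus/ena-submission
ENA_POLL_SECONDS = 300      # ENA: accessions for WAITING assemblies


def run(poll_db, poll_github, poll_ena, iterations=1, clock=time.monotonic):
    """Call each poller whenever its interval has elapsed (sketch only)."""
    last = {"db": 0.0, "github": 0.0, "ena": 0.0}
    intervals = {
        "db": DB_POLL_SECONDS,
        "github": GITHUB_POLL_SECONDS,
        "ena": ENA_POLL_SECONDS,
    }
    pollers = {"db": poll_db, "github": poll_github, "ena": poll_ena}
    for _ in range(iterations):
        now = clock()
        for name, fn in pollers.items():
            if now - last[name] >= intervals[name]:
                fn()  # in the real pipeline, each poller is its own rule
                last[name] = now
```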

@anna-parker
Contributor Author

IIUC, we check every 2 minutes for new data? That seems a bit excessive still, though maybe good for testing.
Should we make this configurable? So we can change it easily through values yaml depending on whether one is testing or not?

This is for checking if new data has been uploaded to github - I can make this modifiable :-)
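Making the interval configurable could look roughly like the following values.yaml fragment. The key names are hypothetical and would have to match whatever the Helm chart's templates actually read:

```yaml
# Hypothetical values.yaml entry; key names are illustrative only.
enaSubmission:
  githubPollIntervalMinutes: 2   # lower (e.g. 1) for testing, higher for prod
```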

We should use 304 caching here as well if this is something that's constantly running.

I'm not immediately sure how to do this for requests to github but I will look into it!

@corneliusroemer
Contributor

I'm not immediately sure how to do this for requests to github but I will look into it!

Ah no - we can't do this with Github, I just meant to implement 304 for repeated loculus db requests we're making.

@corneliusroemer
Contributor

I poll the postgres db for entries in a specific state (every 10seconds now), then I poll github for data added to https://github.com/pathoplexus/ena-submission (every 2min now), then only after assemblies have been submitted (i.e. there are entries in the assembly_table in state WAITING) I poll ENA for accessions every 5min (not changed by this PR).

Cool so it's the first 10s poll to the loculus db that we should use the 304 on.

@anna-parker
Contributor Author

Cool so it's the first 10s poll to the loculus db that we should use the 304 on.

Ah ok - so this is an actual SQL query, as I talk directly to the db. But I could add a trigger table (similar to the table we added for the backend) and check whether there have been any changes there before performing the SQL query for entries in a specific state? Does that make sense?

@corneliusroemer
Contributor

I guess those specific queries here are cheap as they are only on submittable sequences of which we have very few at this point - so we can improve efficiency later. I was primarily worried about /get-released-data like expensive queries but they are only run once right now. So all good.

@anna-parker anna-parker merged commit a6a0c9f into main Sep 25, 2024
16 checks passed
@anna-parker anna-parker deleted the reduce_ena_db_load branch September 25, 2024 12:04