feat(ena-submission): Create insdc submission service #2186

anna-parker · 2024-06-26T16:33:13Z

resolves #2264

preview URL: https://ena-submission-pod.loculus.org/

Summary

This is the first step in creating an INSDC submission service. This pod has the authorization of external_metadata_updater.

At pod initialization:

Flyway migrate runs and, if it does not already exist, creates a new schema ena-submission inside the loculus database. This schema includes 4 tables: one with the general state of the submission to ENA, and one for each stage of the submission process (project, sample and assembly creation).
Flyway runs in its own Docker container. This allows us to keep the SQL table version files in a folder inside the ena-submission folder.

After initialization is complete, the pod runs:

snakemake get_ena_submission_list: For the moment this just calls the get-released-data endpoint for data in state APPROVED_FOR_RELEASE. The data is then filtered:
- data use terms must be "OPEN"
- data must not already exist in ENA or be in the submission process. To ensure this we check that:
  - data was not submitted by the config.ingest_pipeline_submitter
  - data is not in submission_table
  - as an extra check, we discard all sequences with ena-specific-metadata fields (if users uploaded correctly this should not be needed)

Ideally the sequences in ena_submission_list.json would then be manually approved and written to a new file, ena_submission_approved.json, and we would continue with the upload to ENA.
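
The filtering described above could be sketched roughly as follows. This is not the code in this PR: the record layout (accession mapped to a flat metadata dict with `dataUseTerms` and `submitter` keys), the ENA-specific field names, and the function name `filter_for_ena_submission` are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical names for ENA-specific metadata fields; the real list would come from config.
ENA_SPECIFIC_FIELDS = ("bioprojectAccession", "biosampleAccession", "insdcAccessionBase")


def filter_for_ena_submission(
    released: dict[str, dict[str, Any]],
    ingest_submitter: str,
    already_in_submission_table: set[str],
) -> dict[str, dict[str, Any]]:
    """Return the released entries that are still candidates for submission to ENA."""
    candidates = {}
    for accession, metadata in released.items():
        # Only openly shared data may be submitted.
        if metadata.get("dataUseTerms") != "OPEN":
            continue
        # Data ingested from INSDC already exists in ENA.
        if metadata.get("submitter") == ingest_submitter:
            continue
        # Data already tracked in the submission table is in progress or done.
        if accession in already_in_submission_table:
            continue
        # Extra check: any non-empty ENA-specific field suggests the data is already in ENA.
        if any(metadata.get(field) for field in ENA_SPECIFIC_FIELDS):
            continue
        candidates[accession] = metadata
    return candidates


released = {
    "LOC_1": {"dataUseTerms": "OPEN", "submitter": "alice"},
    "LOC_2": {"dataUseTerms": "RESTRICTED", "submitter": "alice"},
    "LOC_3": {"dataUseTerms": "OPEN", "submitter": "insdc_ingest_user"},
    "LOC_4": {"dataUseTerms": "OPEN", "submitter": "bob"},
    "LOC_5": {"dataUseTerms": "OPEN", "submitter": "bob", "bioprojectAccession": "PRJEB0"},
}
to_submit = filter_for_ena_submission(released, "insdc_ingest_user", {"LOC_4"})
print(sorted(to_submit))  # prints ['LOC_1']
```

Keeping each exclusion as its own early `continue` makes it easy to log why a given sequence was dropped, should that be wanted later.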

This PR also contains the rule submit_external_metadata:

Function to submit external metadata to the backend (using feat(backend): Add an end-point and metadata table for results of ENA Submission #2146) once submission is complete.

Screenshot

PR Checklist

All necessary documentation has been adapted.
Use flyway and make sure pods can start
Kubernetes configs: have one ena-submission-pod with configs for all organisms
Use external_metadata parameters for get_ena_submission_list. Check function works locally with data to submit.
Merge in work done by @corneliusroemer in Document how to submit to ENA #331

For future PRs: (these should also be performed in a loop to check for changing state)

Create project using getgroup information and add state in project_tables
Create sample using metadata information and add state in sample_tables.
Create assembly using metadata information and add state in assembly_tables.
The implemented feature is covered by an appropriate test.
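
The loop mentioned for the future project, sample and assembly steps might look something like the sketch below. Nothing here exists in this PR: the `Stage` enum, the `advance` function and the `is_stage_complete` callback are hypothetical names, and the real state would live in the project, sample and assembly tables rather than in a dict.

```python
from enum import Enum
from typing import Callable


class Stage(Enum):
    PROJECT = "project"
    SAMPLE = "sample"
    ASSEMBLY = "assembly"
    DONE = "done"


# Order in which an entry moves through the submission process.
NEXT_STAGE = {
    Stage.PROJECT: Stage.SAMPLE,
    Stage.SAMPLE: Stage.ASSEMBLY,
    Stage.ASSEMBLY: Stage.DONE,
}


def advance(
    states: dict[str, Stage],
    is_stage_complete: Callable[[str, Stage], bool],
) -> dict[str, Stage]:
    """One pass of the loop: move each entry forward if its current stage has finished."""
    updated = {}
    for accession, stage in states.items():
        if stage is not Stage.DONE and is_stage_complete(accession, stage):
            updated[accession] = NEXT_STAGE[stage]
        else:
            updated[accession] = stage
    return updated
```

Calling `advance` repeatedly (for example on a timer) would re-check every entry on each pass, which is one way to pick up state changes that happen on the ENA side between passes.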


anna-parker · 2024-07-10T16:22:09Z

The ena-submission pods are stuck waiting to start: PodInitializing - I assume this is something Flyway-related. I will remove the preview label to save computation power and try to debug locally tomorrow.

theosanderson · 2024-07-10T16:33:11Z

(just adding it back to take a look)

theosanderson · 2024-07-10T17:17:28Z

To see what's going on you pick the container in the drop down

It can't find the image I think

theosanderson

I'll have another pass over things tomorrow. For now, just to say that, in addition to my nitpicks, this is really exciting and looks great

anna-parker · 2024-07-15T06:12:09Z

One additional thought: the get-released-data endpoint returns the entire alignment as well as metadata fields. This is a lot of data to process. As the ena-submission-pod has database access anyway, do you think it would be better if the ena-submission-pod reads the main view of the loculus DB and performs these queries there? I am not sure if this would be bad design for a microservice but I could see it being a big performance increase. Tagging @chaoran-chen for this

anna-parker · 2024-07-15T15:35:23Z

Summary of offline discussion: Querying the DB directly is indeed discouraged for microservices. Instead we can use the sample/details endpoint in LAPIS (https://lapis-main.loculus.org/ebola-sudan) to get all specified metadata fields of sequences which match given features, e.g. sequences that:
- are APPROVED_FOR_RELEASE
- also, importantly, have "dataUseTerms": "OPEN" - I missed this - wupps
- potentially have "submitter": "insdc_ingest_user"
- potentially do not include ena-specific metadata fields.

corneliusroemer · 2024-07-15T17:12:53Z

I don't like the idea of going via LAPIS, it adds a whole bunch of intermediary steps, latency, potential of corruption/failure etc.

Regarding the potential inefficiency of calling get-released-data, I think worrying about this is squarely in the realm of premature optimization. Why?

- Silo import calls this endpoint all the time already. If this endpoint is a bottleneck, we should make it more efficient whether or not it is also used by ena submission.
- For current pathogens where we would use submission, the amount of data is small, at most 1GB uncompressed. Any filtering can be done on the fly if we don't want to write to disk or save in memory. That isn't required now in any way; we can do it later if needed.
- We could easily add a few filters to the backend endpoint to send fewer fields and/or filter, as an alternative to the caller filtering.
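
As a rough illustration of the "filter on the fly" point: if the response is consumed as one JSON object per line (an assumption here, as is the `metadata` key and the helper name `iter_open_records`), a generator can filter it without holding the whole response in memory or writing it to disk.

```python
import io
import json
from typing import Any, Iterable, Iterator


def iter_open_records(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Lazily yield only the records whose data use terms are OPEN."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("metadata", {}).get("dataUseTerms") == "OPEN":
            yield record


# Stand-in for a streamed HTTP response body; any iterable of lines works.
body = io.StringIO(
    '{"metadata": {"accession": "LOC_1", "dataUseTerms": "OPEN"}}\n'
    '{"metadata": {"accession": "LOC_2", "dataUseTerms": "RESTRICTED"}}\n'
)
for record in iter_open_records(body):
    print(record["metadata"]["accession"])  # prints LOC_1
```

Because only one line is parsed at a time, memory use stays roughly constant regardless of how many sequences the endpoint returns.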

Why avoid LAPIS?

- Delay (requires a full silo ingest to run; if we have millions of sequences, one might do it only daily)
- More complex dev setup: now you need LAPIS running to be able to test/develop ena submission, not just the backend
- Harder to debug: things can now get lost in many more places
...

anna-parker · 2024-07-15T20:13:10Z

Just discussed this with @corneliusroemer offline. The backend is our source of truth wrt state of sequences and release state. Querying LAPIS instead of the backend for the state of sequences adds an additional layer of complexity and possible issues.

If get-released-data does in fact turn into a bottleneck we can optimize the code then - adding a filtering option to the get-released-data endpoint might be a better choice than querying LAPIS.

For now I think it makes most sense for me to update the code to filter out sequences from the get-released-data response (and check for "dataUseTerms": "OPEN" as I missed this). I will create a follow up issue to look into optimizing this code.

@chaoran-chen please let me know if you have any issues with this - we can also discuss more tomorrow as this seems like an important design issue :-)

chaoran-chen · 2024-07-15T20:16:24Z

Okay, sounds good to me!

theosanderson

Thanks for this

anna-parker force-pushed the ena_submission_pod branch from 8ac82cb to 6ace49f on July 8, 2024 09:12

anna-parker and others added 11 commits July 10, 2024 13:38

Add an ena submission pod

1ab1029

Add to ena-submission image to build-arm-images.

4551811

docs(submission): stub

2b4a9a5

Add metadata model

b5dc0b3

How to register a study programatically

220113b

Add ENA sample submission details

42eed3c

Add assembly submission details

de64dcf

Merge ena submission docs

f4be0cb

Change user name to ExternalMetadataUpdater.

c3f888f

Update create_project_xml.py

11d009c

Add center_name to create_project_xml.py

0bf88ad

anna-parker force-pushed the ena_submission_pod branch from a6f600c to 0bf88ad on July 10, 2024 11:38

theosanderson and others added 4 commits July 10, 2024 14:23

Update ena-submission-deployment.yaml

deca3f6

Update ena-submission-deployment.yaml

a73cc9d

Update ena-submission-deployment.yaml

6a859bb

Add flyway with conf and sql into its own docker image and add to github actions.

0e8d936

anna-parker added the preview Triggers a deployment to argocd label Jul 10, 2024

anna-parker added 4 commits July 10, 2024 16:25

Check if adding to build-arm-images helps

7285d97

Fix config indentation.

d013f39

Create a kubernetes config for the ena-submission pod.

f5d0f6b

Update README.md

ccd11a4

anna-parker removed the preview Triggers a deployment to argocd label Jul 10, 2024

theosanderson added the preview Triggers a deployment to argocd label Jul 10, 2024

theosanderson added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Jul 10, 2024

anna-parker requested review from corneliusroemer, theosanderson and chaoran-chen July 12, 2024 16:52

anna-parker marked this pull request as ready for review July 12, 2024 16:53

Remove project creation related scripts -> these will be added in a later PR

32e20e1

theosanderson reviewed Jul 14, 2024

deploy.py (outdated, resolved)

theosanderson reviewed Jul 14, 2024

ena-submission/ENA_submission.md (outdated, resolved)

theosanderson reviewed Jul 14, 2024

ena-submission/ENA_submission.md (resolved)

theosanderson reviewed Jul 14, 2024

ena-submission/scripts/call_loculus.py (outdated, resolved)

theosanderson reviewed Jul 15, 2024

Add suggestions

abb7835

Check that data is also open.

18203ce

theosanderson reviewed Jul 16, 2024

kubernetes/loculus/templates/ena-submission-config.yaml (outdated, resolved)

theosanderson reviewed Jul 16, 2024

ena-submission/Snakefile (outdated, resolved)

anna-parker added 7 commits July 16, 2024 16:32

Change print to log

340564e

Make helper functions clearer

66ed3c0

Merge branch 'main' into ena_submission_pod

ad4a614

Refactor

a20080f

Fix configs

09c91b9

Fix order

438103a

Merge remote-tracking branch 'origin/ena_submission_pod' into ena_submission_pod

ee0ffe4

anna-parker requested a review from theosanderson July 23, 2024 09:51

theosanderson approved these changes Jul 23, 2024

anna-parker merged commit 6fd7e65 into main Jul 23, 2024
12 checks passed

anna-parker deleted the ena_submission_pod branch July 23, 2024 17:12
