Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ena-submission): Create insdc submission service #2186

Merged
merged 56 commits into from
Jul 23, 2024

Conversation

anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Jun 26, 2024

resolves #2264

preview URL: https://ena-submission-pod.loculus.org/

Summary

This is the first step in creating an INSDC submission service. This pod has the authorization of external_metadata_updater.

At pod initialization:

  • Flyway migrate will run and if not already existing it will create a new schema ena-submission inside the loculus database. This schema will include 4 tables: 1 with general state of submission to ENA, and then one for each stage of the submission process: project, sample and assembly creation.
  • Flyway runs in its own docker container. This allows us to keep the sql table version files in a folder inside the ena-submission folder.

After initialization is complete the pod runs

  • snakemake get_ena_submission_list: For the moment this just calls the get-released-data endpoint for data in state APPROVED_FOR_RELEASE. The data is then filtered:
    • data must be state "OPEN" for use
    • data must not already exist in ENA or be in the submission process.To prevent this we need to make sure:
      • data was not submitted by the config.ingest_pipeline_submitter
      • data is not in submission_table
      • as an extra check we discard all sequences with ena-specific-metadata fields (if users uploaded correctly this should not be needed)
  • Ideally the sequences in ena_submission_list.json would then be manually approved and written to a new file ena_submission_approved.json and we would continue with the upload to ENA.

This PR also contains the rule submit_external_metadata:

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • Use flyway and make sure pods can start
  • Kubernetes configs: have one ena-submission-pod with configs for all organisms
  • Use external_metadata parameters for get_ena_submission_list. Check function works locally with data to submit.
  • Merge in work done by @corneliusroemer in Document how to submit to ENA #331

For future PRs: (these should also be performed in a loop to check for changing state)

  • Create project using getgroup information and add state in project_tables
  • Create sample using metadata information and add state in sample_tables.
  • Create assembly using metadata information and add state in assembly_tables.
  • The implemented feature is covered by an appropriate test.

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Jul 10, 2024
@anna-parker
Copy link
Contributor Author

The ena-submission pods are stuck in waiting to start: PodInitializing - I assume this is sth flyway related. I will remove the preview label to save computation power and try to debug locally tomorrow.

@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Jul 10, 2024
@theosanderson theosanderson added the preview Triggers a deployment to argocd label Jul 10, 2024
@theosanderson
Copy link
Member

(just adding it back to take a look)

@theosanderson
Copy link
Member

To see what's going on you pick the container in the drop down
image

It can't find the image I think

@theosanderson theosanderson added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Jul 10, 2024
@anna-parker anna-parker marked this pull request as ready for review July 12, 2024 16:53
deploy.py Outdated Show resolved Hide resolved
Copy link
Member

@theosanderson theosanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have another pass over things tomorrow –for now just to say that in addition to my nitpicks this is really exciting and looks great

@anna-parker
Copy link
Contributor Author

anna-parker commented Jul 15, 2024

One additional thought: the get-released-data endpoint returns the entire alignment as well as metadata fields. This is a lot of data to process. As the ena-submission-pod anyways has database access - do think it would be better if the ena-submission-pod reads the main view of the loculus DB and performs these queries there? I am not sure if this would be bad design for a microservice but I could see it being a big performance increase. Tagging @chaoran-chen for this

@anna-parker
Copy link
Contributor Author

One additional thought: the get-released-data endpoint returns the entire alignment as well as metadata fields. This is a lot of data to process. As the ena-submission-pod anyways has database access - do think it would be better if the ena-submission-pod reads the main view of the loculus DB and performs these queries there? I am not sure if this would be bad design for a microservice but I could see it being a big performance increase. Tagging @chaoran-chen for this

Summary of offline discussion: Querying the DB directly is indeed discouraged for microservices. Instead we can use the sample/details endpoint in LAPIS (https://lapis-main.loculus.org/ebola-sudan) to get all specified metadata fields of sequences which match given features: e.g.

  • that are APPROVED_FOR_RELEASE
  • also importantly they need to have "dataUseTerms": "OPEN" - I missed this - wupps
  • potentially "submitter": "insdc_ingest_user"
  • potentially does not include ena-specific metadata fields.

@corneliusroemer
Copy link
Contributor

I don't like the idea of going via LAPIS, it adds a whole bunch of intermediary steps, latency, potential of corruption/failure etc.

Regarding potential inefficiency of calling get-released-sequences, I think worrying about this is squarely in the realm of premature optimization. Why?

  1. Silo import calls this endpoint all the time already, if this endpoint is a bottleneck, we should make it more efficient whether it's used also by ena submission or not
  2. For current pathogens where we would use submission, the amount of data is small, at most 1GB uncompressed. Any filtering can be done on the fly if we don't want to write to disk or save in memory. That isn't required now in any way we can do later if needed.
  3. We could easily add a few filters to the backend endpoint to send fewer fields and or filter, as an alternative to the caller filtering.

Why avoid LAPIS?

  1. Delay (requires full silo ingest to run, if we have millions of sequences, one might do it only daily)
  2. More complex dev setup: now you need lapis running to be able to test/develop ena submission, not just backend
  3. harder to debug: things can now get lost in many more places
    ...

@anna-parker
Copy link
Contributor Author

anna-parker commented Jul 15, 2024

Just discussed this with @corneliusroemer offline. The backend is our source of truth wrt state of sequences and release state. Querying LAPIS instead of the backend for the state of sequences adds an additional layer of complexity and possible issues.

If get-released-data does in fact turn into a bottleneck we can optimize the code then - adding a filtering option to the get-released-data endpoint might be a better choice than querying LAPIS.

For now I think it makes most sense for me to update the code to filter out sequences from the get-released-data response (and check for "dataUseTerms": "OPEN" as I missed this). I will create a follow up issue to look into optimizing this code.

@chaoran-chen please let me know if you have any issues with this - we can also discuss more tomorrow as this seems like an important design issue :-)

@chaoran-chen
Copy link
Member

Okay, sounds good to me!

Copy link
Member

@theosanderson theosanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this

@anna-parker anna-parker merged commit 6fd7e65 into main Jul 23, 2024
12 checks passed
@anna-parker anna-parker deleted the ena_submission_pod branch July 23, 2024 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Triggers a deployment to argocd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENA submission
4 participants