feat(ena-submission): Add INSDC accession to results of ena submission and upload to backend. (#2845)

* Add INSDC accession base and accession full to results of ena submission and upload to backend.

* Update readme with more tips on local testing and warning

* Refactor to make test a flag and local testing clearer.

* Wait longer when retrying if get_group_info fails.
anna-parker authored Sep 24, 2024
1 parent 37f4ab8 commit b264449
Showing 8 changed files with 277 additions and 55 deletions.
108 changes: 98 additions & 10 deletions ena-submission/README.md
@@ -151,6 +151,10 @@ Then run snakemake using `snakemake` or `snakemake {rule}`.

## Testing

> [!WARNING]
> When testing, always submit to ENA's test/dev instance. For XML POST requests (i.e. project and sample creation) this means sending them to `https://wwwdev.ebi.ac.uk/ena`; for webin-cli requests (i.e. assembly creation) it means adding the `-test` flag. Both happen automatically when `submit_to_ena_prod` is set to `False` (the default). Do not change this flag locally unless you know what you are doing.
> Using our ENA test account does **not** affect which ENA instance you submit to: if you use our test account and submit to ENA production, you will have officially submitted samples to ENA.
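
As a rough illustration of what the flag controls, the sketch below mirrors the dev/prod switch that this commit adds to the Snakefile. The function name `resolve_ena_endpoints` and the `test_flag` key are hypothetical and not part of the repository; the URLs and the other config keys are the ones that appear in the Snakefile changes.

```python
def resolve_ena_endpoints(config: dict) -> dict:
    """Return ENA endpoints for the dev or production instance.

    Hypothetical helper: mirrors the Snakefile logic, where a missing
    `submit_to_ena_prod` key falls back to False (i.e. the dev instance).
    """
    submit_to_prod = config.get("submit_to_ena_prod", False)
    host = "https://www.ebi.ac.uk" if submit_to_prod else "https://wwwdev.ebi.ac.uk"
    folder = "approved" if submit_to_prod else "test"
    return {
        "ena_submission_url": f"{host}/ena/submit/drop-box/submit",
        "ena_reports_service_url": f"{host}/ena/submit/report",
        "github_url": (
            "https://raw.githubusercontent.com/pathoplexus/ena-submission/main/"
            f"{folder}/approved_ena_submission_list.json"
        ),
        # The creation scripts are passed `--test` only on the dev instance.
        "test_flag": "" if submit_to_prod else "--test",
    }


endpoints = resolve_ena_endpoints({})  # no flag set, so the dev instance is used
print(endpoints["ena_submission_url"])
```

Calling it with an empty config resolves to the dev instance, which matches the default of `submit_to_ena_prod: False` in `config/defaults.yaml`.
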
### Run tests

```sh
@@ -160,19 +164,103 @@ python3 scripts/test_ena_submission.py

### Testing submission locally

ENA-submission currently is only triggered after manual approval.
1. Run loculus locally (need prepro, backend and ena-submission pod), e.g.

```sh
../deploy.py cluster --dev
../deploy.py helm --dev --enablePreprocessing
../generate_local_test_config.sh
cd ../backend
./start_dev.sh &
cd ../ena-submission
micromamba activate loculus-ena-submission
flyway -user=postgres -password=unsecure -url=jdbc:postgresql://127.0.0.1:5432/loculus -schemas=ena-submission -locations=filesystem:./flyway/sql migrate
```

2. Submit data to the backend as a test user (create group, submit and approve), e.g. using [example data](https://github.com/pathoplexus/example_data). (To test the full submission cycle with INSDC accessions, submit CCHF example data with only 2 segments.)

```sh
KEYCLOAK_TOKEN_URL="http://localhost:8083/realms/loculus/protocol/openid-connect/token"
KEYCLOAK_CLIENT_ID="backend-client"
usernameAndPassword="testuser"
jwt_keycloak=$(curl -X POST "$KEYCLOAK_TOKEN_URL" --fail-with-body -H 'Content-Type: application/x-www-form-urlencoded' -d "username=$usernameAndPassword&password=$usernameAndPassword&grant_type=password&client_id=$KEYCLOAK_CLIENT_ID")
JWT=$(echo "$jwt_keycloak" | jq -r '.access_token')
curl -X 'POST' 'http://localhost:8079/groups' \
-H 'accept: application/json' \
-H "Authorization: Bearer ${JWT}" \
-H 'Content-Type: application/json' \
-d '{
"groupName": "ENA submission Group",
"institution": "University of Loculus",
"address": {
"line1": "1234 Loculus Street",
"line2": "Apt 1",
"city": "Dortmund",
"state": "NRW",
"postalCode": "12345",
"country": "Germany"
},
"contactEmail": "something@loculus.org"}'
LOCULUS_ACCESSION=$(curl -X 'POST' \
'http://localhost:8079/cchf/submit?groupId=1&dataUseTermsType=OPEN' \
-H 'accept: application/json' \
-H "Authorization: Bearer ${JWT}" \
-H 'Content-Type: multipart/form-data' \
-F 'metadataFile=@../../example_data/example_files/cchfv_test_metadata.tsv;type=text/tab-separated-values' \
-F 'sequenceFile=@../../example_data/example_files/cchfv_test_sequences.fasta' | jq -r '.[0].accession')
curl -X 'POST' \
'http://localhost:8079/cchf/approve-processed-data' \
-H 'accept: application/json' \
-H "Authorization: Bearer ${JWT}" \
-H 'Content-Type: application/json' \
-d '{"scope": "ALL"}'
```

3. Get the list of sequences ready to submit to ENA. Locally this will write `results/ena_submission_list.json`.

```sh
snakemake get_ena_submission_list
```

4. Check the contents, then copy the file to `results/approved_ena_submission_list.json` and trigger ENA submission by adding entries to the submission table:

```sh
cp results/ena_submission_list.json results/approved_ena_submission_list.json
snakemake trigger_submission_to_ena_from_file
```

Alternatively you can upload data to the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json) and run `snakemake trigger_submission_to_ena`.

5. Create project, sample and assembly: `snakemake results/project_created results/sample_created results/assembly_created` - you will need the credentials of the ENA test submission account for this. (You can terminate the rules after you see assembly creation has been successful, or earlier if you see errors.)

6. Note that ENA's dev server does not always finish processing, so you might not receive a `gcaAccession` for your dev submissions. To test the full submission cycle on the ENA dev instance, it makes sense to manually set the `erz_accession` in the database to `ERZ24784470` (a known test submission with 2 chromosomes/segments; sadly ERZ accessions are private, so I do not have other test examples). You can do this after connecting via pgAdmin or via the CLI:

```sh
psql -h 127.0.0.1 -p 5432 -U postgres -d loculus
```

Then perform the update:

```sql
SET search_path TO "ena-submission";
UPDATE assembly_table
SET result = '{"erz_accession": "ERZ24784470", "segment_order": ["L", "M"]}'::jsonb
WHERE accession = '$LOCULUS_ACCESSION';
```

Exit `psql` using `\q`.

The `get_ena_submission_list` runs as a cron-job. It queries Loculus for new sequences to submit to ENA (these are sequences that are in state OPEN, were not submitted by the INSDC_INGEST_USER, do not include ena external_metadata fields and are not yet in the submission_table of the ena-submission schema). If it finds new sequences it sends a notification to slack with all sequences.
7. Upload to loculus (you can run the webpage locally if you would like to see this visually), `snakemake results/assembly_created results/uploaded_external_metadata`.

It is then the reviewer's turn to review these sequences. [TODO: define review criteria] If these sequences meet our criteria they should be uploaded to [pathoplexus/ena-submission](https://github.com/pathoplexus/ena-submission/blob/main/approved/approved_ena_submission_list.json) (currently we read data from the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json) - but this will be changed to the `approved` folder in production). The `trigger_submission_to_ena` rule is constantly checking this folder for new sequences and adding them to the submission_table if they are not already there. Note we cannot yet handle revisions so these should not be added to the approved list [TODO: do not allow submission of revised sequences in `trigger_submission_to_ena`]- revisions will still have to be performed manually.
If you experience issues you can look at the database locally using pgAdmin. On local instances the password is `unsecure`.
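
The manual `assembly_table` update in step 6 stores a JSON object with an `erz_accession` and a `segment_order`. A minimal sketch of producing that value programmatically is shown here; `build_test_assembly_result` is a hypothetical helper, not something the repository provides, and the sorting only reflects that `get_segment_order` in this commit returns a sorted list.

```python
import json


def build_test_assembly_result(erz_accession: str, segments: list[str]) -> str:
    """Serialize the `result` value for a manually patched assembly_table row.

    Hypothetical helper for local testing only.
    """
    return json.dumps({"erz_accession": erz_accession, "segment_order": sorted(segments)})


# Known test submission from the README with two segments.
result = build_test_assembly_result("ERZ24784470", ["M", "L"])
print(result)
```

The printed string can be pasted into the `SET result = '...'::jsonb` clause of the update statement.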

If you would like to test `trigger_submission_to_ena` while running locally you can also use the `trigger_submission_to_ena_from_file` rule, this will read in data from `results/approved_ena_submission_list.json` (see the test folder for an example). You can also upload data to the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json) - note that if you add fake data with a non-existent group-id the project creation will fail, additionally the `upload_to_loculus` rule will fail if these sequences do not actually exist in your loculus instance.
### Testing submission on a preview instance

All other rules query the `submission_table` for projects/samples and assemblies to submit. Once successful they add accessions to the `results` column in dictionary format. Finally, once the entire process has succeeded the new external metadata will be uploaded to Loculus.
1. Upload data to the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json) - note that if you add fake data with a non-existent group-id the project creation will fail, additionally the `upload_to_loculus` rule will fail if these sequences do not actually exist in your loculus instance.

Note that ENA's dev server does not always finish processing and you might not receive a gcaAccession for your dev submissions. If you would like to test the full submission cycle on the ENA dev instance it makes sense to manually alter the gcaAccession in the database using `ERZ24784470`. You can connect to a preview instance via port forwarding to these changes on local database tool such as pgAdmin:
2. Connect to the database of the preview instance via port forwarding using a database tool such as pgAdmin:

1. Apply the preview `~/.kube/config`
2. Find the database POD using `kubectl get pods -A | grep database`
3. Connect via port-forwarding `kubectl port-forward $POD -n $NAMESPACE 5432:5432`
4. If necessary find password using `kubectl get secret`
- Apply the preview `~/.kube/config`
- Find the database POD using `kubectl get pods -A | grep database`
- Connect via port-forwarding `kubectl port-forward $POD -n $NAMESPACE 5432:5432`
- If necessary find password using `kubectl get secret`
29 changes: 27 additions & 2 deletions ena-submission/Snakefile
@@ -13,12 +13,31 @@ for key, value in defaults.items():
    if not key in config:
        config[key] = value

LOG_LEVEL = config.get("log_level", "INFO")
SUBMIT_TO_ENA_PROD = config.get("submit_to_ena_prod", False)
SUBMIT_TO_ENA_DEV = not SUBMIT_TO_ENA_PROD

if SUBMIT_TO_ENA_DEV:
    print("Submitting to ENA dev environment")
    config["ena_submission_url"] = "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit"
    config["github_url"] = (
        "https://raw.githubusercontent.com/pathoplexus/ena-submission/main/test/approved_ena_submission_list.json"
    )
    config["ena_reports_service_url"] = "https://wwwdev.ebi.ac.uk/ena/submit/report"

if SUBMIT_TO_ENA_PROD:
    print("WARNING: Submitting to ENA production")
    config["ena_submission_url"] = "https://www.ebi.ac.uk/ena/submit/drop-box/submit"
    config["github_url"] = (
        "https://raw.githubusercontent.com/pathoplexus/ena-submission/main/approved/approved_ena_submission_list.json"
    )
    config["ena_reports_service_url"] = "https://www.ebi.ac.uk/ena/submit/report"


Path("results").mkdir(parents=True, exist_ok=True)
with open("results/config.yaml", "w") as f:
    f.write(yaml.dump(config))

LOG_LEVEL = config.get("log_level", "INFO")


rule all:
input:
@@ -88,11 +107,13 @@ rule create_project:
        project_created=touch("results/project_created"),
    params:
        log_level=LOG_LEVEL,
        test_flag="--test" if SUBMIT_TO_ENA_DEV else "",
    shell:
        """
        python {input.script} \
            --config-file {input.config} \
            --log-level {params.log_level} \
            {params.test_flag}
        """


@@ -104,11 +125,13 @@ rule create_sample:
        sample_created=touch("results/sample_created"),
    params:
        log_level=LOG_LEVEL,
        test_flag="--test" if SUBMIT_TO_ENA_DEV else "",
    shell:
        """
        python {input.script} \
            --config-file {input.config} \
            --log-level {params.log_level} \
            {params.test_flag}
        """


@@ -120,11 +143,13 @@ rule create_assembly:
        sample_created=touch("results/assembly_created"),
    params:
        log_level=LOG_LEVEL,
        test_flag="--test" if SUBMIT_TO_ENA_DEV else "",
    shell:
        """
        python {input.script} \
            --config-file {input.config} \
            --log-level {params.log_level} \
            {params.test_flag}
        """


4 changes: 1 addition & 3 deletions ena-submission/config/defaults.yaml
@@ -6,9 +6,7 @@ db_name: Loculus
unique_project_suffix: Loculus
ena_submission_username: fake-user
ena_submission_password: fake-password
ena_submission_url: https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit # TODO(https://github.com/loculus-project/loculus/issues/2425): update in production
github_url: https://raw.githubusercontent.com/pathoplexus/ena-submission/main/test/approved_ena_submission_list.json # TODO(https://github.com/loculus-project/loculus/issues/2425): update in production
ena_reports_service_url: https://wwwdev.ebi.ac.uk/ena/submit/report # TODO(https://github.com/loculus-project/loculus/issues/2425): update in production
submit_to_ena_prod: False # TODO(https://github.com/loculus-project/loculus/issues/2425): update in production
#ena_checklist: ERC000033 - do not use until all fields are mapped to ENA accepted options
metadata_mapping:
  'subject exposure':
73 changes: 57 additions & 16 deletions ena-submission/scripts/create_assembly.py
@@ -79,16 +79,17 @@ def create_chromosome_list_object(

    entries: list[AssemblyChromosomeListFileObject] = []

    if len(unaligned_sequences.keys()) > 1:
        for segment_name, item in unaligned_sequences.items():
            if item:  # Only list sequenced segments
                entry = AssemblyChromosomeListFileObject(
                    object_name=f"{seq_key["accession"]}.{seq_key["version"]}_{segment_name}",
                    chromosome_name=segment_name,
                    chromosome_type=chromosome_type,
                )
                entries.append(entry)
    else:
    segment_order = get_segment_order(unaligned_sequences)

    for segment_name in segment_order:
        if segment_name != "main":
            entry = AssemblyChromosomeListFileObject(
                object_name=f"{seq_key["accession"]}.{seq_key["version"]}_{segment_name}",
                chromosome_name=segment_name,
                chromosome_type=chromosome_type,
            )
            entries.append(entry)
            continue
        entry = AssemblyChromosomeListFileObject(
            object_name=f"{seq_key["accession"]}.{seq_key["version"]}",
            chromosome_name="main",
@@ -99,6 +100,17 @@ def create_chromosome_list_object(
    return AssemblyChromosomeListFile(chromosomes=entries)


def get_segment_order(unaligned_sequences) -> list[str]:
    segment_order = []
    if len(unaligned_sequences.keys()) > 1:
        for segment_name, item in unaligned_sequences.items():
            if item:  # Only list sequenced segments
                segment_order.append(segment_name)
    else:
        segment_order.append("main")
    return sorted(segment_order)
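
To make the behaviour of `get_segment_order` concrete, here is a small self-contained sketch. It repeats the function from this commit (with a type hint added for illustration) and calls it on made-up inputs; the segment names and sequences are toy data, not taken from the repository.

```python
from typing import Optional


def get_segment_order(unaligned_sequences: dict[str, Optional[str]]) -> list[str]:
    """Standalone copy of the function added in this commit, for illustration."""
    segment_order = []
    if len(unaligned_sequences.keys()) > 1:
        for segment_name, item in unaligned_sequences.items():
            if item:  # Only list sequenced segments
                segment_order.append(segment_name)
    else:
        # Single-segment organisms are always reported as "main".
        segment_order.append("main")
    return sorted(segment_order)


# Segmented organism where the S segment was not sequenced.
print(get_segment_order({"S": None, "M": "ACGT", "L": "GGCC"}))
# Single-segment organism.
print(get_segment_order({"main": "ACGT"}))
```

Because the result is sorted, the order stored in `segment_order` does not depend on the key order of the input dictionary.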


def create_manifest_object(
    config: Config,
    sample_table_entry: dict[str, str],
@@ -108,6 +120,17 @@ def create_manifest_object(
    group_key: dict[str, str],
    test=False,
) -> AssemblyManifest:
    """
    Create an AssemblyManifest object for an entry in the assembly table using:
    - the corresponding ena_sample_accession and bioproject_accession
    - the organism metadata from the config file
    - sequencing metadata from the corresponding submission table entry
    - unaligned nucleotide sequences from the corresponding submission table entry,
    these are used to create chromosome files and fasta files which are passed to the manifest.
    If test=True add a timestamp to the alias suffix to allow for multiple submissions of the same
    manifest for testing.
    """
    sample_accession = sample_table_entry["result"]["ena_sample_accession"]
    study_accession = project_table_entry["result"]["bioproject_accession"]

@@ -264,13 +287,18 @@ def submission_table_update(db_config: SimpleConnectionPool):
        raise RuntimeError(error_msg)


def assembly_table_create(db_config: SimpleConnectionPool, config: Config, retry_number: int = 3):
def assembly_table_create(
    db_config: SimpleConnectionPool, config: Config, retry_number: int = 3, test: bool = False
):
    """
    1. Find all entries in assembly_table in state READY
    2. Create temporary files: chromosome_list_file, fasta_file, manifest_file
    3. Update assembly_table to state SUBMITTING (only proceed if update succeeds)
    4. If (create_ena_assembly succeeds): update state to SUBMITTED with results
    3. Else update state to HAS_ERRORS with error messages
    If test=True: add a timestamp to the alias suffix to allow for multiple submissions of the same
    manifest for testing AND use the test ENA webin-cli endpoint for submission.
    """
    ena_config = get_ena_config(
        config.ena_submission_username,
@@ -321,7 +349,7 @@ def assembly_table_create(db_config: SimpleConnectionPool, config: Config, retry
            sample_data_in_submission_table[0],
            seq_key,
            group_key,
            test=True,  # TODO(https://github.com/loculus-project/loculus/issues/2425): remove in production
            test,
        )
        manifest_file = create_manifest(manifest_object)

@@ -340,10 +368,14 @@ def assembly_table_create(db_config: SimpleConnectionPool, config: Config, retry
            )
            continue
        logger.info(f"Starting assembly creation for accession {row["accession"]}")
        segment_order = get_segment_order(
            sample_data_in_submission_table[0]["unaligned_nucleotide_sequences"]
        )
        assembly_creation_results: CreationResults = create_ena_assembly(
            ena_config, manifest_file, center_name=center_name
            ena_config, manifest_file, center_name=center_name, test=test
        )
        if assembly_creation_results.results:
            assembly_creation_results.results["segment_order"] = segment_order
            update_values = {
                "status": Status.WAITING,
                "result": json.dumps(assembly_creation_results.results),
@@ -416,7 +448,10 @@ def assembly_table_update(
        logger.debug("Checking state in ENA")
        for row in waiting:
            seq_key = {"accession": row["accession"], "version": row["version"]}
            check_results: CreationResults = check_ena(ena_config, row["result"]["erz_accession"])
            segment_order = row["result"]["segment_order"]
            check_results: CreationResults = check_ena(
                ena_config, row["result"]["erz_accession"], segment_order
            )
            _last_ena_check = time
            if not check_results.results:
                continue
@@ -502,7 +537,13 @@ def assembly_table_handle_errors(
    required=True,
    type=click.Path(exists=True),
)
def create_assembly(log_level, config_file):
@click.option(
    "--test",
    is_flag=True,
    default=False,
    help="Allow multiple submissions of the same project for testing AND use the webin-cli test endpoint",
)
def create_assembly(log_level, config_file, test=False):
    logger.setLevel(log_level)
    logging.getLogger("requests").setLevel(logging.INFO)

@@ -523,7 +564,7 @@ def create_assembly(log_level, config_file):
        submission_table_start(db_config)
        submission_table_update(db_config)

        assembly_table_create(db_config, config, retry_number=3)
        assembly_table_create(db_config, config, retry_number=3, test=test)
        assembly_table_update(db_config, config)
        assembly_table_handle_errors(db_config, config, slack_config)
        time.sleep(2)