Merge pull request #11: Sync vendored repo, update usage docs
victorlin authored Oct 17, 2023
2 parents b0ee149 + 3e17382 commit 7bd1b05
Showing 18 changed files with 141 additions and 26 deletions.
12 changes: 3 additions & 9 deletions README.md
@@ -29,16 +29,10 @@ As of mid 2023 there are around ~11k genomes and the full GenBank file is ~150Mb

### `ingest/vendored`

This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies of ingest scripts in `ingest/vendored`, from [nextstrain/ingest](https://github.com/nextstrain/ingest). To pull new changes from the central ingest repository, run:
This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies of ingest scripts in [`ingest/vendored`](./ingest/vendored), from [nextstrain/ingest](https://github.com/nextstrain/ingest). To pull new changes from the central ingest repository, run:

```sh
git subrepo pull ingest/vendored
```

Changes should not be pushed using `git subrepo push`.

1. For pathogen-specific changes, make them in this repository via a pull request.
2. For pathogen-agnostic changes, make them on [nextstrain/ingest](https://github.com/nextstrain/ingest) via pull request there, then use `git subrepo pull` to add those changes to this repository.
See [ingest/vendored/README.md](./ingest/vendored/README.md#vendoring) for instructions on how to update
the vendored scripts.

## Phylo

3 changes: 3 additions & 0 deletions ingest/vendored/.cramrc
@@ -0,0 +1,3 @@
[cram]
shell = /bin/bash
indent = 2
16 changes: 13 additions & 3 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -1,13 +1,23 @@
name: CI

on:
  - push
  - pull_request
  - workflow_dispatch
  push:
    branches:
      - main
  pull_request:
  workflow_dispatch:

jobs:
  shellcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: nextstrain/.github/actions/shellcheck@master

  cram:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install cram
      - run: cram tests/
4 changes: 2 additions & 2 deletions ingest/vendored/.gitrepo
@@ -6,7 +6,7 @@
[subrepo]
remote = https://github.com/nextstrain/ingest
branch = main
commit = d141c04ac38796cd26366207f43454f75b3d638b
parent = 1f38f623d493bbafc25a4ab60226a4bd59ef8f6d
commit = 7617c39fae05e5882c5e6c065c5b47d500c998af
parent = b0ee1497a4d1f471ec87529fd7913fb96fbdf032
method = merge
cmdver = 0.4.6
43 changes: 42 additions & 1 deletion ingest/vendored/README.md
@@ -25,6 +25,31 @@ Any future updates of ingest scripts can be pulled in with:
git subrepo pull ingest/vendored
```

If you run into merge conflicts and would like to pull in a fresh copy of the
latest ingest scripts, pull with the `--force` flag:

```
git subrepo pull ingest/vendored --force
```

> **Warning**
> Beware of rebasing/dropping the parent commit of a `git subrepo` update
`git subrepo` relies on metadata in the `ingest/vendored/.gitrepo` file,
which includes the hash for the parent commit in the pathogen repos.
If this hash no longer exists in the commit history, there will be errors when
running future `git subrepo pull` commands.

If you run into an error similar to the following:
```
$ git subrepo pull ingest/vendored
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''
```
Check the parent commit hash in the `ingest/vendored/.gitrepo` file and make
sure the commit exists in the commit history. Update to the appropriate parent
commit hash if needed.
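
The check described above can be sketched in Python. This is a minimal illustration, not part of the vendored tooling: the helper names `read_subrepo_parent` and `commit_in_history` are hypothetical, and "exists in the commit history" is interpreted here as "is an ancestor of `HEAD`", which is an assumption.

```python
import re
import subprocess


def read_subrepo_parent(gitrepo_text: str) -> str:
    """Extract the `parent` commit hash from the text of a .gitrepo file."""
    match = re.search(
        r"^[ \t]*parent[ \t]*=[ \t]*([0-9a-f]{7,40})[ \t]*$",
        gitrepo_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'parent = <hash>' line found")
    return match.group(1)


def commit_in_history(commit_hash: str, repo_dir: str = ".") -> bool:
    """Return True if `commit_hash` is an ancestor of HEAD in `repo_dir`.

    Assumption: "in the commit history" means reachable from HEAD.
    `git merge-base --is-ancestor` exits with 0 when that holds.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit_hash, "HEAD"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Sample text shaped like ingest/vendored/.gitrepo (values taken from this commit).
sample = """[subrepo]
\tremote = https://github.com/nextstrain/ingest
\tbranch = main
\tparent = b0ee1497a4d1f471ec87529fd7913fb96fbdf032
"""
print(read_subrepo_parent(sample))
```

In a real checkout, one would pass the contents of `ingest/vendored/.gitrepo` to `read_subrepo_parent` and then call `commit_in_history` on the result; a `False` return suggests the parent hash needs updating.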

## History

Much of this tooling originated in
@@ -72,7 +97,9 @@ Scripts for supporting ingest workflow automation that don’t really belong in
NCBI interaction scripts that are useful for fetching public metadata and sequences.

- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/) or [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.
Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.

Historically, some pathogen repos used the undocumented NCBI Virus API through [fetch-from-ncbi-virus](https://github.com/nextstrain/ingest/blob/c97df238518171c2b1574bec0349a55855d1e7a7/fetch-from-ncbi-virus) to fetch data. However, we've opted to drop the NCBI Virus scripts due to https://github.com/nextstrain/ingest/issues/18.

Potential Nextstrain CLI scripts

@@ -94,3 +121,17 @@ Potential augur curate scripts
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
- [transform-strain-names](transform-strain-names) - Ordered search for strain names across several fields.
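
To illustrate the `location` pattern `"<country_value>[:<region>][, <locality>]"` mentioned in the list above, here is a rough Python sketch. It is not the logic of `transform-genbank-location` itself: the regular expression, the `split_location` helper, and the output key names are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a country, then an optional ":<region>",
# then an optional ", <locality>".
LOCATION_PATTERN = re.compile(
    r"^(?P<country>[^:,]+)"
    r"(?::\s*(?P<region>[^,]+))?"
    r"(?:,\s*(?P<locality>.+))?$"
)


def split_location(location: str) -> dict:
    """Split a GenBank-style location string into its three parts.

    Missing parts (and unparseable input) come back as empty strings.
    """
    match = LOCATION_PATTERN.match(location.strip())
    if match is None:
        return {"country": "", "region": "", "locality": ""}
    return {name: (value or "").strip() for name, value in match.groupdict().items()}


print(split_location("USA: Washington, Seattle"))
print(split_location("Kenya"))
```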

## Software requirements

Some scripts may require Bash ≥4. If you are running these scripts on macOS, the built-in Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash), which is more up to date.

## Testing

Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

For more locally testable scripts, Cram-style functional tests live in `tests` and are run as part of CI. To run these locally:

1. Download Cram: `pip install cram`
2. Run the tests: `cram tests/`
2 changes: 1 addition & 1 deletion ingest/vendored/cloudfront-invalidate
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
# Originally from @tsibley's gist: https://gist.github.com/tsibley/a66262d341dedbea39b02f27e2837ea8
set -euo pipefail

2 changes: 1 addition & 1 deletion ingest/vendored/download-from-s3
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

bin="$(dirname "$0")"
2 changes: 1 addition & 1 deletion ingest/vendored/notify-on-diff
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash

set -euo pipefail

2 changes: 1 addition & 1 deletion ingest/vendored/notify-on-job-fail
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/notify-on-job-start
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/notify-on-record-change
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/notify-slack
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/s3-object-exists
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

url="${1#s3://}"
@@ -0,0 +1,17 @@
Look for strain name in "strain" or a list of backup fields.

If strain entry exists, do not do anything.

  $ echo '{"strain": "i/am/a/strain", "strain_s": "other"}' \
  >   | $TESTDIR/../../transform-strain-names \
  >       --strain-regex '^.+$' \
  >       --backup-fields strain_s accession
  {"strain":"i/am/a/strain","strain_s":"other"}

If strain entry does not exist, search the backup fields

  $ echo '{"strain_s": "other"}' \
  >   | $TESTDIR/../../transform-strain-names \
  >       --strain-regex '^.+$' \
  >       --backup-fields accession strain_s
  {"strain_s":"other","strain":"other"}
50 changes: 50 additions & 0 deletions ingest/vendored/transform-strain-names
@@ -0,0 +1,50 @@
#!/usr/bin/env python3
"""
Verifies strain name pattern in the 'strain' field of the NDJSON record from
stdin. Adds a 'strain' field to the record if it does not already exist.
Outputs the modified records to stdout.
"""
import argparse
import json
import re
from sys import stderr, stdin, stdout


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("--strain-regex", default="^.+$",
        help="Regex pattern for strain names. " +
             "Strain names that do not match the pattern will be dropped.")
    parser.add_argument("--backup-fields", nargs="*",
        help="List of backup fields to use as strain name if the value in 'strain' " +
             "does not match the strain regex pattern. " +
             "If multiple fields are provided, will use the first field that has a non-empty string.")

    args = parser.parse_args()

    strain_name_pattern = re.compile(args.strain_regex)

    for index, record in enumerate(stdin):
        record = json.loads(record)

        # Verify strain name matches the strain regex pattern
        if strain_name_pattern.match(record.get('strain', '')) is None:
            # Default to empty string if not matching pattern
            record['strain'] = ''
            # Use non-empty value of backup fields if provided
            if args.backup_fields:
                for field in args.backup_fields:
                    if record.get(field):
                        record['strain'] = str(record[field])
                        break

        if record['strain'] == '':
            print(f"WARNING: Record number {index} has an empty string as the strain name.", file=stderr)


        json.dump(record, stdout, allow_nan=False, indent=None, separators=',:')
        print()
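
The fallback logic of the script above can also be exercised without stdin or the command line. The sketch below restates it as a pure function and replays the two records from the Cram test; the function name `resolve_strain` is hypothetical and does not exist in the vendored script.

```python
import json
import re


def resolve_strain(record, strain_regex="^.+$", backup_fields=()):
    """Return a copy of `record` whose 'strain' is validated or back-filled.

    Mirrors the script's behaviour: if 'strain' does not match the regex,
    it is reset to '' and then replaced by the first non-empty backup field.
    """
    result = dict(record)
    if re.match(strain_regex, result.get("strain", "")) is None:
        result["strain"] = ""
        for field in backup_fields:
            if result.get(field):
                result["strain"] = str(result[field])
                break
    return result


# Same inputs as the two Cram test cases.
kept = resolve_strain(
    {"strain": "i/am/a/strain", "strain_s": "other"},
    backup_fields=["strain_s", "accession"],
)
filled = resolve_strain(
    {"strain_s": "other"},
    backup_fields=["accession", "strain_s"],
)
print(json.dumps(kept, separators=(",", ":")))
print(json.dumps(filled, separators=(",", ":")))
```

Because `strain` is appended only when it is missing, it lands after `strain_s` in the second record, which matches the key order in the Cram test's expected output.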
2 changes: 1 addition & 1 deletion ingest/vendored/trigger
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

: "${PAT_GITHUB_DISPATCH:=}"
2 changes: 1 addition & 1 deletion ingest/vendored/trigger-on-new-data
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

: "${PAT_GITHUB_DISPATCH:?The PAT_GITHUB_DISPATCH environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/upload-to-s3
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euo pipefail

bin="$(dirname "$0")"