Skip to content

Commit

Permalink
Merge pull request #53 from nextstrain/update-curate
Browse files Browse the repository at this point in the history
ingest: use new augur curate commands
  • Loading branch information
joverlee521 authored Jul 15, 2024
2 parents 06e6f27 + 07886c9 commit 40cdf39
Show file tree
Hide file tree
Showing 16 changed files with 113 additions and 552 deletions.
2 changes: 2 additions & 0 deletions ingest/defaults/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,8 @@ curate:
# These date formats should use directives expected by datetime
# See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
expected_date_formats: ["%Y", "%Y-%m", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ"]
# The expected field that contains the GenBank geo_loc_name
genbank_location_field: location
titlecase:
# List of string fields to titlecase
fields: ["region", "country", "division", "location"]
Expand Down
15 changes: 8 additions & 7 deletions ingest/rules/curate.smk
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@ rule curate:
strain_backup_fields=config["curate"]["strain_backup_fields"],
date_fields=config["curate"]["date_fields"],
expected_date_formats=config["curate"]["expected_date_formats"],
genbank_location_field=config["curate"]["genbank_location_field"],
articles=config["curate"]["titlecase"]["articles"],
abbreviations=config["curate"]["titlecase"]["abbreviations"],
titlecase_fields=config["curate"]["titlecase"]["fields"],
Expand All @@ -85,30 +86,30 @@ rule curate:
shell:
"""
(cat {input.sequences_ndjson} \
| ./vendored/transform-field-names \
| augur curate rename \
--field-map {params.field_map} \
| augur curate normalize-strings \
| ./vendored/transform-strain-names \
| augur curate transform-strain-name \
--strain-regex {params.strain_regex} \
--backup-fields {params.strain_backup_fields} \
| augur curate format-dates \
--date-fields {params.date_fields} \
--expected-date-formats {params.expected_date_formats} \
| ./vendored/transform-genbank-location \
| augur curate parse-genbank-location \
--location-field {params.genbank_location_field} \
| augur curate titlecase \
--titlecase-fields {params.titlecase_fields} \
--articles {params.articles} \
--abbreviations {params.abbreviations} \
| ./vendored/transform-authors \
| augur curate abbreviate-authors \
--authors-field {params.authors_field} \
--default-value {params.authors_default_value} \
--abbr-authors-field {params.abbr_authors_field} \
| ./vendored/apply-geolocation-rules \
| augur curate apply-geolocation-rules \
--geolocation-rules {input.all_geolocation_rules} \
| ./vendored/merge-user-metadata \
| augur curate apply-record-annotations \
--annotations {input.annotations} \
--id-field {params.annotations_id} \
| augur curate passthru \
--output-metadata {output.metadata} \
--output-fasta {output.sequences} \
--output-id-field {params.id_field} \
Expand Down
3 changes: 0 additions & 3 deletions ingest/vendored/.cramrc

This file was deleted.

17 changes: 17 additions & 0 deletions ingest/vendored/.github/dependabot.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Dependabot configuration file
# <https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file>
#
# Each ecosystem is checked on a scheduled interval defined below. To trigger
# a check manually, go to
#
# https://github.com/nextstrain/ingest/network/updates
#
# and look for a "Check for updates" button. You may need to click around a
# bit first.
---
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "weekly"
10 changes: 1 addition & 9 deletions ingest/vendored/.github/workflows/ci.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -11,13 +11,5 @@ jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: nextstrain/.github/actions/shellcheck@master

cram:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
- run: pip install cram
- run: cram tests/
14 changes: 14 additions & 0 deletions ingest/vendored/.github/workflows/pre-commit.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
name: pre-commit

on:
- push

jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- uses: pre-commit/action@v3.0.1
4 changes: 2 additions & 2 deletions ingest/vendored/.gitrepo
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
[subrepo]
remote = https://github.com/nextstrain/ingest
branch = main
commit = 7617c39fae05e5882c5e6c065c5b47d500c998af
parent = d49aeeb0a95f6336a516b3f25fa83a67d16bf56e
commit = 258ab8ce898a88089bc88caee336f8d683a0e79a
parent = b1e777a8e79748bb69e24feeb0a1bd8a29db4969
method = merge
cmdver = 0.4.6
40 changes: 40 additions & 0 deletions ingest/vendored/.pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
default_language_version:
python: python3
repos:
- repo: https://github.com/pre-commit/sync-pre-commit-deps
rev: v0.0.1
hooks:
- id: sync-pre-commit-deps
- repo: https://github.com/shellcheck-py/shellcheck-py
rev: v0.10.0.1
hooks:
- id: shellcheck
- repo: https://github.com/rhysd/actionlint
rev: v1.6.27
hooks:
- id: actionlint
entry: env SHELLCHECK_OPTS='--exclude=SC2027' actionlint
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
hooks:
- id: trailing-whitespace
- id: check-ast
- id: check-case-conflict
- id: check-docstring-first
- id: check-json
- id: check-executables-have-shebangs
- id: check-merge-conflict
- id: check-shebang-scripts-are-executable
- id: check-symlinks
- id: check-toml
- id: check-yaml
- id: destroyed-symlinks
- id: detect-private-key
- id: end-of-file-fixer
- id: fix-byte-order-marker
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.4.6
hooks:
# Run the linter.
- id: ruff
47 changes: 29 additions & 18 deletions ingest/vendored/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using `git subtree`.
each pathogen repo using `git subrepo`.

Some tools may only live here temporarily before finding a permanent home in
`augur curate` or Nextstrain CLI. Others may happily live out their days here.
Expand All @@ -12,6 +12,9 @@ Some tools may only live here temporarily before finding a permanent home in
Nextstrain maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts.
(See discussion on this decision in https://github.com/nextstrain/ingest/issues/3)

For a list of Nextstrain repos that are currently using this method, use [this
GitHub code search](https://github.com/search?type=code&q=org%3Anextstrain+subrepo+%22remote+%3D+https%3A%2F%2Fgithub.com%2Fnextstrain%2Fingest%22).

If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation).
Then add the latest ingest scripts to the pathogen repo by running:

Expand Down Expand Up @@ -54,14 +57,14 @@ commit hash if needed.

Much of this tooling originated in
[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru
[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
It subsequently proliferated from [monkeypox][] to other pathogen repos
([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
thru copying. To [counter that
[mpox's ingest/](https://github.com/nextstrain/mpox/tree/@/ingest/). It
subsequently proliferated from [mpox][] to other pathogen repos ([rsv][],
[zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily thru
copying. To [counter that
proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
this repo was made.

[monkeypox]: https://github.com/nextstrain/monkeypox
[mpox]: https://github.com/nextstrain/mpox
[rsv]: https://github.com/nextstrain/rsv
[zika]: https://github.com/nextstrain/zika/pull/24
[dengue]: https://github.com/nextstrain/dengue/pull/10
Expand Down Expand Up @@ -114,15 +117,6 @@ Potential Nextstrain CLI scripts
- [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.

Potential augur curate scripts

- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records
- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
- [transform-strain-names](transform-strain-names) - Ordered search for strain names across several fields.

## Software requirements

Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash) which is more up to date.
Expand All @@ -131,7 +125,24 @@ Some scripts may require Bash ≥4. If you are running these scripts on macOS, t

Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

For more locally testable scripts, Cram-style functional tests live in `tests` and are run as part of CI. To run these locally,
## Working on this repo

This repo is configured to use [pre-commit](https://pre-commit.com),
to help automatically catch common coding errors and syntax issues
with changes before they are committed to the repo.

If you will be writing new code or otherwise working within this repo,
please do the following to get started:

1. [install `pre-commit`](https://pre-commit.com/#install) by running
either `python -m pip install pre-commit` or `brew install
pre-commit`, depending on your preferred package management
solution
2. install the local git hooks by running `pre-commit install` from
the root of the repo
3. when problems are detected, correct them in your local working tree
before committing them.

1. Download Cram: `pip install cram`
2. Run the tests: `cram tests/`
Note that these pre-commit checks are also run in a GitHub Action when
changes are pushed to GitHub, so correcting issues locally will
prevent extra cycles of correction.
Loading

0 comments on commit 40cdf39

Please sign in to comment.