git subrepo clone (merge) https://github.com/nextstrain/ingest ingest/vendored

subrepo:
  subdir:   "ingest/vendored"
  merged:   "5d90818"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "5d90818"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Showing 22 changed files with 1,057 additions and 0 deletions.
@@ -0,0 +1,16 @@
### Description of proposed changes

<!-- What is the goal of this pull request? What does this pull request change? -->

### Related issue(s)

<!-- Link any related issues here. -->

### Checklist

<!-- Make sure checks are successful at the bottom of the PR. -->

- [ ] Checks pass
- [ ] If adding a script, add an entry for it in the README.

<!-- 🙌 Thank you for contributing to Nextstrain! ✨ -->
@@ -0,0 +1,13 @@
name: CI

on:
  - push
  - pull_request
  - workflow_dispatch

jobs:
  shellcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: nextstrain/.github/actions/shellcheck@master
@@ -0,0 +1,12 @@
; DO NOT EDIT (unless you know what you are doing)
;
; This subdirectory is a git "subrepo", and this file is maintained by the
; git-subrepo command. See https://github.com/ingydotnet/git-subrepo#readme
;
[subrepo]
	remote = https://github.com/nextstrain/ingest
	branch = main
	commit = 5d908187d13cbce27253af5163b4803d6bac03a6
	parent = 1cdd1970924fed3fca67db43e8cb8f3de010ec37
	method = merge
	cmdver = 0.4.6
@@ -0,0 +1,6 @@
# Use of this file requires Shellcheck v0.7.0 or newer.
#
# SC2064 - We intentionally want variables to expand immediately within traps
# so the trap cannot fail due to variable interpolation later.
#
disable=SC2064
@@ -0,0 +1,91 @@
# ingest

Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using `git subrepo`.

Some tools may only live here temporarily before finding a permanent home in
`augur curate` or Nextstrain CLI. Others may happily live out their days here.

## Vendoring

Nextstrain-maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts.
(See discussion on this decision in https://github.com/nextstrain/ingest/issues/3.)

If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation).
Then add the latest ingest scripts to the pathogen repo by running:

```
git subrepo clone https://github.com/nextstrain/ingest ingest/vendored
```

Any future updates of ingest scripts can be pulled in with:

```
git subrepo pull ingest/vendored
```

## History

Much of this tooling originated in
[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru
[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
It subsequently proliferated from [monkeypox][] to other pathogen repos
([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
thru copying. To [counter that
proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
this repo was made.

[monkeypox]: https://github.com/nextstrain/monkeypox
[rsv]: https://github.com/nextstrain/rsv
[zika]: https://github.com/nextstrain/zika/pull/24
[dengue]: https://github.com/nextstrain/dengue/pull/10
[hepatitisB]: https://github.com/nextstrain/hepatitisB
[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov

## Elsewhere

The creation of this repo, in both the abstract and concrete, and the general
approach to "ingest" has been discussed in various internal places, including:

- https://github.com/nextstrain/private/issues/59
- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i)
- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079)
- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit)
- _…many others_

## Scripts

Scripts for supporting ingest workflow automation that don't really belong in any of our existing tools.

- [notify-on-diff](notify-on-diff) - Send Slack message with diff of a local file and an S3 object
- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about a failed workflow job on GitHub Actions and/or AWS Batch
- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about a workflow job on GitHub Actions and/or AWS Batch
- [notify-on-record-change](notify-on-record-change) - Send Slack message with details about line count changes for a file compared to an S3 object's metadata `recordcount`.
  If the S3 object's metadata does not have `recordcount`, then it will attempt to download the S3 object to count lines locally, which only supports `xz`-compressed S3 objects.
- [notify-slack](notify-slack) - Send message or file to Slack
- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using `repository_dispatch` events.
- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`.
  A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.

Potential Nextstrain CLI scripts:

- [sha256sum](sha256sum) - Used to check if files are identical in the upload-to-s3 and download-from-s3 scripts.
- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
  This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
- [upload-to-s3](upload-to-s3) - Upload file to AWS S3 bucket with compression based on file extension in S3 URL.
  Skips upload if the local file's hash is identical to the S3 object's metadata `sha256sum`.
  Adds the following user-defined metadata to the uploaded S3 object:
  - `sha256sum` - hash of the file generated by [sha256sum](sha256sum)
  - `recordcount` - the line count of the file
- [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
  Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.

Potential augur curate scripts:

- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records
- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
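As an illustration of the geolocation-rules format that `apply-geolocation-rules` consumes (raw and annotated values separated by a tab, each formatted as `region/country/division/location`, with `'*'` wildcards and `'#'` comments), here is a minimal Python sketch of how such a TSV maps to a nested lookup table. The rule lines are invented for the example; it mirrors, but does not import, the vendored script.

```python
# Illustrative only: a minimal parser for the rule format described above,
# mirroring (but not importing) the vendored apply-geolocation-rules script.
# The rule lines below are invented for the example.
from collections import defaultdict

RULES_TSV = (
    "# raw<tab>annotated; '*' acts as a wildcard\n"
    "North America/USA/WA/seattle\tNorth America/USA/Washington/Seattle\n"
)

def parse_rules(text):
    """Build the nested region -> country -> division -> location mapping."""
    rules = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith('#'):
            continue  # skip blank lines and comments
        raw, annot = (tuple(part.split('/')) for part in line.split('\t'))
        rules[raw[0]][raw[1]][raw[2]][raw[3]] = annot
    return rules

rules = parse_rules(RULES_TSV)
print(rules['North America']['USA']['WA']['seattle'])
# ('North America', 'USA', 'Washington', 'Seattle')
```

The nested-dict shape matches the structure documented in the script's `load_geolocation_rules` docstring, which is what makes level-by-level traversal with per-level wildcard fallback possible.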
@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""
Applies user curated geolocation rules to the geolocation fields in the NDJSON
records from stdin. The modified records are output to stdout. This does not do
any additional transformations on top of the user curations.
"""
import argparse
import json
from collections import defaultdict
from sys import exit, stderr, stdin, stdout


class CyclicGeolocationRulesError(Exception):
    pass


def load_geolocation_rules(geolocation_rules_file):
    """
    Loads the geolocation rules from the provided *geolocation_rules_file*.
    Returns the rules as a dict:
        {
            regions: {
                countries: {
                    divisions: {
                        locations: corrected_geolocations_tuple
                    }
                }
            }
        }
    """
    geolocation_rules = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    with open(geolocation_rules_file, 'r') as rules_fh:
        for line in rules_fh:
            # Ignore blank lines and comments
            if line.strip() == "" or line.lstrip()[0] == '#':
                continue

            row = line.strip('\n').split('\t')
            # Skip lines that cannot be split into raw and annotated geolocations
            if len(row) != 2:
                print(
                    f"WARNING: Could not decode geolocation rule {line!r}.",
                    "Please make sure rules are formatted as",
                    "'region/country/division/location<tab>region/country/division/location'.",
                    file=stderr)
                continue

            # Remove trailing comments
            row[-1] = row[-1].partition('#')[0].rstrip()
            raw, annot = tuple(row[0].split('/')), tuple(row[1].split('/'))

            # Skip lines where raw or annotated geolocations cannot be split into 4 fields
            if len(raw) != 4:
                print(
                    f"WARNING: Could not decode the raw geolocation {row[0]!r}.",
                    "Please make sure it is formatted as 'region/country/division/location'.",
                    file=stderr
                )
                continue

            if len(annot) != 4:
                print(
                    f"WARNING: Could not decode the annotated geolocation {row[1]!r}.",
                    "Please make sure it is formatted as 'region/country/division/location'.",
                    file=stderr
                )
                continue

            geolocation_rules[raw[0]][raw[1]][raw[2]][raw[3]] = annot

    return geolocation_rules


def get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal=None):
    """
    Gets the annotated geolocation for the *raw_geolocation* in the provided
    *geolocation_rules*.

    Recursively traverses the *geolocation_rules* until we get the annotated
    geolocation, which must be a tuple. Returns `None` if there are no
    applicable rules for the provided *raw_geolocation*.

    Rules are applied in the order of region, country, division, location.
    First checks the provided raw values for geolocation fields, then if there
    are no matches, tries to use general rules marked with '*'.
    """
    # Always instantiate the rule traversal as an empty list if not provided,
    # e.g. the first call of this recursive function
    if rule_traversal is None:
        rule_traversal = []

    current_rules = geolocation_rules
    # Traverse the geolocation rules using the rule_traversal values
    for field_value in rule_traversal:
        current_rules = current_rules.get(field_value)
        # If we hit `None`, then we know there are no matching rules, so stop the rule traversal
        if current_rules is None:
            break

    # We've found the tuple of the annotated geolocation
    if isinstance(current_rules, tuple):
        return current_rules

    # We've reached the next level of geolocation rules,
    # so try to traverse the rules with the next target in raw_geolocation
    if isinstance(current_rules, dict):
        next_traversal_target = raw_geolocation[len(rule_traversal)]
        rule_traversal.append(next_traversal_target)
        return get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal)

    # We did not find any matching rule for the last traversal target
    if current_rules is None:
        # If we've used all general rules and we still haven't found a match,
        # then there are no applicable rules for this geolocation
        if all(value == '*' for value in rule_traversal):
            return None

        # If we failed to find a matching rule with a general rule as the last
        # traversal target, then delete all trailing '*'s to reset rule_traversal
        # to end with the last index that is currently NOT a '*'
        #   [A, *, B, *] => [A, *, B]
        #   [A, B, *, *] => [A, B]
        #   [A, *, *, *] => [A]
        if rule_traversal[-1] == '*':
            # Find the index of the first of the consecutive '*' from the
            # end of the rule_traversal
            #   [A, *, B, *] => first_consecutive_general_rule_index = 3
            #   [A, B, *, *] => first_consecutive_general_rule_index = 2
            #   [A, *, *, *] => first_consecutive_general_rule_index = 1
            for index, field_value in reversed(list(enumerate(rule_traversal))):
                if field_value == '*':
                    first_consecutive_general_rule_index = index
                else:
                    break

            rule_traversal = rule_traversal[:first_consecutive_general_rule_index]

        # Set the final value to '*' in hopes that by moving to a general rule,
        # we can find a matching rule.
        rule_traversal[-1] = '*'

        return get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal)


def transform_geolocations(geolocation_rules, geolocation):
    """
    Transform the provided *geolocation* by looking it up in the provided
    *geolocation_rules*.

    This will use all rules that apply to the geolocation and rules will
    be applied in the order of region, country, division, location.

    Returns the original geolocation if no geolocation rules apply.

    Raises a `CyclicGeolocationRulesError` if more than 1000 rules have
    been applied to the raw geolocation.
    """
    transformed_values = geolocation
    rules_applied = 0
    continue_to_apply = True

    while continue_to_apply:
        annotated_values = get_annotated_geolocation(geolocation_rules, transformed_values)

        # Stop applying rules if no annotated values were found
        if annotated_values is None:
            continue_to_apply = False
        else:
            rules_applied += 1

            if rules_applied > 1000:
                raise CyclicGeolocationRulesError(
                    f"ERROR: More than 1000 geolocation rules applied on the same entry {geolocation!r}."
                )

            # Create a new list of values for comparison to previous values
            new_values = list(transformed_values)
            for index, value in enumerate(annotated_values):
                # Keep original value if annotated value is '*'
                if value != '*':
                    new_values[index] = value

            # Stop applying rules if this rule did not change the values,
            # since this means we've reached rules with '*' that no longer change values
            if new_values == transformed_values:
                continue_to_apply = False

            transformed_values = new_values

    return transformed_values


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("--region-field", default="region",
        help="Field that contains regions in NDJSON records.")
    parser.add_argument("--country-field", default="country",
        help="Field that contains countries in NDJSON records.")
    parser.add_argument("--division-field", default="division",
        help="Field that contains divisions in NDJSON records.")
    parser.add_argument("--location-field", default="location",
        help="Field that contains locations in NDJSON records.")
    parser.add_argument("--geolocation-rules", metavar="TSV", required=True,
        help="TSV file of geolocation rules with the format: " +
             "'<raw_geolocation><tab><annotated_geolocation>' where the raw and annotated geolocations " +
             "are formatted as '<region>/<country>/<division>/<location>'. " +
             "If creating a general rule, then the raw field value can be substituted with '*'. " +
             "Lines starting with '#' will be ignored as comments. " +
             "Trailing '#' will be ignored as comments.")

    args = parser.parse_args()

    location_fields = [args.region_field, args.country_field, args.division_field, args.location_field]

    geolocation_rules = load_geolocation_rules(args.geolocation_rules)

    for record in stdin:
        record = json.loads(record)

        try:
            annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
        except CyclicGeolocationRulesError as e:
            print(e, file=stderr)
            exit(1)

        for index, field in enumerate(location_fields):
            record[field] = annotated_values[index]

        json.dump(record, stdout, allow_nan=False, indent=None, separators=(',', ':'))
        print()
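To make the wildcard fallback order concrete, here is a small self-contained sketch, not the vendored implementation, that models the precedence `get_annotated_geolocation` implements via recursion. It stores rules in a flat dict keyed by 4-tuples (rather than the script's nested dicts) and tries raw values before `'*'` at each level, with the region field most significant. The example rules and names are invented.

```python
# Illustrative sketch only (not the vendored implementation): models the
# precedence that get_annotated_geolocation implements via recursion.
# Rules are a flat dict keyed by (region, country, division, location)
# tuples, where any component may be the general-rule wildcard '*'.
from itertools import product

def lookup(rules, raw):
    """Return the annotated 4-tuple for *raw*, or None if no rule applies."""
    # Try every raw-vs-'*' combination, preferring raw values; the region
    # field varies slowest, matching the recursive traversal's preference
    # for exact matches at the outermost levels first.
    for mask in product((False, True), repeat=4):
        key = tuple('*' if star else value for star, value in zip(mask, raw))
        if key in rules:
            return rules[key]
    return None

# Hypothetical example rules
rules = {
    ('North America', 'USA', 'WA', 'seattle'): ('North America', 'USA', 'Washington', 'Seattle'),
    ('*', 'USA', '*', '*'): ('North America', 'USA', '*', '*'),
}

print(lookup(rules, ('North America', 'USA', 'WA', 'seattle')))
# exact rule wins: ('North America', 'USA', 'Washington', 'Seattle')
print(lookup(rules, ('', 'USA', 'TX', 'houston')))
# only the general rule matches: ('North America', 'USA', '*', '*')
```

In the real script, `'*'` values in the annotated tuple leave the record's original field untouched, and `transform_geolocations` re-applies rules until they stop changing the values (capped at 1000 applications to guard against cyclic rules).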