git subrepo clone (merge) https://github.com/nextstrain/ingest ingest…

…/vendored

subrepo:
  subdir:   "ingest/vendored"
  merged:   "5d90818"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "5d90818"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
nextstrain · Aug 8, 2023 · 7c9ba86
1 parent 1cdd197
commit 7c9ba86
Showing 22 changed files with 1,057 additions and 0 deletions.
diff --git a/ingest/vendored/.github/pull_request_template.md b/ingest/vendored/.github/pull_request_template.md
@@ -0,0 +1,16 @@
+### Description of proposed changes
+
+<!-- What is the goal of this pull request? What does this pull request change? -->
+
+### Related issue(s)
+
+<!-- Link any related issues here. -->
+
+### Checklist
+
+<!-- Make sure checks are successful at the bottom of the PR. -->
+
+- [ ] Checks pass
+- [ ] If adding a script, add an entry for it in the README.
+
+<!-- 🙌 Thank you for contributing to Nextstrain! ✨ -->
diff --git a/ingest/vendored/.github/workflows/ci.yaml b/ingest/vendored/.github/workflows/ci.yaml
@@ -0,0 +1,13 @@
+name: CI
+
+on:
+  - push
+  - pull_request
+  - workflow_dispatch
+
+jobs:
+  shellcheck:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: nextstrain/.github/actions/shellcheck@master
diff --git a/ingest/vendored/.gitrepo b/ingest/vendored/.gitrepo
@@ -0,0 +1,12 @@
+; DO NOT EDIT (unless you know what you are doing)
+;
+; This subdirectory is a git "subrepo", and this file is maintained by the
+; git-subrepo command. See https://github.com/ingydotnet/git-subrepo#readme
+;
+[subrepo]
+	remote = https://github.com/nextstrain/ingest
+	branch = main
+	commit = 5d908187d13cbce27253af5163b4803d6bac03a6
+	parent = 1cdd1970924fed3fca67db43e8cb8f3de010ec37
+	method = merge
+	cmdver = 0.4.6
diff --git a/ingest/vendored/.shellcheckrc b/ingest/vendored/.shellcheckrc
@@ -0,0 +1,6 @@
+# Use of this file requires Shellcheck v0.7.0 or newer.
+#
+# SC2064 - We intentionally want variables to expand immediately within traps
+#          so the trap can not fail due to variable interpolation later.
+#
+disable=SC2064
diff --git a/ingest/vendored/README.md b/ingest/vendored/README.md
@@ -0,0 +1,91 @@
+# ingest
+
+Shared internal tooling for pathogen data ingest.  Used by our individual
+pathogen repos which produce Nextstrain builds.  Expected to be vendored by
+each pathogen repo using `git subrepo`.
+
+Some tools may only live here temporarily before finding a permanent home in
+`augur curate` or Nextstrain CLI.  Others may happily live out their days here.
+
+## Vendoring
+
+Nextstrain-maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts.
+(See the discussion of this decision in https://github.com/nextstrain/ingest/issues/3.)
+
+If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation).
+Then add the latest ingest scripts to the pathogen repo by running:
+
+```
+git subrepo clone https://github.com/nextstrain/ingest ingest/vendored
+```
+
+Any future updates of ingest scripts can be pulled in with:
+
+```
+git subrepo pull ingest/vendored
+```
+
+## History
+
+Much of this tooling originated in
+[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru
+[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
+It subsequently proliferated from [monkeypox][] to other pathogen repos
+([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
+thru copying.  To [counter that
+proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
+this repo was made.
+
+[monkeypox]: https://github.com/nextstrain/monkeypox
+[rsv]: https://github.com/nextstrain/rsv
+[zika]: https://github.com/nextstrain/zika/pull/24
+[dengue]: https://github.com/nextstrain/dengue/pull/10
+[hepatitisB]: https://github.com/nextstrain/hepatitisB
+[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov
+
+## Elsewhere
+
+The creation of this repo, in both the abstract and concrete, and the general
+approach to "ingest" has been discussed in various internal places, including:
+
+- https://github.com/nextstrain/private/issues/59
+- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i)
+- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079)
+- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit)
+- _…many others_
+
+## Scripts
+
+Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.
+
+- [notify-on-diff](notify-on-diff) - Send Slack message with diff of a local file and an S3 object
+- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
+- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
+- [notify-on-record-change](notify-on-record-change) - Send Slack message with details about line count changes for a file compared to an S3 object's metadata `recordcount`.
+  If the S3 object's metadata does not have `recordcount`, the script attempts to download the S3 object and count lines locally, which is only supported for `xz` compressed S3 objects.
+- [notify-slack](notify-slack) - Send message or file to Slack
+- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
+- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
+- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`.
+  A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.
+
+Potential Nextstrain CLI scripts
+
+- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
+- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
+  This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
+- [upload-to-s3](upload-to-s3) - Upload file to AWS S3 bucket with compression based on file extension in S3 URL.
+  Skips upload if the local file's hash is identical to the S3 object's metadata `sha256sum`.
+  Adds the following user defined metadata to uploaded S3 object:
+    - `sha256sum` - hash of the file generated by [sha256sum](sha256sum)
+    - `recordcount` - the line count of the file
+- [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
+  Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.
+
+Potential augur curate scripts
+
+- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records
+- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
+- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
+- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
+- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
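The `transform-genbank-location` script itself is not part of this excerpt, so the following is only a minimal sketch of how the documented pattern `"<country_value>[:<region>][, <locality>]"` could be split into its parts. The function name `parse_genbank_location`, the regular expression, and the output keys are assumptions made for illustration, not the vendored script's actual implementation.

```python
import re

# Hypothetical pattern for "<country_value>[:<region>][, <locality>]".
# The region and locality parts are each optional.
LOCATION_PATTERN = re.compile(
    r"^(?P<country>[^:,]+)"
    r"(?::\s*(?P<region>[^,]+))?"
    r"(?:,\s*(?P<locality>.+))?$"
)


def parse_genbank_location(location):
    """Split a GenBank-style location string into country, region, locality.

    Missing parts are returned as empty strings. Illustrative helper only.
    """
    match = LOCATION_PATTERN.match(location.strip())
    if match is None:
        return {"country": "", "region": "", "locality": ""}
    return {name: (value or "").strip() for name, value in match.groupdict().items()}


print(parse_genbank_location("USA: Washington, Seattle"))
```

This sketch treats the first comma as the start of the locality, so country names that themselves contain a comma would need extra handling.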
diff --git a/ingest/vendored/apply-geolocation-rules b/ingest/vendored/apply-geolocation-rules
@@ -0,0 +1,234 @@
+#!/usr/bin/env python3
+"""
+Applies user curated geolocation rules to the geolocation fields in the NDJSON
+records from stdin. The modified records are output to stdout. This does not do
+any additional transformations on top of the user curations.
+"""
+import argparse
+import json
+from collections import defaultdict
+from sys import exit, stderr, stdin, stdout
+
+
+class CyclicGeolocationRulesError(Exception):
+    pass
+
+
+def load_geolocation_rules(geolocation_rules_file):
+    """
+    Loads the geolocation rules from the provided *geolocation_rules_file*.
+    Returns the rules as a dict:
+    {
+        regions: {
+            countries: {
+                divisions: {
+                    locations: corrected_geolocations_tuple
+                }
+            }
+        }
+    }
+    """
+    geolocation_rules = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
+    with open(geolocation_rules_file, 'r') as rules_fh:
+        for line in rules_fh:
+            # Ignore blank lines and comment lines
+            if line.strip() == "" or line.lstrip()[0] == '#':
+                continue
+
+            row = line.strip('\n').split('\t')
+            # Skip lines that cannot be split into raw and annotated geolocations
+            if len(row) != 2:
+                print(
+                    f"WARNING: Could not decode geolocation rule {line!r}.",
+                    "Please make sure rules are formatted as",
+                    "'region/country/division/location<tab>region/country/division/location'.",
+                    file=stderr)
+                continue
+
+            # remove trailing comments
+            row[-1] = row[-1].partition('#')[0].rstrip()
+            raw, annot = tuple(row[0].split('/')), tuple(row[1].split('/'))
+
+            # Skip lines where raw or annotated geolocations cannot be split into 4 fields
+            if len(raw) != 4:
+                print(
+                    f"WARNING: Could not decode the raw geolocation {row[0]!r}.",
+                    "Please make sure it is formatted as 'region/country/division/location'.",
+                    file=stderr
+                )
+                continue
+
+            if len(annot) != 4:
+                print(
+                    f"WARNING: Could not decode the annotated geolocation {row[1]!r}.",
+                    "Please make sure it is formatted as 'region/country/division/location'.",
+                    file=stderr
+                )
+                continue
+
+
+            geolocation_rules[raw[0]][raw[1]][raw[2]][raw[3]] = annot
+
+    return geolocation_rules
+
+
+def get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal = None):
+    """
+    Gets the annotated geolocation for the *raw_geolocation* in the provided
+    *geolocation_rules*.
+
+    Recursively traverses the *geolocation_rules* until we get the annotated
+    geolocation, which must be a Tuple. Returns `None` if there are no
+    applicable rules for the provided *raw_geolocation*.
+
+    Rules are applied in the order of region, country, division, location.
+    First checks the provided raw values for geolocation fields, then if there
+    are no matches, tries to use general rules marked with '*'.
+    """
+    # Always instantiate the rule traversal as an empty list if not provided,
+    # e.g. the first call of this recursive function
+    if rule_traversal is None:
+        rule_traversal = []
+
+    current_rules = geolocation_rules
+    # Traverse the geolocation rules using the rule_traversal values
+    for field_value in rule_traversal:
+        current_rules = current_rules.get(field_value)
+        # If we hit `None`, then we know there are no matching rules, so stop the rule traversal
+        if current_rules is None:
+            break
+
+    # We've found the tuple of the annotated geolocation
+    if isinstance(current_rules, tuple):
+        return current_rules
+
+    # We've reached the next level of geolocation rules,
+    # so try to traverse the rules with the next target in raw_geolocation
+    if isinstance(current_rules, dict):
+        next_traversal_target = raw_geolocation[len(rule_traversal)]
+        rule_traversal.append(next_traversal_target)
+        return get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal)
+
+    # We did not find any matching rule for the last traversal target
+    if current_rules is None:
+        # If we've used all general rules and we still haven't found a match,
+        # then there are no applicable rules for this geolocation
+        if all(value == '*' for value in rule_traversal):
+            return None
+
+        # If we failed to find matching rule with a general rule as the last
+        # traversal target, then delete all trailing '*'s to reset rule_traversal
+        # to end with the last index that is currently NOT a '*'
+        # [A, *, B, *] => [A, *, B]
+        # [A, B, *, *] => [A, B]
+        # [A, *, *, *] => [A]
+        if rule_traversal[-1] == '*':
+            # Find the index of the first of the consecutive '*' from the
+            # end of the rule_traversal
+            # [A, *, B, *] => first_consecutive_general_rule_index = 3
+            # [A, B, *, *] => first_consecutive_general_rule_index = 2
+            # [A, *, *, *] => first_consecutive_general_rule_index = 1
+            for index, field_value in reversed(list(enumerate(rule_traversal))):
+                if field_value == '*':
+                    first_consecutive_general_rule_index = index
+                else:
+                    break
+
+            rule_traversal = rule_traversal[:first_consecutive_general_rule_index]
+
+        # Set the final value to '*' in hopes that by moving to a general rule,
+        # we can find a matching rule.
+        rule_traversal[-1] = '*'
+
+        return get_annotated_geolocation(geolocation_rules, raw_geolocation, rule_traversal)
+
+
+def transform_geolocations(geolocation_rules, geolocation):
+    """
+    Transform the provided *geolocation* by looking it up in the provided
+    *geolocation_rules*.
+
+    This will use all rules that apply to the geolocation and rules will
+    be applied in the order of region, country, division, location.
+
+    Returns the original geolocation if no geolocation rules apply.
+
+    Raises a `CyclicGeolocationRulesError` if more than 1000 rules have
+    been applied to the raw geolocation.
+    """
+    transformed_values = geolocation
+    rules_applied = 0
+    continue_to_apply = True
+
+    while continue_to_apply:
+        annotated_values = get_annotated_geolocation(geolocation_rules, transformed_values)
+
+        # Stop applying rules if no annotated values were found
+        if annotated_values is None:
+            continue_to_apply = False
+        else:
+            rules_applied += 1
+
+            if rules_applied > 1000:
+                raise CyclicGeolocationRulesError(
+                    f"ERROR: More than 1000 geolocation rules applied on the same entry {geolocation!r}."
+                )
+
+            # Create a new list of values for comparison to previous values
+            new_values = list(transformed_values)
+            for index, value in enumerate(annotated_values):
+                # Keep original value if annotated value is '*'
+                if value != '*':
+                    new_values[index] = value
+
+            # Stop applying rules if this rule did not change the values,
+            # since this means we've reached rules with '*' that no longer change values
+            if new_values == transformed_values:
+                continue_to_apply = False
+
+            transformed_values = new_values
+
+    return transformed_values
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+    parser.add_argument("--region-field", default="region",
+        help="Field that contains regions in NDJSON records.")
+    parser.add_argument("--country-field", default="country",
+        help="Field that contains countries in NDJSON records.")
+    parser.add_argument("--division-field", default="division",
+        help="Field that contains divisions in NDJSON records.")
+    parser.add_argument("--location-field", default="location",
+        help="Field that contains location in NDJSON records.")
+    parser.add_argument("--geolocation-rules", metavar="TSV", required=True,
+        help="TSV file of geolocation rules with the format: " +
+             "'<raw_geolocation><tab><annotated_geolocation>' where the raw and annotated geolocations " +
+             "are formatted as '<region>/<country>/<division>/<location>'. " +
+             "If creating a general rule, then the raw field value can be substituted with '*'. " +
+             "Lines starting with '#' will be ignored as comments. " +
+             "Trailing '#' will be ignored as comments.")
+
+    args = parser.parse_args()
+
+    location_fields = [args.region_field, args.country_field, args.division_field, args.location_field]
+
+    geolocation_rules = load_geolocation_rules(args.geolocation_rules)
+
+    for record in stdin:
+        record = json.loads(record)
+
+        try:
+            annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
+        except CyclicGeolocationRulesError as e:
+            print(e, file=stderr)
+            exit(1)
+
+        for index, field in enumerate(location_fields):
+            record[field] = annotated_values[index]
+
+        json.dump(record, stdout, allow_nan=False, indent=None, separators=(',', ':'))
+        print()
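The matching order described in the docstrings above (exact field values first, then general `*` rules, working through region, country, division, and location) can be illustrated with a simplified sketch. It assumes a flat dictionary keyed by 4-tuples instead of the script's nested dictionaries. The names `parse_rules`, `lookup`, and `annotate` are hypothetical, and the sketch applies at most one rule, whereas the script keeps applying rules until the values stop changing.

```python
from itertools import product


def parse_rules(lines):
    """Parse 'raw<tab>annotated' rule lines into a {raw_tuple: annotated_tuple} dict."""
    rules = {}
    for line in lines:
        # Skip blank lines and comment lines
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        raw, annotated = line.rstrip("\n").split("\t")
        rules[tuple(raw.split("/"))] = tuple(annotated.split("/"))
    return rules


def lookup(rules, raw):
    """Return the first matching annotation, trying exact values before '*'."""
    # For each field, try the raw value first and the wildcard second.
    # product() varies the last field fastest, so earlier fields stay
    # exact for as long as possible.
    options = [(value, "*") for value in raw]
    for candidate in product(*options):
        if candidate in rules:
            return rules[candidate]
    return None


def annotate(rules, raw):
    """Apply one matching rule, keeping the raw value wherever the rule has '*'."""
    annotated = lookup(rules, raw)
    if annotated is None:
        return tuple(raw)
    return tuple(r if a == "*" else a for r, a in zip(raw, annotated))


RULES = parse_rules([
    "# example rules (made up for illustration)",
    "North America/USA/WA/*\tNorth America/USA/Washington/*",
    "*/*/*/Seatle\t*/*/*/Seattle",
])

print(annotate(RULES, ("North America", "USA", "WA", "King County")))
print(annotate(RULES, ("North America", "USA", "Oregon", "Seatle")))
```

The flat dictionary trades the script's level-by-level traversal for up to 16 tuple lookups per record, which keeps the fallback order easy to see in a small example.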