Merge pull request #1509 from nextstrain/format-dates-mask-empty
augur curate format-dates: mask empty fields
joverlee521 authored Jul 1, 2024
2 parents 403f46f + c2f29cf commit 4264701
Showing 4 changed files with 44 additions and 2 deletions.
6 changes: 6 additions & 0 deletions CHANGES.md
@@ -2,6 +2,10 @@

## __NEXT__

### Major changes

* curate format-dates: Raises an error if provided date field does not exist in records. [#1509][] (@joverlee521)

### Features

* Added a new sub-command `augur curate apply-geolocation-rules` to apply user curated geolocation rules to the geolocation fields in a metadata file. Previously, this was available as a script within the nextstrain/ingest repo. [#1491][] (@victorlin)
@@ -15,13 +19,15 @@

* filter: Improve speed of checking duplicates in metadata, especially for large files. [#1466][] (@victorlin)
* curate: Stop adding double quotes to the metadata TSV output when field values have internal quotes. [#1493][] (@joverlee521)
* curate format-dates: Mask empty date values as `XXXX-XX-XX` to represent unknown dates. [#1509][] (@joverlee521)

[#1466]: https://github.com/nextstrain/augur/pull/1466
[#1490]: https://github.com/nextstrain/augur/pull/1490
[#1491]: https://github.com/nextstrain/augur/pull/1491
[#1493]: https://github.com/nextstrain/augur/pull/1493
[#1495]: https://github.com/nextstrain/augur/pull/1495
[#1501]: https://github.com/nextstrain/augur/pull/1501
[#1509]: https://github.com/nextstrain/augur/pull/1509

## 24.4.0 (15 May 2024)

12 changes: 10 additions & 2 deletions augur/curate/format_dates.py
@@ -117,6 +117,10 @@ def format_date(date_string, expected_formats):
>>> expected_formats = ['%Y', '%Y-%m', '%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ', '%m-%d']
>>> format_date("", expected_formats)
'XXXX-XX-XX'
>>> format_date(" ", expected_formats)
'XXXX-XX-XX'
>>> format_date("01-01", expected_formats)
'XXXX-XX-XX'
>>> format_date("2020", expected_formats)
@@ -133,6 +137,10 @@ def format_date(date_string, expected_formats):
     '2020-01-15'
     """

+    date_string = date_string.strip()
+    if date_string == '':
+        return 'XXXX-XX-XX'
+
     for date_format in expected_formats:
         try:
             parsed_date = datetime.strptime(date_string, date_format)
@@ -180,8 +188,8 @@ def run(args, records):
         for field in args.date_fields:
             date_string = record.get(field)

-            if not date_string:
-                continue
+            if date_string is None:
+                raise AugurError(f"Expected date field {field!r} not found in record {record_id!r}.")

             formatted_date_string = format_date(date_string, args.expected_date_formats)
             if formatted_date_string is None:
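
The early return added to `format_date` above can be illustrated with a minimal, standalone sketch. This is not augur's implementation: `format_date_sketch` and `MASKED_DATE` are hypothetical names, and the sketch only handles full-date formats, so it deliberately leaves out the partial-date masking that the real function performs for inputs such as a year alone.

```python
from datetime import datetime

# Placeholder used for a fully unknown date, as in the diff above.
MASKED_DATE = "XXXX-XX-XX"


def format_date_sketch(date_string, expected_formats):
    """Hypothetical, simplified stand-in for format_date (not augur's code).

    Assumes every entry in expected_formats describes a complete date.
    """
    # Whitespace-only values are treated the same as empty values.
    date_string = date_string.strip()
    if date_string == "":
        return MASKED_DATE

    for date_format in expected_formats:
        try:
            parsed_date = datetime.strptime(date_string, date_format)
        except ValueError:
            # This format did not match; try the next one.
            continue
        return parsed_date.strftime("%Y-%m-%d")

    # No format matched; the caller decides how to report the failure.
    return None


full_date_formats = ["%Y-%m-%d", "%d/%m/%Y"]
print(format_date_sketch("", full_date_formats))
print(format_date_sketch("   ", full_date_formats))
print(format_date_sketch("15/01/2020", full_date_formats))
```

Stripping before the empty check is what lets a single comparison cover both the empty and the whitespace-only case.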
@@ -0,0 +1,11 @@
Setup

  $ export AUGUR="${AUGUR:-$TESTDIR/../../../../../bin/augur}"

Providing a date field that does not exist in the record should result in an error.

  $ echo '{"record": 1, "date": "2024-01-01"}' \
  > | ${AUGUR} curate format-dates \
  > --date-fields "bad-date-field"
  ERROR: Expected date field 'bad-date-field' not found in record 0.
  [2]
17 changes: 17 additions & 0 deletions tests/functional/curate/cram/format-dates/empty-date-field.t
@@ -0,0 +1,17 @@
Setup

  $ export AUGUR="${AUGUR:-$TESTDIR/../../../../../bin/augur}"

Test empty date value, which should be returned as a fully masked date.

  $ echo '{"record": 1, "date": ""}' \
  > | ${AUGUR} curate format-dates \
  > --date-fields "date"
  {"record": 1, "date": "XXXX-XX-XX"}

Test whitespace only date value, which should be returned as a fully masked date.

  $ echo '{"record": 1, "date": " "}' \
  > | ${AUGUR} curate format-dates \
  > --date-fields "date"
  {"record": 1, "date": "XXXX-XX-XX"}
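
Taken together, the two test files distinguish a missing field (an error) from an empty one (masked). A rough sketch of that distinction follows, assuming records are plain dicts. `MissingDateFieldError` and `mask_empty_dates` are hypothetical names used only for illustration; augur itself raises `AugurError` from `run`, and this sketch skips the actual date formatting.

```python
class MissingDateFieldError(Exception):
    """Hypothetical stand-in for AugurError, used only in this sketch."""


def mask_empty_dates(records, date_fields):
    """Yield copies of records with empty date values fully masked.

    Raises MissingDateFieldError when a requested field is absent.
    """
    for index, record in enumerate(records):
        record = record.copy()
        for field in date_fields:
            value = record.get(field)
            # dict.get returns None for an absent key (and for an explicit
            # None value), which is distinct from an empty string.
            if value is None:
                raise MissingDateFieldError(
                    f"Expected date field {field!r} not found in record {index!r}."
                )
            if value.strip() == "":
                record[field] = "XXXX-XX-XX"
        yield record


masked = list(mask_empty_dates([{"record": 1, "date": ""}], ["date"]))
print(masked)

try:
    list(mask_empty_dates([{"record": 1, "date": "2024-01-01"}], ["bad-date-field"]))
except MissingDateFieldError as error:
    error_message = str(error)
    print(f"ERROR: {error_message}")
```

Checking `is None` instead of truthiness is the design point: the earlier `if not date_string` test could not tell an absent field from an empty value and silently skipped both.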
