filter: Expand comment and testing for --query #1282

Merged 2 commits on Aug 21, 2023
15 changes: 14 additions & 1 deletion augur/filter/include_exclude_rules.py
@@ -187,7 +187,20 @@ def filter_by_query(metadata, query) -> FilterFunctionReturn:
     # Create a copy to prevent modification of the original DataFrame.
     metadata_copy = metadata.copy()

-    # Try converting all columns to numeric.
+    # Support numeric comparisons in query strings.
+    #
+    # The built-in data type inference when loading the DataFrame does not
+    # support nullable numeric columns, so numeric comparisons won't work on
+    # those columns. pd.to_numeric does proper conversion on those columns, and
+    # will not make any changes to columns with other values.
+    #
+    # TODO: Parse the query string and apply conversion only to columns used for
+    # numeric comparison. Pandas does not expose the API used to parse the query
+    # string internally, so this is non-trivial and requires a bit of
+    # reverse-engineering. Commit 2ead5b3e3306dc1100b49eb774287496018122d9 got
+    # halfway there but had issues so it was reverted.
+    #
+    # TODO: Try boolean conversion?
     for column in metadata_copy.columns:
         metadata_copy[column] = pd.to_numeric(metadata_copy[column], errors='ignore')
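
The column-wise conversion in this hunk can be illustrated with a minimal, self-contained sketch. It is not the augur implementation: the DataFrame contents mirror the test metadata, the helper name `to_numeric_where_possible` is hypothetical, and a `try`/`except` around `pd.to_numeric` stands in for `errors='ignore'`, which pandas 2.2 and later flag as deprecated.

```python
import pandas as pd

# Hypothetical metadata mirroring the test file. dtype=object mimics columns
# that were loaded as strings, with NaN marking the missing values of SEQ_4.
metadata = pd.DataFrame(
    {
        "coverage": ["0.94", "0.95", "0.96", float("nan")],
        "category": ["A", "B", "C", float("nan")],
    },
    index=["SEQ_1", "SEQ_2", "SEQ_3", "SEQ_4"],
    dtype=object,
)


def to_numeric_where_possible(df):
    """Return a copy of df with every fully numeric column converted.

    Hypothetical helper: columns that cannot be parsed as numbers are left
    unchanged, which is the behavior the diff relies on.
    """
    converted = df.copy()
    for column in converted.columns:
        try:
            converted[column] = pd.to_numeric(converted[column], errors="raise")
        except (ValueError, TypeError):
            # Non-numeric column (e.g. 'category'): keep the original values.
            pass
    return converted


converted = to_numeric_where_possible(metadata)

# 'coverage' is now a float column, so a numeric comparison works and the
# row with a missing value is simply not selected.
filtered = converted.query("coverage >= 0.95")
print(sorted(filtered.index))
```

Because the `category` column contains non-numeric strings, the conversion fails for it and the column is left as it was, which matches the comment's claim that columns with other values are not changed.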

21 changes: 16 additions & 5 deletions tests/functional/filter/cram/filter-query-numerical.t
@@ -5,11 +5,11 @@ Setup
 Create metadata file for testing.

   $ cat >metadata.tsv <<~~
-  > strain coverage
-  > SEQ_1 0.94
-  > SEQ_2 0.95
-  > SEQ_3 0.96
-  > SEQ_4
+  > strain coverage category
+  > SEQ_1 0.94 A
+  > SEQ_2 0.95 B
+  > SEQ_3 0.96 C
+  > SEQ_4
   > ~~

 The 'coverage' column should be query-able by numerical comparisons.
@@ -22,3 +22,14 @@ The 'coverage' column should be query-able by numerical comparisons.
   $ sort filtered_strains.txt
   SEQ_2
   SEQ_3
+
+The 'category' column will fail when used with a numerical comparison.
+
+  $ ${AUGUR} filter \
+  > --metadata metadata.tsv \
+  > --query "category >= 0.95" \
+  > --output-strains filtered_strains.txt
+  ERROR: Internal Pandas error when applying query:
+  '>=' not supported between instances of 'str' and 'float'
+  Ensure the syntax is valid per <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query>.
+  [2]
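
To see why the new test expects an error, the following sketch reproduces the comparison outside of augur. It is illustrative only: the `try_query` helper is hypothetical and augur's own error handling is not shown in this diff. The `category` column stays a string column after the numeric conversion attempt, so comparing it against a float is rejected by pandas, while a string equality check on the same column still works.

```python
import pandas as pd

# Hypothetical stand-in for the 'category' column of the test metadata.
metadata = pd.DataFrame(
    {"category": ["A", "B", "C", float("nan")]},
    index=["SEQ_1", "SEQ_2", "SEQ_3", "SEQ_4"],
    dtype=object,
)


def try_query(df, query):
    """Return (result, None) on success, or (None, message) if the query fails.

    Hypothetical helper. The except clause is deliberately broad because the
    exception type raised by pandas can depend on the query and the engine.
    """
    try:
        return df.query(query), None
    except Exception as error:
        return None, str(error)


# Numerical comparison against a string column: expected to fail.
numeric_result, numeric_error = try_query(metadata, "category >= 0.95")
print(numeric_error)

# String comparison against the same column: expected to succeed.
string_result, string_error = try_query(metadata, "category == 'B'")
print(list(string_result.index))
```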