filter: Remove attempt at extracting variables from `--query` #1278

victorlin · 2023-08-11T03:37:58Z

Description of proposed changes

The attempt worked under limited testing, but introduced a bug that can only be resolved with further complex Pandas query parsing. I don't think that road is worth pursuing at the moment, so it's better to drop the effort entirely.

This partially reverts commit 2ead5b3 and related changes.

I chose to keep the definition of PandasUndefinedVariableError at the top-level¹ because it keeps external references next to each other, and typing_extensions in the dependency list² because it is bound to be useful at some point, and updating dependencies on Augur's Bioconda recipe is a hassle.

¹ 602e3d5
² 2658659

Related issue(s)

Fixes #1277

Testing

Test added to show change in behavior
Checks pass

Checklist

Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

codecov · 2023-08-11T03:56:12Z

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.06% ⚠️

Comparison is base (1d92f6d) 69.80% compared to head (58e24a3) 69.74%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1278      +/-   ##
==========================================
- Coverage   69.80%   69.74%   -0.06%     
==========================================
  Files          67       67              
  Lines        7160     7146      -14     
  Branches     1742     1742              
==========================================
- Hits         4998     4984      -14     
  Misses       1855     1855              
  Partials      307      307

Files Changed	Coverage Δ
augur/filter/include_exclude_rules.py	`97.77% <100.00%> (-0.17%)`	⬇️


joverlee521 · 2023-08-11T18:31:56Z

augur/filter/include_exclude_rules.py

+    for column in metadata.columns:
        metadata[column] = pd.to_numeric(metadata[column], errors='ignore')


I believe this is converting to numeric in place? Could that potentially affect downstream filters?
I'm specifically thinking of a date column containing only year values, which would be converted to numeric and then cause errors in any date filtering that expects strings.

I think this errors only when using both --query and --exclude-ambiguous-dates-by. I wrote an example test for this which passes locally on master but fails when rebased onto this branch.

victorlin · 2023-08-12

Oh yes, thanks for catching this! I added your example test in b5a77e1 and addressed the issue in 9f9bc4f.
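
For readers following the thread, here is a minimal sketch of the copy-based approach discussed above. It is not the actual Augur code: the helper name `to_numeric_copy` and the sample columns are hypothetical, and it uses try/except rather than `errors='ignore'`, which recent pandas releases deprecate.

```python
import pandas as pd


def to_numeric_copy(metadata):
    """Return a copy of `metadata` with columns converted to numeric where possible.

    Hypothetical helper for illustration; the caller's DataFrame is not modified.
    """
    converted = metadata.copy()
    for column in converted.columns:
        try:
            converted[column] = pd.to_numeric(converted[column])
        except (ValueError, TypeError):
            # At least one value is non-numeric: keep the column as-is.
            pass
    return converted


# Made-up metadata in which the date column holds only year values.
metadata = pd.DataFrame({
    "strain": ["A", "B"],
    "date": ["2019", "2020"],
    "coverage": ["0.9", "0.5"],
})

# Evaluate the numeric query on the converted copy only.
query_view = to_numeric_copy(metadata)
selected = query_view.query("coverage >= 0.8").index

# The original date strings are untouched for any later date filtering.
kept = metadata.loc[selected]
```

Because only the copy is converted, a later filter that expects `date` to hold strings still sees strings in `metadata`.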


victorlin · 2023-08-14T13:38:18Z

I have confidence in these changes after @joverlee521's review, so merging without approval.

joverlee521

Post-merge approval 👍

I'm not a huge fan of the blanket conversion to numeric, but it should cover a majority of the use cases for augur filter. There are probably weird edge cases but they're probably unavoidable until we allow user defined types.

huddlej · 2023-08-15T20:41:08Z

> There are probably weird edge cases but they're probably unavoidable until we allow user defined types.

Some that immediately come to mind are numeric strain names or clade names. In a current project, I just needed to write a query like this in a standalone Python notebook, where my clade names are SHA256 hashes that can be fully numeric:

large_frequencies.query(
    "(future_timepoint == '2018-01-01') & (clade_membership in ['8000939', 'd9f0744']) & (horizon == 12)"
)

If I tried to do a query like that with this PR's augur filter implementation, I think it would still work since not all clade names are numeric and the type conversion to numeric should silently fail. It's not a query I'm likely to do with augur, though.

victorlin · 2023-08-16T15:23:21Z

@joverlee521

> I'm not a huge fan of the blanket conversion to numeric, but it should cover a majority of the use cases for augur filter. There are probably weird edge cases but they're probably unavoidable until we allow user defined types.

I agree the blanket conversion is not ideal. This was a problem introduced with extract_variables in #1268, which does blanket conversion on all columns in the query string, not just the ones used for numerical comparison. The change here extends the blanket conversion to metadata columns unused by the query string.

I can't think of any harm as long as it's scoped to usage in this function (by not modifying values in-place). There are 2 possible scenarios:

1. A column is expected to be numeric, but at least one value is non-numeric.
2. A column is expected to be non-numeric, but all values are numeric.

(1) will silently fail numeric conversion. If a numerical query is applied to that column, it will expose the internal error to the user (e.g. '>=' not supported between instances of 'str' and 'float').

(2) is already a problem with pandas's automatic type inference. This impacts queries such as the example by @huddlej above in the case that all clade names happen to be numeric in the chunk of metadata being processed. I think the proper solution is user-defined types.
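
The two scenarios can be reproduced with a small sketch. The helper `try_numeric` and the sample values are hypothetical and only approximate the blanket conversion described in this thread.

```python
import pandas as pd


def try_numeric(series):
    """Convert a Series to numeric if every value converts; otherwise return it unchanged."""
    try:
        return pd.to_numeric(series)
    except (ValueError, TypeError):
        return series


# Scenario 1: expected numeric, but one value is non-numeric.
# The conversion silently fails and the values stay strings.
mixed = try_numeric(pd.Series(["0.9", "N/A", "0.5"]))

# Scenario 2: expected non-numeric (e.g. clade names), but every value looks numeric.
# The conversion succeeds and the values become integers.
clades = try_numeric(pd.Series(["8000939", "1234567"]))

# A membership test against string clade names no longer matches anything.
matches_as_str = clades.isin(["8000939"]).any()
matches_as_int = clades.isin([8000939]).any()
```

In scenario 2 the string comparison finds no rows even though the original metadata contained that clade name, which is the kind of edge case that user-defined types would avoid.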

tsibley · 2023-08-23T21:54:55Z

> The attempt worked under limited testing, but introduced a bug that can only be resolved with further complex Pandas query parsing. I don't think that road is worth pursuing at the moment, so it's better to drop the effort entirely.

I think we could potentially sidestep further Pandas-specific query parsing and instead rely on the fact that Pandas' query grammar is a subset of Python's, and uses the ast stdlib under the hood. Since we only care about variable names (e.g. columns)—right?—I'd think we could parse the query with ast.parse() and then walk it to find the relevant ast.Name nodes.

tsibley · 2023-08-23T22:02:37Z

Concretely, something like:

import ast

columns_used = [
    node.id
    for node in ast.walk(ast.parse(query))
    if isinstance(node, ast.Name)
]
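
As a usage sketch of the idea above, wrapped in a hypothetical function `names_in_query` (not part of Augur or pandas) and applied to a query like the one earlier in this thread:

```python
import ast


def names_in_query(query):
    """Return the sorted, de-duplicated bare names that appear in a query expression."""
    return sorted({
        node.id
        for node in ast.walk(ast.parse(query))
        if isinstance(node, ast.Name)
    })


query = "(clade_membership in ['8000939', 'd9f0744']) & (horizon == 12)"
columns_used = names_in_query(query)
# columns_used == ['clade_membership', 'horizon']

# Attribute access contributes only the leading name.
date_columns = names_in_query("date.str.startswith('2020')")
# date_columns == ['date']
```

Two caveats under this approach: any bare name is collected, so a function name used in a call would be reported alongside column names, and pandas-only syntax such as backtick-quoted column names or `@variable` references is not valid Python, so `ast.parse()` raises `SyntaxError` for those queries.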

victorlin · 2023-08-24T22:54:09Z

@tsibley thanks, that's a very useful suggestion! Your snippet above works great and covers the syntax in #1277. I've integrated it into some other work which I hope to open a PR for soon.

victorlin self-assigned this Aug 11, 2023

victorlin marked this pull request as ready for review August 11, 2023 03:51

victorlin requested a review from a team August 11, 2023 03:51

joverlee521 reviewed Aug 11, 2023


victorlin and others added 5 commits August 11, 2023 23:46

Add test showing existing behavior

9b99dc6

Update changelog

7d6866b

cram: Add filter-query-and-exclude-ambiguous-dates-by test

1d853c8

Shows the interaction between the query and exclude-ambiguous-dates-by options. This test currently does not work as intended and will be fixed in the following commit.

Do type conversions on a copy of the metadata

58e24a3

This fixes the unexpected behavior described in the previous commit.

victorlin force-pushed the victorlin/revert-pandas-query-variable-extraction branch from 9f9bc4f to 58e24a3 August 12, 2023 03:47

victorlin merged commit 0a62483 into master Aug 14, 2023
26 checks passed

victorlin deleted the victorlin/revert-pandas-query-variable-extraction branch August 14, 2023 13:38

joverlee521 reviewed Aug 15, 2023


victorlin mentioned this pull request Aug 16, 2023

filter: Expand comment and testing for --query #1282

Merged

2 tasks

This was referenced Aug 25, 2023

Read a subset of metadata columns #1294

Merged

filter: Rewrite using SQLite3 #1242

Draft
