filter: Remove attempt at extracting variables from --query #1278

Merged
victorlin merged 5 commits into master from victorlin/revert-pandas-query-variable-extraction on Aug 14, 2023

Conversation

victorlin
Member

@victorlin victorlin commented Aug 11, 2023

Description of proposed changes

The attempt worked under limited testing, but introduced a bug that can only be resolved with further complex Pandas query parsing. I don't think that road is worth pursuing at the moment, so it's better to drop the effort entirely.

This partially reverts commit 2ead5b3 and related changes.

I chose to keep two things: the definition of PandasUndefinedVariableError at the top level¹, because it keeps external references next to each other, and typing_extensions in the dependency list², because it is bound to be useful at some point and updating dependencies in Augur's Bioconda recipe is a hassle.

¹ 602e3d5
² 2658659

Related issue(s)

Fixes #1277

Testing

  • Test added to show change in behavior
  • Checks pass

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

@victorlin victorlin self-assigned this Aug 11, 2023
@victorlin victorlin marked this pull request as ready for review August 11, 2023 03:51
@victorlin victorlin requested a review from a team August 11, 2023 03:51
@codecov

codecov bot commented Aug 11, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.06% ⚠️

Comparison is base (1d92f6d) 69.80% compared to head (58e24a3) 69.74%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1278      +/-   ##
==========================================
- Coverage   69.80%   69.74%   -0.06%     
==========================================
  Files          67       67              
  Lines        7160     7146      -14     
  Branches     1742     1742              
==========================================
- Hits         4998     4984      -14     
  Misses       1855     1855              
  Partials      307      307              
Files Changed Coverage Δ
augur/filter/include_exclude_rules.py 97.77% <100.00%> (-0.17%) ⬇️


Comment on lines 188 to 189
for column in metadata.columns:
    metadata[column] = pd.to_numeric(metadata[column], errors='ignore')
Contributor

I believe this converts to numeric in place. Could that affect downstream filters?
I'm specifically thinking of a date column that contains only year values (e.g. 2018), which would be converted to numeric and then cause errors in any date filters that expect strings.

Contributor

I think this only causes an error when both --query and --exclude-ambiguous-dates-by are used. I wrote an example test for this, which passes locally on master but fails when rebased onto this branch.

Member Author

Oh yes, thanks for catching this! I added your example test in b5a77e1 and addressed the problem in 9f9bc4f.
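
The failure mode discussed in this thread can be sketched without pandas. In the minimal sketch below, `to_numeric_or_keep` is a hypothetical stand-in that roughly mimics `pd.to_numeric(..., errors='ignore')` on a list of strings, the dict stands in for the metadata DataFrame, and `date_is_ambiguous` is a toy placeholder for a date filter that expects strings. None of these names come from Augur, and the copy-based variant only illustrates the general idea of not converting in place; it is not the content of the fix commit.

```python
def to_numeric_or_keep(values):
    """Return numbers if every value parses as one, else an unchanged copy.

    Hypothetical stand-in for pd.to_numeric(..., errors='ignore').
    """
    def parse(value):
        try:
            return int(value)
        except ValueError:
            return float(value)

    try:
        return [parse(value) for value in values]
    except ValueError:
        return list(values)


def date_is_ambiguous(date):
    """Toy date check that assumes `date` is a string such as '2018-01-XX'."""
    return len(date.split("-")) < 3 or "XX" in date


def passes_query_in_place(metadata, minimum):
    """Convert columns inside the caller's dict, then test coverage >= minimum."""
    for column in metadata:
        metadata[column] = to_numeric_or_keep(metadata[column])
    return [value >= minimum for value in metadata["coverage"]]


def passes_query_on_copy(metadata, minimum):
    """Convert columns in a private dict so the caller's data stays untouched."""
    converted = {name: to_numeric_or_keep(column) for name, column in metadata.items()}
    return [value >= minimum for value in converted["coverage"]]


def make_metadata():
    # Year-only dates: every value in the date column parses as a number.
    return {"date": ["2018", "2019"], "coverage": ["0.95", "0.50"]}


# In-place conversion leaks into the later date filter.
shared = make_metadata()
passes_query_in_place(shared, 0.9)
try:
    [date_is_ambiguous(date) for date in shared["date"]]
    in_place_error = None
except AttributeError as error:
    in_place_error = str(error)

# Converting a copy leaves the date strings intact for the date filter.
safe = make_metadata()
query_result = passes_query_on_copy(safe, 0.9)
ambiguous = [date_is_ambiguous(date) for date in safe["date"]]
```

In the in-place variant the year-only dates become integers, so the string-based date check fails; in the copy variant the query sees numbers while the date check still sees strings.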

victorlin and others added 5 commits August 11, 2023 23:46
Shows the interaction between the query and exclude-ambiguous-dates-by
options. This test currently does not work as intended and will be fixed
in the following commit.
This fixes the unexpected behavior described in the previous commit.
@victorlin victorlin force-pushed the victorlin/revert-pandas-query-variable-extraction branch from 9f9bc4f to 58e24a3 Compare August 12, 2023 03:47
@victorlin
Member Author

I have confidence in these changes after @joverlee521's review, so merging without approval.

@victorlin victorlin merged commit 0a62483 into master Aug 14, 2023
26 checks passed
@victorlin victorlin deleted the victorlin/revert-pandas-query-variable-extraction branch August 14, 2023 13:38
Contributor

@joverlee521 joverlee521 left a comment

Post-merge approval 👍

I'm not a huge fan of the blanket conversion to numeric, but it should cover a majority of the use cases for augur filter. There are probably weird edge cases but they're probably unavoidable until we allow user defined types.

@huddlej
Contributor

huddlej commented Aug 15, 2023

> There are probably weird edge cases but they're probably unavoidable until we allow user defined types.

Some that immediately come to mind are numeric strain names or clade names. In a current project, I just needed to write a query like this in a standalone Python notebook, where my clade names are SHA256 hashes that can be fully numeric:

large_frequencies.query(
    "(future_timepoint == '2018-01-01') & (clade_membership in ['8000939', 'd9f0744']) & (horizon == 12)"
)

If I tried to do a query like that with this PR's augur filter implementation, I think it would still work since not all clade names are numeric and the type conversion to numeric should silently fail. It's not a query I'm likely to do with augur, though.

@victorlin
Member Author

victorlin commented Aug 16, 2023

@joverlee521

> I'm not a huge fan of the blanket conversion to numeric, but it should cover a majority of the use cases for augur filter. There are probably weird edge cases but they're probably unavoidable until we allow user defined types.

I agree the blanket conversion is not ideal. This was a problem introduced with extract_variables in #1268, which does blanket conversion on all columns in the query string, not just the ones used for numerical comparison. The change here extends the blanket conversion to metadata columns unused by the query string.

I can't think of any harm as long as the conversion is scoped to usage in this function (by not modifying values in place). There are two scenarios in which it can go wrong:

  1. A column is expected to be numeric, but at least one value is non-numeric.
  2. A column is expected to be non-numeric, but all values are numeric.

(1) will silently fail numeric conversion. If a numerical query is applied to that column, it will expose the internal error to the user (e.g. '>=' not supported between instances of 'str' and 'float').

(2) is already a problem with pandas's automatic type inference. This impacts queries such as the example by @huddlej above in the case that all clade names happen to be numeric in the chunk of metadata being processed. I think the proper solution is user-defined types.
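
Both scenarios can be reproduced in plain Python. This is a minimal sketch under stated assumptions: `to_numeric_or_keep` is a hypothetical helper that roughly mimics `pd.to_numeric(..., errors='ignore')` on a list of strings, and the column values are invented for illustration (the clade names are borrowed from the query quoted earlier in this thread).

```python
def to_numeric_or_keep(values):
    """Return numbers if every value parses as one, else an unchanged copy.

    Hypothetical stand-in for pd.to_numeric(..., errors='ignore').
    """
    def parse(value):
        try:
            return int(value)
        except ValueError:
            return float(value)

    try:
        return [parse(value) for value in values]
    except ValueError:
        return list(values)


# Scenario 1: expected numeric, but one value is non-numeric.
# Conversion silently gives up, and the numeric comparison raises.
coverage = to_numeric_or_keep(["0.95", "unknown", "0.80"])
try:
    [value >= 0.9 for value in coverage]
    scenario_1_message = None
except TypeError as error:
    scenario_1_message = str(error)

# Scenario 2: expected non-numeric, but all values happen to be numeric.
wanted = ["8000939", "d9f0744"]

# Mixed clade names: conversion gives up, string matching still works.
mixed_clades = to_numeric_or_keep(["8000939", "d9f0744"])
mixed_matches = [clade in wanted for clade in mixed_clades]

# All-numeric clade names: values become ints and no longer match strings.
numeric_clades = to_numeric_or_keep(["8000939", "1234567"])
numeric_matches = [clade in wanted for clade in numeric_clades]
```

Scenario 1 surfaces the internal comparison error to the user, while scenario 2 fails quietly: the all-numeric chunk returns no matches even though the first clade name is in the list.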

@tsibley
Member

tsibley commented Aug 23, 2023

> The attempt worked under limited testing, but introduced a bug that can only be resolved with further complex Pandas query parsing. I don't think that road is worth pursuing at the moment, so it's better to drop the effort entirely.

I think we could potentially sidestep further Pandas-specific query parsing and instead rely on the fact that Pandas' query grammar is a subset of Python's, and uses the ast stdlib under the hood. Since we only care about variable names (e.g. columns)—right?—I'd think we could parse the query with ast.parse() and then walk it to find the relevant ast.Name nodes.

@tsibley
Member

tsibley commented Aug 23, 2023

Concretely, something like:

import ast

columns_used = [
    node.id
    for node in ast.walk(ast.parse(query))
    if isinstance(node, ast.Name)
]

@victorlin
Member Author

@tsibley thanks, that's a very useful suggestion! Your snippet above works great and covers the syntax in #1277. I've integrated it into some other work which I hope to open a PR for soon.
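
For reference, the suggestion can be wrapped into a small runnable function. This is a sketch, not the code that was eventually integrated: `extract_names` is a hypothetical name, the `.str` query is an invented example of the accessor syntax named in the linked issue's title, and the second query is the notebook query quoted earlier in this thread. Results are sorted and de-duplicated because `ast.walk` does not promise source order.

```python
import ast


def extract_names(query):
    """Return the sorted, de-duplicated identifiers found as ast.Name nodes."""
    tree = ast.parse(query, mode="eval")
    return sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})


# Hypothetical query using the .str accessor on a column.
# Attribute names such as `str` and `startswith` are not ast.Name nodes.
str_query = "strain.str.startswith('A') & (coverage >= 0.9)"
str_names = extract_names(str_query)

# The notebook query quoted earlier in the thread.
notebook_query = (
    "(future_timepoint == '2018-01-01') & "
    "(clade_membership in ['8000939', 'd9f0744']) & (horizon == 12)"
)
notebook_names = extract_names(notebook_query)

# Pandas-specific syntax such as @local_variable is not valid Python,
# so plain ast.parse() rejects it.
try:
    extract_names("coverage >= @threshold")
    at_syntax_parses = True
except SyntaxError:
    at_syntax_parses = False
```

Two caveats apply to this sketch: any bare function name called in the query would also be reported as a name, and queries that use pandas-only syntax (`@variable` references or backtick-quoted column names) would need separate handling before parsing.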

Successfully merging this pull request may close these issues.

filter: --query fails when the .str accessor is used on a column
4 participants