filter: Try converting all queried columns to numerical type #1268

Merged
victorlin merged 5 commits into master from victorlin/filter-query-numerical
Jul 31, 2023

Conversation

victorlin
Member

@victorlin victorlin commented Jul 28, 2023

Description of proposed changes

The dtype inference in augur.io.read_metadata does not support numerical columns with empty values (because it calls pandas.read_csv with na_filter=False¹). This gets around that limitation by converting columns before querying.

I also considered infer_objects and convert_dtypes, but those are not useful here since they only support soft (not hard) conversions².

¹ a1bfce4
² https://stackoverflow.com/a/60278450
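
To make the limitation described above concrete, here is a minimal sketch. The metadata content and the `coverage` column are made up for illustration, and the snippet calls `pandas.read_csv` directly rather than `augur.io.read_metadata`. It also calls `pd.to_numeric` with its default error handling because the column is known to be numeric; the change in this PR passes `errors='ignore'` instead.

```python
import io

import pandas as pd

# Hypothetical metadata: "coverage" is numeric but has one empty value.
tsv = "strain\tcoverage\nA\t0.95\nB\t\nC\t0.80\n"

# na_filter=False keeps empty fields as empty strings,
# so the whole column is read as text instead of floats.
metadata = pd.read_csv(io.StringIO(tsv), sep="\t", na_filter=False)

is_numeric = pd.api.types.is_numeric_dtype
before = is_numeric(metadata["coverage"])

# Soft conversions only re-label what is already there; strings stay strings.
soft_infer = is_numeric(metadata["coverage"].infer_objects())
soft_convert = is_numeric(metadata["coverage"].convert_dtypes())

# Hard conversion parses the strings; the empty string becomes NaN.
converted = pd.to_numeric(metadata["coverage"])

# With a numeric column, a numerical comparison in a query works.
filtered = metadata.assign(coverage=converted).query("coverage >= 0.9")
print(filtered)
```

After the hard conversion, the row with the empty value is simply excluded by the comparison, since NaN compares as false.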

Related issue(s)

Fixes #1269

Addresses #1252 (comment)

Prompted by in-lab discussion with a user.

Testing

  • Test added and updated to show change in behavior
  • Checks pass

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

@victorlin victorlin self-assigned this Jul 28, 2023
@victorlin victorlin force-pushed the victorlin/filter-query-numerical branch from 23f409c to cdeb40d Compare July 28, 2023 19:36
@codecov

codecov bot commented Jul 28, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.31% 🎉

Comparison is base (4f5559a) 69.36% compared to head (ce756c3) 69.67%.
Report is 6 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1268      +/-   ##
==========================================
+ Coverage   69.36%   69.67%   +0.31%     
==========================================
  Files          66       67       +1     
  Lines        7024     7104      +80     
  Branches     1708     1727      +19     
==========================================
+ Hits         4872     4950      +78     
- Misses       1847     1848       +1     
- Partials      305      306       +1     
| Files Changed | Coverage Δ |
|---|---|
| augur/filter/include_exclude_rules.py | 97.93% <100.00%> (+0.19%) ⬆️ |

... and 2 files with indirect coverage changes

@victorlin victorlin marked this pull request as ready for review July 28, 2023 19:42
@victorlin victorlin requested a review from a team July 28, 2023 19:42
@victorlin victorlin force-pushed the victorlin/filter-query-numerical branch from 4a99511 to 5c8f6b8 Compare July 28, 2023 22:59
In the end, it's only worth calling to_numeric on the columns used for
numerical comparison.

This gets us halfway there, since in most cases, only a small subset of
metadata columns are used in the query.

This is a hacky approach, but it is more computationally efficient.

Since this is now being used in multiple places of the file.
@victorlin victorlin force-pushed the victorlin/filter-query-numerical branch from 5c8f6b8 to ce756c3 Compare July 28, 2023 23:02
Comment on lines +193 to +195
# Try converting all queried columns to numeric.
for column in extract_variables(query).intersection(metadata.columns):
    metadata[column] = pd.to_numeric(metadata[column], errors='ignore')
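
The loop above can be tried out in isolation with the sketch below. Everything in it is an assumption for illustration: `extract_variables_sketch` is a hypothetical stand-in built on Python's `ast` module, not the `extract_variables` helper used in Augur, and it only handles simple queries with bare column names. `to_numeric_if_possible` approximates `errors='ignore'` by returning the column unchanged when parsing fails.

```python
import ast

import pandas as pd


def extract_variables_sketch(query: str) -> set:
    """Hypothetical helper: collect bare names from a simple query expression."""
    tree = ast.parse(query, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def to_numeric_if_possible(series: pd.Series) -> pd.Series:
    """Return a numeric version of the column, or the column itself if parsing fails."""
    try:
        return pd.to_numeric(series)
    except (ValueError, TypeError):
        return series


# Made-up metadata where every column was read as text.
metadata = pd.DataFrame({
    "strain": ["A", "B", "C"],
    "coverage": ["0.95", "", "0.80"],
    "country": ["USA", "USA", "Canada"],
    "length": ["29000", "", "29900"],
})
query = "coverage >= 0.9 and country == 'USA'"

# Only columns that appear in the query are candidates for conversion.
queried = extract_variables_sketch(query) & set(metadata.columns)
for column in queried:
    metadata[column] = to_numeric_if_possible(metadata[column])

result = metadata.query(query)
print(result)
```

In this sketch `length` is never touched because the query does not mention it, which is the saving the commit messages describe: conversion cost scales with the number of queried columns rather than with the width of the metadata.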
Member Author

In the force-pushes above, I split 77ef4a5 into b325b97 + 2ead5b3.

@victorlin
Member Author

Merging pre-review since this is a functional improvement, as noted on Slack.

@victorlin victorlin merged commit a35f7a6 into master Jul 31, 2023
26 checks passed
@victorlin victorlin deleted the victorlin/filter-query-numerical branch July 31, 2023 18:08
Successfully merging this pull request may close these issues.

filter: --query fails when numerical comparisons are used on columns with missing values