Apply filters in a loop #749

huddlej · 2021-07-14T00:11:42Z

Description of proposed changes

Reorganizes filtering logic such that we first create a list of filters to apply through a construct_filters function and then apply those filters through a new apply_filters function that loops through each filter. This new function returns strains to keep, exclude, and force-include. The lists of excluded and force-included strains also include the reasons why these strains were filtered.

Most other changes in this PR update the corresponding calling code in run to use these new functions. One intended benefit of this new approach to filtering is the ability to log the reason why each strain was filtered or force-included as an additional log output file, --output-log.

The general goal of this reorganization is to allow filtering to work for a full metadata file or individual chunks of metadata. A subsequent PR will further reorganize this code to use chunks of metadata and thereby limit the amount of memory used by the filter command at any given time.

These changes are all backward-compatible, although the addition of the --output-log argument warrants a new feature release.

Related issue(s)

Resolves #424
Related to #699

Testing

Adds functional tests and doctests for most new code. One exception is the new construct_filters function that is tested via existing functional tests.
Tested with ncov workflow's nextstrain-ci profile
Tested with a minimal seasonal flu H3N2 analysis

codecov · 2021-07-19T16:22:12Z

Codecov Report

Merging #749 (eea1741) into master (653079e) will increase coverage by 0.47%.
The diff coverage is 60.56%.

❗ Current head eea1741 differs from pull request most recent head 8fb5405. Consider uploading reports for the commit 8fb5405 to get more accurate results

@@            Coverage Diff             @@
##           master     #749      +/-   ##
==========================================
+ Coverage   31.52%   32.00%   +0.47%     
==========================================
  Files          41       41              
  Lines        5754     5790      +36     
  Branches     1389     1406      +17     
==========================================
+ Hits         1814     1853      +39     
+ Misses       3865     3858       -7     
- Partials       75       79       +4

Impacted Files	Coverage Δ
augur/filter.py	`56.41% <60.56%> (+4.32%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 653079e...8fb5405. Read the comment docs.

Reorganizes filtering logic such that we first create a list of filters to apply through a `construct_filters` function and then apply those filters through a new `apply_filters` function that loops through each filter. This new function returns a tuple of strains in three categories of strains to keep, strains to exclude, and strains to force include. This function can be applied to a full metadata file or individual chunks of metadata. The lists of excluded and force-included strains also include the reasons why these strains were filtered. The rest of this commit updates the corresponding calling code in `run` to use these new functions. One intended benefit of this new approach to filtering is the ability to log the reason why each strain was filtered or force-included as an additional log output file, `--output-log`.

huddlej

I am mostly happy with how this turned out, but I left some inline comments where I was uncertain about my choices. One really cool outcome of this PR is the --output-log that lets us view why strains were filtered like in this visidata summary for a seasonal influenza run:

Or (for @tsibley):

filter	count	percent	histogram
filter_by_sequence_length	2110	75.25	**************************************************
filter_by_exclude	384	13.69	*********
filter_by_exclude_where	249	8.88	*****
filter_by_non_nucleotide	61	2.18	*

Then you can zoom in more to see the kwargs for each filter:

huddlej · 2021-07-20T00:10:38Z

augur/filter.py

+    return included
+
+
+def construct_filters(args, metadata, sequence_index):


This is the function that I'm least happy about for a couple of reasons. The main advantage of this function is that it moves a bunch of if/else checks out of run, so we can more easily scan that function for the bigger picture operations.

The disadvantage is that this function simply repackages the contents from run into a separate place without making it that much simpler. The current interface is difficult to test by doctest or unit test, so it is only tested by functional tests.

I also wonder whether returning lists of dictionaries of the structure [{"function": <function>, "function_kwargs": {}}] would be better (clearer) than returning a tuple of function and args.

so we can more easily scan that function for the bigger picture operations.

This alone is important and valuable.

The current interface is difficult to test by doctest or unit test, so it is only tested by functional tests.

Interface of augur filter or construct_filters(…)? Can you say more about the impediments you encountered here to writing the kind of tests you'd want?

…better (clearer) than returning a tuple of function and args.

I think a tuple of function and args is a good representation here. It's what's expected by places it'd be used, like pd.DataFrame.pipe() and functools.partial().

Alternatively, instead of passing around (func, args) tuples, we could pass around just partially bound functions. This has the nice property of meaning the code just has to handle a list of arbitrary functions, and partial objects already have attributes we can read for --output-log reporting. We might want to augment the partials with a .__name__ or custom .__str__() routine, but that can happen opportunistically in --output-log formatting code.

Similarly, we could instead turn the filtering functions you have now into factory functions that take filter params and return a customized filter function. This is akin to construct_filters() creating a list of partially-bound functions, but instead the filter functions themselves get delegated that task. This decouples it construct_filters() and let's it be customized independently. It also provides a point for each filter factory to verify filter params make sense or raise errors which construct_filters() doesn't need to be aware of.

Extending this, I think you could then get rid of construct_filters() almost entirely by making the argparse argument types the filter function factories, e.g. --include-where is defined with type=include_by_query (or whatever its named :-P) which produces a function based on the --include-where value. The "construction" of filters is then delegated, and all that's left is to arrange the various args, e.g. args.include_where, into include_by and exclude_by in the desired order.

Can you say more about the impediments you encountered here to writing the kind of tests you'd want?

I don't like mocking argparse objects for doctests or unit tests (it's a lot of boilerplate setup code that doesn't communicate much about the functions). I also don't like testing a long list of if statements, since it tends to feel like testing 2 + 2 = 4. Functional tests seem more appropriate here. My main issue with our current functional tests is that I have no sense of what code they cover. I know this isn't everything, but it is important to know when some bit of code is never run by tests.

...get rid of construct_filters() almost entirely by making the argparse argument types the filter function factories...

Thank you for the great ideas here! This specific approach is most appealing, at least as a place to validate inputs. Given that the current approach makes sense to you though, I'd leave this type of restructuring for a later PR where that's the complete focus of the PR.

augur/filter.py

huddlej · 2021-07-31T00:05:56Z

Superseded by #750

huddlej mentioned this pull request Jul 14, 2021

Reduce metadata memory usage in augur filter #699

Closed

huddlej force-pushed the loop-through-filters branch from f253140 to 0274f08 Compare July 16, 2021 16:24

huddlej force-pushed the loop-through-filters branch 2 times, most recently from dcc23d9 to 3bd3f00 Compare July 19, 2021 23:20

huddlej marked this pull request as ready for review July 19, 2021 23:32

huddlej force-pushed the loop-through-filters branch from 3bd3f00 to 8fb5405 Compare July 20, 2021 00:05

huddlej commented Jul 20, 2021

View reviewed changes

huddlej requested review from tsibley and joverlee521 July 20, 2021 00:16

huddlej commented Jul 20, 2021

View reviewed changes

augur/filter.py Show resolved Hide resolved

This was referenced Jul 27, 2021

Apply filters and group-by logic to a metadata iterator #750

Merged

Throw an error when any requested filter or group-by columns don't exist #754

Closed

huddlej closed this Jul 31, 2021

victorlin deleted the loop-through-filters branch June 30, 2023 20:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apply filters in a loop #749

Apply filters in a loop #749

huddlej commented Jul 14, 2021 •

edited

Loading

codecov bot commented Jul 19, 2021 •

edited

Loading

huddlej left a comment •

edited

Loading

huddlej Jul 20, 2021

tsibley Jul 20, 2021

tsibley Jul 20, 2021

tsibley Jul 20, 2021

huddlej Jul 27, 2021

huddlej commented Jul 31, 2021

		return included


		def construct_filters(args, metadata, sequence_index):

Apply filters in a loop #749

Apply filters in a loop #749

Conversation

huddlej commented Jul 14, 2021 • edited Loading

Description of proposed changes

Related issue(s)

Testing

codecov bot commented Jul 19, 2021 • edited Loading

Codecov Report

huddlej left a comment • edited Loading

Choose a reason for hiding this comment

huddlej Jul 20, 2021

Choose a reason for hiding this comment

tsibley Jul 20, 2021

Choose a reason for hiding this comment

tsibley Jul 20, 2021

Choose a reason for hiding this comment

tsibley Jul 20, 2021

Choose a reason for hiding this comment

huddlej Jul 27, 2021

Choose a reason for hiding this comment

huddlej commented Jul 31, 2021

huddlej commented Jul 14, 2021 •

edited

Loading

codecov bot commented Jul 19, 2021 •

edited

Loading

huddlej left a comment •

edited

Loading