Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply filters in a loop #749

Closed
wants to merge 1 commit into from
Closed

Apply filters in a loop #749

wants to merge 1 commit into from

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Jul 14, 2021

Description of proposed changes

Reorganizes filtering logic such that we first create a list of filters to apply through a construct_filters function and then apply those filters through a new apply_filters function that loops through each filter. This new function returns strains to keep, exclude, and force-include. The lists of excluded and force-included strains also include the reasons why these strains were filtered.

Most other changes in this PR update the corresponding calling code in run to use these new functions. One intended benefit of this new approach to filtering is the ability to log the reason why each strain was filtered or force-included as an additional log output file, --output-log.

The general goal of this reorganization is to allow filtering to work for a full metadata file or individual chunks of metadata. A subsequent PR will further reorganize this code to use chunks of metadata and thereby limit the amount of memory used by the filter command at any given time.

These changes are all backward-compatible, although the addition of the --output-log argument warrants a new feature release.

Related issue(s)

Resolves #424
Related to #699

Testing

  • Adds functional tests and doctests for most new code. One exception is the new construct_filters function that is tested via existing functional tests.
  • Tested with ncov workflow's nextstrain-ci profile
  • Tested with a minimal seasonal flu H3N2 analysis

@codecov
Copy link

codecov bot commented Jul 19, 2021

Codecov Report

Merging #749 (eea1741) into master (653079e) will increase coverage by 0.47%.
The diff coverage is 60.56%.

❗ Current head eea1741 differs from pull request most recent head 8fb5405. Consider uploading reports for the commit 8fb5405 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #749      +/-   ##
==========================================
+ Coverage   31.52%   32.00%   +0.47%     
==========================================
  Files          41       41              
  Lines        5754     5790      +36     
  Branches     1389     1406      +17     
==========================================
+ Hits         1814     1853      +39     
+ Misses       3865     3858       -7     
- Partials       75       79       +4     
Impacted Files Coverage Δ
augur/filter.py 56.41% <60.56%> (+4.32%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 653079e...8fb5405. Read the comment docs.

@huddlej huddlej force-pushed the loop-through-filters branch 2 times, most recently from dcc23d9 to 3bd3f00 Compare July 19, 2021 23:20
@huddlej huddlej marked this pull request as ready for review July 19, 2021 23:32
Reorganizes filtering logic such that we first create a list of filters
to apply through a `construct_filters` function and then apply those
filters through a new `apply_filters` function that loops through each
filter. This new function returns a tuple of strains in three categories
of strains to keep, strains to exclude, and strains to force include.
This function can be applied to a full metadata file or individual
chunks of metadata. The lists of excluded and force-included strains
also include the reasons why these strains were filtered.

The rest of this commit updates the corresponding calling code in `run`
to use these new functions. One intended benefit of this new approach to
filtering is the ability to log the reason why each strain was filtered
or force-included as an additional log output file, `--output-log`.
Copy link
Contributor Author

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am mostly happy with how this turned out, but I left some inline comments where I was uncertain about my choices. One really cool outcome of this PR is the --output-log that lets us view why strains were filtered like in this visidata summary for a seasonal influenza run:

C07C86C1-F1F5-40C5-95FA-FCB0BC7A7A7D

Or (for @tsibley):

filter count percent histogram
filter_by_sequence_length 2110 75.25 **************************************************
filter_by_exclude 384 13.69 *********
filter_by_exclude_where 249 8.88 *****
filter_by_non_nucleotide 61 2.18 *

Then you can zoom in more to see the kwargs for each filter:

image

return included


def construct_filters(args, metadata, sequence_index):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the function that I'm least happy about for a couple of reasons. The main advantage of this function is that it moves a bunch of if/else checks out of run, so we can more easily scan that function for the bigger picture operations.

The disadvantage is that this function simply repackages the contents from run into a separate place without making it that much simpler. The current interface is difficult to test by doctest or unit test, so it is only tested by functional tests.

I also wonder whether returning lists of dictionaries of the structure [{"function": <function>, "function_kwargs": {}}] would be better (clearer) than returning a tuple of function and args.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we can more easily scan that function for the bigger picture operations.

This alone is important and valuable.

The current interface is difficult to test by doctest or unit test, so it is only tested by functional tests.

Interface of augur filter or construct_filters(…)? Can you say more about the impediments you encountered here to writing the kind of tests you'd want?

…better (clearer) than returning a tuple of function and args.

I think a tuple of function and args is a good representation here. It's what's expected by places it'd be used, like pd.DataFrame.pipe() and functools.partial().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, instead of passing around (func, args) tuples, we could pass around just partially bound functions. This has the nice property of meaning the code just has to handle a list of arbitrary functions, and partial objects already have attributes we can read for --output-log reporting. We might want to augment the partials with a .__name__ or custom .__str__() routine, but that can happen opportunistically in --output-log formatting code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, we could instead turn the filtering functions you have now into factory functions that take filter params and return a customized filter function. This is akin to construct_filters() creating a list of partially-bound functions, but instead the filter functions themselves get delegated that task. This decouples it construct_filters() and let's it be customized independently. It also provides a point for each filter factory to verify filter params make sense or raise errors which construct_filters() doesn't need to be aware of.

Extending this, I think you could then get rid of construct_filters() almost entirely by making the argparse argument types the filter function factories, e.g. --include-where is defined with type=include_by_query (or whatever its named :-P) which produces a function based on the --include-where value. The "construction" of filters is then delegated, and all that's left is to arrange the various args, e.g. args.include_where, into include_by and exclude_by in the desired order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you say more about the impediments you encountered here to writing the kind of tests you'd want?

I don't like mocking argparse objects for doctests or unit tests (it's a lot of boilerplate setup code that doesn't communicate much about the functions). I also don't like testing a long list of if statements, since it tends to feel like testing 2 + 2 = 4. Functional tests seem more appropriate here. My main issue with our current functional tests is that I have no sense of what code they cover. I know this isn't everything, but it is important to know when some bit of code is never run by tests.

...get rid of construct_filters() almost entirely by making the argparse argument types the filter function factories...

Thank you for the great ideas here! This specific approach is most appealing, at least as a place to validate inputs. Given that the current approach makes sense to you though, I'd leave this type of restructuring for a later PR where that's the complete focus of the PR.

augur/filter.py Show resolved Hide resolved
augur/filter.py Show resolved Hide resolved
@huddlej
Copy link
Contributor Author

huddlej commented Jul 31, 2021

Superseded by #750

@huddlej huddlej closed this Jul 31, 2021
@victorlin victorlin deleted the loop-through-filters branch June 30, 2023 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Record Filtered and Pruned Sequences
2 participants