Read a subset of metadata columns #1294

victorlin · 2023-08-25T18:07:06Z

Description of proposed changes

Reading a subset of metadata columns is straightforward for most subcommands, but requires some big changes in augur filter:

Previously, --output-metadata relied on an in-memory pandas DataFrame containing all columns. This PR re-implements that option and similar options using a custom Metadata class to write output metadata based on a streamed representation of the input file.
Reading a subset of columns for augur filter requires knowing which columns are used by augur filter --query. Automatic detection is implemented here, but --query is generic enough that I'm not confident the detection will always work. In cases where it doesn't, an error is raised.
A new option --query-columns allows the user to specify what columns are used in --query along with the expected data types.
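A rough sketch of the idea behind --query-columns: parse user-supplied column/type pairs and apply them before running the query. The `column:type` syntax, the type names, and the helper `parse_query_columns` are assumptions for illustration and may not match the interface in this PR.

```python
import pandas as pd

# Hypothetical mapping from user-facing type names to pandas dtypes.
TYPE_MAP = {"str": "string", "int": "int64", "float": "float64", "bool": "bool"}


def parse_query_columns(pairs):
    """Parse 'column:type' strings into a {column: dtype} mapping."""
    dtypes = {}
    for pair in pairs:
        # Split on the last colon so column names may themselves contain colons.
        column, sep, type_name = pair.rpartition(":")
        if not sep or not column or type_name not in TYPE_MAP:
            raise ValueError(f"Invalid column:type pair: {pair!r}")
        dtypes[column] = TYPE_MAP[type_name]
    return dtypes


# Toy metadata where every value starts out as a string.
df = pd.DataFrame({"strain": ["A", "B", "C"], "coverage": ["0.95", "0.40", "0.99"]})

# Convert only the column named by the user, then run the query.
dtypes = parse_query_columns(["coverage:float"])
filtered = df.astype(dtypes).query("coverage >= 0.9")
kept = filtered["strain"].tolist()
```

Only the columns the user names are converted, so the query compares numbers rather than strings without relying on automatic detection.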

See commits for details.

Related issues

This PR builds on ideas from discussion on Slack and experimentation in #1242.

Checklist

Tests modified for changes in behavior
Tests added for --query-columns
Checks pass
Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.
Try running on large datasets (thread)

victorlin · 2023-09-01T22:52:31Z

CHANGES.md

Real-world testing

I ran this against the metadata file produced by ncov-ingest (s3://nextstrain-data/files/ncov/open/metadata.tsv.zst) which has 8.5 million rows x 58 columns. I used the following command to sample to 10 random sequences:

augur filter \
  --metadata metadata.tsv \
  --subsample-max-sequences 10 \
  --output-metadata out.tsv

This took 8m21s to run on master, and 6m21s with changes from this PR. Here are profiling results from before and after, which I visualized in Snakeviz. A summary:

Time spent accessing in-memory DataFrames dropped from 494s to 258s.

Without these changes, to_csv takes just a fraction of a second because the metadata for the 10 sequences is already loaded into memory.

With these changes, it takes 109s to run through the metadata file to find the lines for the 10 sequences that are wanted.

The example command sees a net improvement in run time. Although writing time increased due to 5173cb7, reading time decreased even more due to ac23e80.

This was a "best case scenario" for these changes though, since no metadata columns were used, only strain name. I should probably test with --group-by, --min-date, and other options that load additional columns to get a better picture.

I did not do any memory profiling. Memory usage is not an issue without these changes, and should be less of an issue with the changes.

Did you get round to doing more testing/profiling with --group-by and --min-date and what not?

No, not yet. Still planning to do so before merging.

Tested using the ncov 100k subsample as input to an augur filter command I grabbed from an ncov build. Run time was 16s on master and 7.37s with these changes (cProfile files).

Summary:

Metadata reading dropped from 4.3s to 2.15s (pandas readers.py:read)

Accessing in-memory DataFrames dropped from 8.63s to 1.127s (pandas indexing.py)

Output writing increased from ~0s to 1.35s (write_metadata_based_outputs)

augur filter \
  --metadata ~/tmp/augur-filter/metadata.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --max-date 6M \
  --exclude-where 'region!=Asia' country=China country=India \
  --group-by country year month \
  --subsample-max-sequences 200 \
  --output-strains ~/tmp/augur-filter/strains.txt

I triggered an ncov GISAID trial run using a Docker image including these changes. It completed successfully in 6h 36m 55s, which is pretty much the same as another trial run 2 days before at 6h 40m 27s. I don't know how much variance there is between run times, and I don't want to compare against non-trial runs or older runs (those have additional Slack notifications and different input metadata sizes). So by this comparison alone, there doesn't seem to be a significant performance benefit for the ncov workflow with GISAID configuration.

Hmm.

ISTM that last time I looked at ncov's execution profile, by far the slowest step was TreeTime. So not altogether surprising that filter's speed isn't a big impact in the context of a full build.

I grabbed the benchmarks/subsample_* files from those runs to get a little more granular insight into differences in wall clock time and max RSS for each subsample rule invocation.

avg(after - before) for wall clock time was -112s, so it shaved roughly 2 min off each subsample step on average. Equivalent for max RSS is -276 (MB, I believe).
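For reference, the avg(after - before) calculation amounts to something like the sketch below. The rule names and numbers are placeholders for illustration only, not values from the actual benchmark files, and `average_difference` is a hypothetical helper.

```python
from statistics import mean


def average_difference(before, after):
    """Average of (after - before) over the keys present in both runs."""
    common = sorted(before.keys() & after.keys())
    if not common:
        raise ValueError("No benchmarks in common between the two runs")
    return mean(after[key] - before[key] for key in common)


# Placeholder wall clock times in seconds (NOT the real measurements).
before = {"subsample_a": 300.0, "subsample_b": 500.0}
after = {"subsample_a": 200.0, "subsample_b": 380.0}

avg_delta = average_difference(before, after)  # negative means faster after
```

Pairing by key matters here: each subsample rule invocation is compared against the same invocation in the other run rather than against an overall total.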

tsibley

Looks pretty good overall! Several small comments and one significant behaviour change to consider.

victorlin · 2024-01-19T23:25:37Z

#1252 changes the same parts of the code and is likely to be merged soon, so I'm going to rebase this PR on top of that.

victorlin · 2024-01-22T19:56:47Z

Force-pushed changes:

Added new output to tests
Split 3da1ced into 5a9fa15...d8b5765
Optimized usage of metadata_object in d8b5765
Address metadata column subsetting in other subcommands: d8b5765...25a7049

Will mark this PR as draft until conversations above are addressed.

codecov · 2024-02-03T00:58:39Z

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (9b31ad8) 66.73% compared to head (b56f699) 67.13%.

Files	Patch %	Lines
augur/filter/include_exclude_rules.py	78.37%	5 Missing and 3 partials ⚠️
augur/filter/io.py	97.05%	1 Missing and 1 partial ⚠️
augur/io/metadata.py	95.55%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1294      +/-   ##
==========================================
+ Coverage   66.73%   67.13%   +0.39%     
==========================================
  Files          69       69              
  Lines        7321     7446     +125     
  Branches     1798     1831      +33     
==========================================
+ Hits         4886     4999     +113     
- Misses       2168     2176       +8     
- Partials      267      271       +4

tsibley

This LGTM. I haven't gone over the new stuff with as fine a toothed comb as my first pass, but I did skim thru it.

Metadata is defined by its delimiter, columns, and rows. These properties are inferred and returned in read_metadata, but that function is pandas-focused. This commit is part of a PR that aims to move away from using pandas to represent and access metadata, at least in some parts of the code. Alternatives will need a way to access these properties, and this new class interface serves that purpose. It will be used in subsequent commits.
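As a minimal sketch of what such an interface could look like (not the class added in this PR), the following exposes the delimiter, columns, and a streamed view of rows using only the standard library. The name `MetadataSketch` and its constructor arguments are hypothetical.

```python
import csv
import os
import tempfile


class MetadataSketch:
    """Hypothetical interface to a metadata file: delimiter, columns, rows."""

    def __init__(self, path, delimiters=(",", "\t")):
        self.path = path
        with open(path, newline="") as handle:
            header = handle.readline()
        # Infer the delimiter from the header line, limited to the candidates.
        dialect = csv.Sniffer().sniff(header, delimiters="".join(delimiters))
        self.delimiter = dialect.delimiter
        self.columns = next(csv.reader([header], delimiter=self.delimiter))

    def rows(self):
        """Stream rows as dictionaries without loading the whole file."""
        with open(self.path, newline="") as handle:
            yield from csv.DictReader(handle, delimiter=self.delimiter)


# Small demonstration on a temporary TSV file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.tsv")
    with open(path, "w", newline="") as handle:
        handle.write("strain\tdate\tcountry\nA\t2020-01-01\tChina\nB\t2020-02-01\tIndia\n")
    metadata = MetadataSketch(path)
    delimiter, columns = metadata.delimiter, metadata.columns
    strains = [row["strain"] for row in metadata.rows()]
```

Because `rows()` is a generator, callers can make a pass over the file without holding every row in memory.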

Previously, the output metadata and strain files were constructed as multi-stage exports of an in-memory DataFrame representation of the original metadata. This worked fine, except:

1. The order of rows was not guaranteed to follow that of the original metadata.
2. This relies on perfect round-tripping of file→DataFrame→file contents.
3. This relies on having all columns of the metadata available in the in-memory DataFrame representation.

(1) is not a big deal, but is addressed by this change. (2) can be prone to issues, though nothing blocking has been brought to our attention. (3) is the main motivator given recent ideas to optimize I/O operations in augur filter.

With this change, there is no longer a need to have all columns of the metadata stored in memory, which opens the door to further optimizations.
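The second-pass approach can be sketched as follows: stream the original file and copy through only the rows whose ID was selected, so row order and all columns are preserved without a full in-memory DataFrame. This is an illustration under assumptions, not the implementation in this PR; `write_metadata_subset` and the `strain` ID column are hypothetical names.

```python
import csv
import io


def write_metadata_subset(input_handle, output_handle, keep_ids,
                          id_column="strain", delimiter="\t"):
    """Copy the header and the rows whose ID is in keep_ids, in input order."""
    reader = csv.reader(input_handle, delimiter=delimiter)
    writer = csv.writer(output_handle, delimiter=delimiter, lineterminator="\n")
    header = next(reader)
    id_index = header.index(id_column)
    writer.writerow(header)
    for row in reader:
        if row[id_index] in keep_ids:
            writer.writerow(row)


# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "strain\tdate\tcountry\n"
    "A\t2020-01-01\tChina\n"
    "B\t2020-02-01\tIndia\n"
    "C\t2020-03-01\tJapan\n"
)
output = io.StringIO()
write_metadata_subset(source, output, keep_ids={"C", "A"})
result = output.getvalue()
```

The trade-off is an extra pass over the input at write time, in exchange for only needing the filtering columns in memory.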

Previously added¹ then removed² due to being insufficient. Adding it back with a new approach and explicit limitations.

¹ 2ead5b3
² e8322a5

This will be used in a later commit.

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

Converting queried columns should be much better than converting all columns: in most cases, only a small subset of metadata columns are used in the query. A stretch goal would be to convert only the columns used for numerical comparison. However, detecting which are used *as numeric* is a harder problem, and benefits are marginal for common queries that don't use many columns. Note that the column extraction function isn't perfect and it returns all or nothing. If nothing is returned, then fall back to the previous behavior of converting all columns.
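One way to picture the all-or-nothing behavior is the sketch below, which parses the query as a Python expression and collects names that match known columns. This is a simplified illustration, not the detection logic in this PR; `extract_query_columns` is a hypothetical name, and real pandas queries allow syntax (backtick-quoted names, `@variables`) that this parser deliberately gives up on.

```python
import ast


def extract_query_columns(query, known_columns):
    """Return the set of known columns referenced in query, or None if unsure."""
    try:
        tree = ast.parse(query, mode="eval")
    except SyntaxError:
        # Query syntax that plain Python cannot parse: report nothing.
        return None
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    columns = names & set(known_columns)
    return columns or None


all_columns = ["strain", "country", "coverage"]

# A simple query: both referenced columns are detected.
detected = extract_query_columns("country == 'China' and coverage >= 0.9", all_columns)

# A query with a backtick-quoted name: detection gives up and returns None.
fallback = extract_query_columns("`my column` > 1", all_columns)

# When nothing is returned, fall back to converting all columns.
to_convert = fallback if fallback is not None else set(all_columns)
```

Returning `None` rather than a partial set keeps the fallback safe: a partial answer could silently leave a queried column unconverted.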

This serves as an "escape hatch" for when automatic type inference does not work as expected for whatever reason.

The motivation here is that it's common for subcommands to use just a fraction of the columns. This should have performance improvements when the metadata file is large and most columns are unused. Subsequent commits will update subcommands to use this new feature.
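With pandas, reading only the needed columns comes down to the `usecols` parameter of `read_csv`. The column names and data below are made up for illustration, and the actual call sites in augur may pass additional options.

```python
import io

import pandas as pd

# In-memory stand-in for a wide metadata file.
tsv = io.StringIO(
    "strain\tdate\tcountry\tregion\tauthors\n"
    "A\t2020-01-01\tChina\tAsia\tSmith et al\n"
    "B\t2020-02-01\tIndia\tAsia\tJones et al\n"
)

# Only these columns are parsed and stored; the rest are skipped.
needed = ["strain", "date"]
df = pd.read_csv(tsv, sep="\t", usecols=needed, dtype="string")

loaded_columns = list(df.columns)
```

Skipped columns are never materialized in the DataFrame, which is where the savings come from when most columns are unused.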

Before these changes, all columns were read into memory even if they were not used for actual filtering. The only reason was because metadata-based outputs were created from the in-memory representation of metadata. An earlier commit switched to creating those outputs by doing another pass of the original metadata. That means there is no longer a need to have all columns in memory.

Before these changes, all columns were read into memory even though only a few are used. This reads in the minimum necessary columns, which brings performance improvements.

Before these changes, all columns were read into memory even though only the ID and date columns are used. This reads in just the two columns, which brings performance improvements.

victorlin self-assigned this Aug 25, 2023

victorlin requested a review from a team August 25, 2023 18:33

victorlin marked this pull request as ready for review August 25, 2023 18:33

victorlin mentioned this pull request Aug 25, 2023

filter: Rewrite using SQLite3 #1242

Draft

7 tasks

victorlin force-pushed the victorlin/improved-filter-io branch from a4a7892 to 7f9e462 Compare August 29, 2023 23:26

victorlin commented Sep 1, 2023

victorlin mentioned this pull request Jan 13, 2024

Use less dtype inference when reading metadata into DataFrames #1252

Merged

3 tasks

victorlin commented Jan 13, 2024

tsibley requested changes Jan 19, 2024

victorlin force-pushed the victorlin/improved-filter-io branch from 7f9e462 to 96e891c Compare January 19, 2024 23:44

victorlin changed the base branch from master to victorlin/read-metadata-dtypes January 19, 2024 23:45

victorlin force-pushed the victorlin/improved-filter-io branch 3 times, most recently from 0e35137 to 0038a4f Compare January 22, 2024 19:56

victorlin marked this pull request as draft January 22, 2024 19:56

victorlin force-pushed the victorlin/improved-filter-io branch from 0038a4f to b884236 Compare January 23, 2024 00:44

victorlin force-pushed the victorlin/read-metadata-dtypes branch from 2a90aab to 7e81765 Compare January 24, 2024 23:40

Base automatically changed from victorlin/read-metadata-dtypes to master January 24, 2024 23:43

victorlin mentioned this pull request Jan 25, 2024

frequencies: Annotate tips with the minimum necessary information #1398

Merged

3 tasks

victorlin force-pushed the victorlin/improved-filter-io branch 2 times, most recently from 64a5474 to 679519e Compare January 31, 2024 19:48

victorlin mentioned this pull request Jan 31, 2024

Miscellaneous changes #1406

Merged

2 tasks

victorlin changed the base branch from master to victorlin/miscellaneous January 31, 2024 19:52

victorlin force-pushed the victorlin/improved-filter-io branch from 679519e to 1a2dae9 Compare January 31, 2024 20:00

victorlin changed the title from "filter: Improve I/O operations and query handling" to "Improve metadata I/O operations and filter query handling" Feb 1, 2024

victorlin force-pushed the victorlin/improved-filter-io branch from 1a2dae9 to f0d0e47 Compare February 1, 2024 20:39

victorlin marked this pull request as ready for review February 1, 2024 20:40

victorlin requested a review from tsibley February 1, 2024 20:40

victorlin force-pushed the victorlin/improved-filter-io branch 2 times, most recently from 013fa01 to 3335625 Compare February 3, 2024 00:45

victorlin force-pushed the victorlin/miscellaneous branch from ff98de7 to 5be2639 Compare February 3, 2024 00:45

victorlin force-pushed the victorlin/improved-filter-io branch from 3335625 to 0924b31 Compare February 3, 2024 00:49

victorlin changed the title from "Improve metadata I/O operations and filter query handling" to "Read a subset of metadata columns" Feb 6, 2024

victorlin force-pushed the victorlin/improved-filter-io branch from 0924b31 to 1d52bc5 Compare February 7, 2024 00:10

victorlin force-pushed the victorlin/miscellaneous branch from 5be2639 to e13834d Compare February 7, 2024 23:29

Base automatically changed from victorlin/miscellaneous to master February 7, 2024 23:32

tsibley approved these changes Feb 7, 2024

victorlin and others added 4 commits February 7, 2024 17:43

victorlin force-pushed the victorlin/improved-filter-io branch from 1d52bc5 to 5d1c045 Compare February 8, 2024 01:53

victorlin added 6 commits February 7, 2024 18:14

filter: Add --query-columns option

b0a0d11

This serves as an "escape hatch" for when automatic type inference does not work as expected for whatever reason.

frequencies: Read a subset of metadata columns

dce0374

Before these changes, all columns were read into memory even though only a few are used. This reads in the minimum necessary columns, which brings performance improvements.

refine: Read a subset of metadata columns

00a600f

Before these changes, all columns were read into memory even though only the ID and date columns are used. This reads in just the two columns, which brings performance improvements.

Update changelog

b56f699

victorlin force-pushed the victorlin/improved-filter-io branch from 5d1c045 to b56f699 Compare February 8, 2024 02:14

victorlin merged commit 8678ae9 into master Feb 8, 2024
20 checks passed

victorlin deleted the victorlin/improved-filter-io branch February 8, 2024 02:57

This was referenced Feb 23, 2024

frequencies: error with --region flag #1423

Closed

frequencies: Fix --regions flag #1424

Merged

victorlin mentioned this pull request Aug 9, 2024

Speed up augur filter without replacing Pandas #1573

Open

5 tasks
