filter: Rewrite using SQLite3 #1242

This will be used in future commits by a new implementation of augur filter.

These will be used in future commits by a new implementation of augur filter.

To be used in future commits.

This serves multiple purposes:

1. Detect duplicates.
2. Speed up strain-based queries.
3. Provide a marker for a single primary index column (akin to a pandas DataFrame index column).

Also adds DuplicateError, a new exception class to expose duplicates for custom error messages.

This will be used in a future commit by a new implementation of augur filter.
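As a rough illustration of how a unique index can both speed up lookups and surface duplicates, here is a minimal sketch using Python's standard `sqlite3` module. The helper name `create_primary_index` and the shape of this `DuplicateError` are assumptions made for the example; they are not taken from the augur code.

```python
import sqlite3


class DuplicateError(Exception):
    """Hypothetical stand-in: carries the duplicated values for custom messages."""

    def __init__(self, duplicated_values):
        self.duplicated_values = sorted(duplicated_values)
        super().__init__(f"Duplicate values: {self.duplicated_values}")


def _quote(identifier):
    # Identifiers cannot be bound as SQL parameters, so quote them by hand.
    return '"' + identifier.replace('"', '""') + '"'


def create_primary_index(connection, table, column):
    """Create a unique index on `column`; raise DuplicateError if values repeat."""
    index = _quote(f"{table}_{column}_idx")
    try:
        connection.execute(
            f"CREATE UNIQUE INDEX {index} ON {_quote(table)} ({_quote(column)})"
        )
    except sqlite3.IntegrityError:
        # The index could not be built, so look up which values are repeated.
        rows = connection.execute(
            f"SELECT {_quote(column)} FROM {_quote(table)} "
            f"GROUP BY {_quote(column)} HAVING COUNT(*) > 1"
        ).fetchall()
        raise DuplicateError(row[0] for row in rows) from None
```

A caller can catch `DuplicateError` and format its own message from `duplicated_values`, which is the kind of use the commit message describes.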

This is unused but can come in handy when debugging a query.

…alue() This will be used in a future commit by a new implementation of augur filter.

… constants.py These variables are temporary to bridge the transition from using shared variables to a shared database. They will be removed in a future commit.

WARNING: This change won't work as-is. Broken out as a separate commit for easier review. Remove metadata input and examples in include_exclude_rules. These are not applicable in the SQLite-based implementation.

WARNING: This change won't work as-is. Broken out as a separate commit for easier review. Remove the sequence index parameter in include_exclude_rules, which is not applicable in the SQLite-based implementation. This also removes the need to check for DataFrame instances in kwargs.

This is the final change that replaces the Pandas-based implementation of augur filter with a SQLite3-based implementation.

Breaking changes in behavior (see changes under tests/):

1. `--subsample-seed` is still deterministic but differs from the previous implementation.
2. `--include*`: Previously, both exclusion and force-inclusion would be shown in console output and `--output-log`. Now, only the force-inclusion is reported.

Implementation differences with no functional changes:

1. Tabular files are loaded into tables in a temporary SQLite3 database file on disk rather than Pandas DataFrames. This generally means less memory usage and more disk usage. Tables are indexed on strain.
2. Chunked loading of metadata was introduced to avoid high memory usage¹. That is no longer necessary, so all operations now run on the entire metadata (except for `--query`/`--query-pandas`).
3. For large datasets, the SQLite3 implementation is much faster than the Pandas implementation.
4. Instead of relying on continually updated variables (e.g. `valid_strains`), new tables in the database are created at various stages in the process. The "filter reason table" is used as a source of truth for all outputs (and is conveniently a ~direct representation of `--output-log`). This also allows the function `augur.filter._run.run()` to be further broken into smaller parts.
5. Exclusion/inclusion is done using WHERE expressions.
6. For subsampling, priority queues are no longer necessary, as the highest priority strains can be determined using an ORDER BY across all strains.
7. Date parsing has improved with caching and a min/max approach to resolving date ranges.

Note that sequence I/O remains unchanged.

¹ 87ca73c
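To make the "filter reason table", WHERE-based exclusion, and ORDER BY subsampling ideas more concrete, here is a small sketch with Python's `sqlite3` module. The table names (`metadata`, `filter_reason`, `subsampled`), the column names, and the reason labels are all invented for the example and should not be read as the schema used in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (strain TEXT PRIMARY KEY, region TEXT, priority REAL);
    INSERT INTO metadata VALUES
        ('A', 'Asia', 0.9), ('B', 'Asia', 0.1), ('C', 'Europe', 0.5),
        ('D', 'Africa', 0.7), ('E', 'Africa', 0.3);

    -- Hypothetical "filter reason" table: one row per strain, updated by each step.
    CREATE TABLE filter_reason (
        strain TEXT PRIMARY KEY,
        excluded INTEGER NOT NULL DEFAULT 0,
        reason TEXT
    );
    INSERT INTO filter_reason (strain) SELECT strain FROM metadata;
""")

# Exclusion expressed as a WHERE expression instead of a Python-side set update.
conn.execute(
    """
    UPDATE filter_reason SET excluded = 1, reason = 'filter_by_region'
    WHERE strain IN (SELECT strain FROM metadata WHERE region = ?)
    """,
    ("Europe",),
)

# Subsampling without a priority queue: order all remaining strains and keep the top N.
top = conn.execute(
    """
    SELECT m.strain FROM metadata AS m
    JOIN filter_reason AS f ON f.strain = m.strain
    WHERE f.excluded = 0
    ORDER BY m.priority DESC
    LIMIT ?
    """,
    (2,),
).fetchall()

# Store the intermediate result as its own table, then mark everything else.
conn.execute("CREATE TEMP TABLE subsampled (strain TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO subsampled VALUES (?)", top)
conn.execute(
    """
    UPDATE filter_reason SET excluded = 1, reason = 'subsampling'
    WHERE excluded = 0 AND strain NOT IN (SELECT strain FROM subsampled)
    """
)

# Both outputs are read from the same source-of-truth table.
output_log = conn.execute(
    "SELECT strain, reason FROM filter_reason WHERE excluded = 1 ORDER BY strain"
).fetchall()
kept = [row[0] for row in conn.execute(
    "SELECT strain FROM filter_reason WHERE excluded = 0 ORDER BY strain"
)]
```

Because every step only writes to one table, the log of excluded strains and the list of kept strains cannot drift apart, which is the appeal of the design described above.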

The re-implementation of augur filter accounts for this directly in filter_by_ambiguous_date(). Remove old tests and add equivalent tests. Note that some date strings have been changed since the previous values would fail date format validation.
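The caching and min/max approach to date ranges mentioned for the rewrite can be pictured roughly as follows. This is only a sketch under the assumption that ambiguous dates look like `YYYY-MM-XX` or `YYYY-XX-XX` with a fully specified year; `date_bounds` is a made-up name, not a function from augur.

```python
import calendar
import datetime
import functools


@functools.lru_cache(maxsize=None)
def date_bounds(date_string):
    """Return (earliest, latest) possible dates for a string like '2020-03-XX'."""
    year_part, month_part, day_part = date_string.split("-")
    year = int(year_part)  # Assumption: the year is never ambiguous here.

    if month_part == "XX":
        min_month, max_month = 1, 12
    else:
        min_month = max_month = int(month_part)

    if day_part == "XX":
        min_day = 1
        # Last day of the latest possible month, accounting for leap years.
        max_day = calendar.monthrange(year, max_month)[1]
    else:
        min_day = max_day = int(day_part)

    return (
        datetime.date(year, min_month, min_day),
        datetime.date(year, max_month, max_day),
    )
```

Metadata often repeats the same date string many times, so the `lru_cache` decorator means each distinct string is parsed only once.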

Now that disk space usage is uncapped (compared to the previous pandas implementation using in-memory chunks), it can be useful to know how large the database file was.

However, I'll have to think more about how useful this is once database files are passed in by the user. Ideas:

- Mark this as an experimental feature for `augur filter`, to be changed or removed with any version.
- Add it to the `augur db` interface, e.g. output of `augur db inspect`. However, it can still be useful to know the sizes of "intermediate" tables.

It'd also be useful to add runtime information here. Ideas:

- Print run times of each "major" function in real-time. This can probably be achieved by some sort of decorator function.
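One possible shape for the size and runtime reporting floated above, sketched with the standard library only. The names `report_runtime`, `database_size_bytes`, and `load_rows` are placeholders for illustration, and the size is estimated from SQLite's page count rather than taken from any existing augur function.

```python
import functools
import sqlite3
import sys
import time


def report_runtime(func):
    """Print how long the wrapped function took to stderr as soon as it returns."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.3f}s", file=sys.stderr)

    return wrapper


def database_size_bytes(connection):
    """Estimate the main database size as page count times page size."""
    page_count = connection.execute("PRAGMA page_count").fetchone()[0]
    page_size = connection.execute("PRAGMA page_size").fetchone()[0]
    return page_count * page_size


@report_runtime
def load_rows(connection, count):
    """Toy 'major' step: load `count` placeholder strains into a table."""
    connection.execute("CREATE TABLE metadata (strain TEXT)")
    connection.executemany(
        "INSERT INTO metadata VALUES (?)", ((f"strain_{i}",) for i in range(count))
    )
    connection.commit()
    return count
```

The `try`/`finally` in the wrapper means the timing line is still printed when a step raises, which helps when debugging a slow failure.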

This adds a new flag to query the SQLite database natively. `--query`/`--query-pandas` will still behave as expected. All Pandas-based query functions are renamed to be Pandas-specific. To avoid breaking changes, alias `--query` to `--query-pandas`.
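The aliasing described here can be sketched with `argparse`, where several option strings share one destination. The name `--query-sqlite` below is only a placeholder, since the commit message does not spell out the new flag, and `strains_matching` is likewise a hypothetical helper.

```python
import argparse
import sqlite3


def build_parser():
    parser = argparse.ArgumentParser(prog="filter-sketch")
    # Both spellings write to the same destination, so --query keeps working.
    parser.add_argument(
        "--query-pandas", "--query", dest="query_pandas",
        help="Filter using pandas query syntax.",
    )
    # Placeholder flag name for the native SQLite query; an assumption, not from the PR.
    parser.add_argument(
        "--query-sqlite", dest="query_sqlite",
        help="Filter using a SQLite WHERE expression.",
    )
    return parser


def strains_matching(connection, where_expression):
    """Return strains for which the user-supplied WHERE expression holds.

    The expression is interpolated as SQL, so this sketch assumes it comes
    from the person running the command, not from untrusted input.
    """
    rows = connection.execute(
        f"SELECT strain FROM metadata WHERE {where_expression} ORDER BY strain"
    )
    return [row[0] for row in rows]
```

Note the difference in syntax a user would see: a pandas query compares with `==`, whereas a SQLite WHERE expression conventionally uses a single `=`.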

Commits on Sep 6, 2024
