filter: Rewrite using SQLite3 #1242

Draft
wants to merge 15 commits into
base: master

Commits on Sep 6, 2024

  1. io: Add class interfaces for working with SQLite3

    This will be used in future commits by a new implementation of augur
    filter.
    victorlin committed Sep 6, 2024
    068a216
  2. io: Split Metadata class into TabularFile and File

    These will be used in future commits by a new implementation of augur
    filter.
    victorlin committed Sep 6, 2024
    28f49a9
  3. Allow custom delimiter, header, columns to TabularFile

    To be used in future commits.
    victorlin committed Sep 6, 2024
    acc4913
  4. io/sqlite3: Add methods to manage primary indexes

    This serves multiple purposes:
    
    1. Detect duplicates.
    2. Speed up strain-based queries.
    3. Provide a marker for a single primary index column (akin to a pandas
       DataFrame index column).
    
    Also adds DuplicateError, a new exception class to expose duplicates for
    custom error messages.
    
    This will be used in a future commit by a new implementation of augur
    filter.
    victorlin committed Sep 6, 2024
    e5348e8
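
The duplicate detection described in commit 4 can be sketched with a unique index. This is a minimal illustration, not the code in this PR: `DuplicateError` is named in the commit message, but its constructor, the `create_primary_index` helper, and the table and column names are assumptions made for the example.

```python
import sqlite3


class DuplicateError(Exception):
    """Raised when the candidate primary index column has repeated values."""

    def __init__(self, duplicated_values):
        self.duplicated_values = duplicated_values
        super().__init__(f"Duplicate values: {sorted(duplicated_values)}")


def create_primary_index(connection, table, column):
    """Create a unique index on `column`; raise DuplicateError if values repeat.

    Hypothetical helper. Identifiers are interpolated directly into SQL, so
    only trusted table and column names should be passed.
    """
    try:
        connection.execute(
            f'CREATE UNIQUE INDEX "{table}_primary" ON "{table}" ("{column}")'
        )
    except sqlite3.IntegrityError:
        # The index could not be built, so look up which values repeat.
        rows = connection.execute(
            f'SELECT "{column}" FROM "{table}" '
            f'GROUP BY "{column}" HAVING COUNT(*) > 1'
        )
        raise DuplicateError({row[0] for row in rows}) from None


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE metadata (strain TEXT, country TEXT)")
connection.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [("A", "USA"), ("B", "Peru"), ("A", "Chile")],
)

try:
    create_primary_index(connection, "metadata", "strain")
except DuplicateError as error:
    duplicates = error.duplicated_values
```

Carrying the duplicated values on the exception lets the caller build its own error message, which is the use the commit message describes.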
  5. io/sqlite3: Add debugging function

    This is unused but can come in handy when debugging a query.
    victorlin committed Sep 6, 2024
    0c71334
  6. dates: Allow ambiguity resolution method in get_numerical_date_from_value()

    This will be used in a future commit by a new implementation of augur
    filter.
    victorlin committed Sep 6, 2024
    22b8054
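
One way to picture an "ambiguity resolution method" for dates is a helper that maps an ambiguous date to the earliest or latest date it could represent. The sketch below is illustrative only: `resolve_ambiguous_date`, its `method` argument, and the `XX` placeholder handling are assumptions for the example and do not reproduce the signature of `get_numerical_date_from_value()`.

```python
import calendar
import datetime


def resolve_ambiguous_date(value, method):
    """Resolve a YYYY-MM-DD string with 'XX' month/day to a concrete date.

    Hypothetical helper, not augur's API. `method` is "min" or "max".
    Only whole-field ambiguity ("2020-XX-XX", "2020-03-XX") is handled.
    """
    if method not in ("min", "max"):
        raise ValueError(f"Unknown method: {method!r}")

    year_str, month_str, day_str = value.split("-")
    year = int(year_str)

    if month_str == "XX":
        month = 1 if method == "min" else 12
    else:
        month = int(month_str)

    if day_str == "XX":
        # monthrange() returns (weekday of first day, number of days in month).
        last_day = calendar.monthrange(year, month)[1]
        day = 1 if method == "min" else last_day
    else:
        day = int(day_str)

    return datetime.date(year, month, day)


earliest = resolve_ambiguous_date("2020-02-XX", "min")
latest = resolve_ambiguous_date("2020-02-XX", "max")
```

Resolving both ends gives a date range, so a minimum-date filter can compare against the latest possible date and a maximum-date filter against the earliest.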
  7. filter/run: Split into smaller functions by using shared variables in constants.py

    These variables are temporary to bridge the transition from using shared
    variables to a shared database. They will be removed in a future commit.
    victorlin committed Sep 6, 2024
    9f08218
  8. (1/3) filter: Rewrite using SQLite3

    WARNING: This change won't work as-is. Broken out as a separate commit
    for easier review.
    
    Remove metadata input and examples in include_exclude_rules. These are
    not applicable in the SQLite-based implementation.
    victorlin committed Sep 6, 2024
    2e46b3a
  9. 🚧 (2/3) filter: Rewrite using SQLite3

    WARNING: This change won't work as-is. Broken out as a separate commit
    for easier review.
    
    Remove the sequence index parameter in include_exclude_rules. It is not
    applicable in the SQLite-based implementation. This also removes the
    need to check for DataFrame instances in kwargs.
    victorlin committed Sep 6, 2024
    237273b
  10. 🚧 (3/3) filter: Rewrite using SQLite3

    This is the final change that replaces the Pandas-based implementation
    of augur filter with a SQLite3-based implementation.
    
    Breaking changes in behavior (see changes under tests/):
    
    1. `--subsample-seed` is still deterministic but differs from the
       previous implementation.
    2. `--include*`: Previously, both exclusion and force-inclusion would be
       shown in console output and `--output-log`. Now, only the
       force-inclusion is reported.
    
    Implementation differences with no functional changes:
    
    1. Tabular files are loaded into tables in a temporary SQLite3 database
       file on disk rather than Pandas DataFrames. This generally means less
       memory usage and more disk usage. Tables are indexed on strain.
    2. Chunked loading of metadata was introduced to avoid high memory
       usage¹. It is no longer necessary, so all operations now run on the
       entire metadata (except for `--query`/`--query-pandas`).
    3. For large datasets, the SQLite3 implementation is much faster than
       the Pandas implementation.
    4. Instead of relying on continually updated variables (e.g.
       `valid_strains`), new tables in the database are created at various
       stages in the process. The "filter reason table" is used as a source
       of truth for all outputs (and is conveniently a ~direct
       representation of `--output-log`). This also allows the function
       `augur.filter._run.run()` to be further broken into smaller parts.
    5. Exclusion/inclusion is done using WHERE expressions.
    6. For subsampling, priority queues are no longer necessary, as the
       highest-priority strains can be determined using an ORDER BY across
       all strains.
    7. Date parsing has improved with caching and a min/max approach to
       resolving date ranges.
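
Items 4 to 6 above can be sketched as follows: a "filter reason" table records why each strain was dropped, exclusion is a WHERE expression, and subsampling is an ORDER BY with a LIMIT. All table names, column names, and reason labels here are assumptions chosen for the example, not the schema used in this PR.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE metadata (strain TEXT PRIMARY KEY, country TEXT, priority REAL)"
)
connection.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [("A", "USA", 0.9), ("B", "Peru", 0.2), ("C", "USA", 0.5), ("D", "USA", 0.7)],
)

# Hypothetical filter reason table: one row per strain, NULL means "still kept".
connection.execute(
    "CREATE TABLE filter_reason (strain TEXT PRIMARY KEY, excluded_by TEXT)"
)
connection.execute("INSERT INTO filter_reason (strain) SELECT strain FROM metadata")

# Exclusion expressed as a WHERE expression over the metadata table.
connection.execute("""
    UPDATE filter_reason
    SET excluded_by = 'exclude_where'
    WHERE strain IN (SELECT strain FROM metadata WHERE country = 'Peru')
""")

# Subsampling: materialize the 2 highest-priority remaining strains in a new
# table using ORDER BY + LIMIT, with no priority queue involved.
connection.execute("""
    CREATE TABLE subsampled AS
    SELECT m.strain AS strain
    FROM metadata AS m
    JOIN filter_reason AS f ON f.strain = m.strain
    WHERE f.excluded_by IS NULL
    ORDER BY m.priority DESC
    LIMIT 2
""")
connection.execute("""
    UPDATE filter_reason
    SET excluded_by = 'subsampling'
    WHERE excluded_by IS NULL
      AND strain NOT IN (SELECT strain FROM subsampled)
""")

# Both outputs are derived from the same table.
kept = [row[0] for row in connection.execute(
    "SELECT strain FROM filter_reason WHERE excluded_by IS NULL ORDER BY strain"
)]
log = dict(connection.execute(
    "SELECT strain, excluded_by FROM filter_reason WHERE excluded_by IS NOT NULL"
))
```

Because every output is read back from `filter_reason`, the list of kept strains and the log of dropped strains cannot drift apart, which is the appeal of a single source of truth.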
    
    Note that sequence I/O remains unchanged.
    
    ¹ 87ca73c
    victorlin committed Sep 6, 2024
    fdf131b
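
The loading step in commit 10 (tabular files go into tables in a temporary on-disk database, indexed on strain) might look roughly like this. `load_tsv`, the file names, and the all-TEXT column typing are assumptions for illustration; the PR's own classes (`TabularFile` and the SQLite3 interfaces) are not reproduced here.

```python
import csv
import os
import sqlite3
import tempfile


def load_tsv(connection, path, table, id_column="strain"):
    """Load a TSV file into `table` as TEXT columns and index it on `id_column`.

    Hypothetical helper. Names are interpolated into SQL, so only trusted
    input should be passed.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)
        connection.execute(f'CREATE TABLE "{table}" ({columns})')
        # Rows are streamed from the file rather than held in memory.
        connection.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', reader
        )
    connection.execute(
        f'CREATE UNIQUE INDEX "{table}_{id_column}" ON "{table}" ("{id_column}")'
    )
    connection.commit()


with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = os.path.join(tmp_dir, "metadata.tsv")
    with open(metadata_path, "w", encoding="utf-8") as handle:
        handle.write("strain\tdate\nA\t2020-01-01\nB\t2020-02-XX\n")

    # The database is a file on disk, trading memory usage for disk usage.
    database_path = os.path.join(tmp_dir, "filter.sqlite3")
    connection = sqlite3.connect(database_path)
    load_tsv(connection, metadata_path, "metadata")

    row_count = connection.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
    date_of_b = connection.execute(
        'SELECT "date" FROM metadata WHERE strain = ?', ("B",)
    ).fetchone()[0]
    connection.close()
```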
  11. dates: Remove unused is_date_ambiguous()

    The re-implementation of augur filter accounts for this directly in
    filter_by_ambiguous_date().
    
    Remove old tests and add equivalent tests. Note that some date strings
    have been changed since the previous values would fail date format
    validation.
    victorlin committed Sep 6, 2024
    deb00d6
  12. 🚧 filter: Add a --debug option

    Now that disk space usage is uncapped (compared to the previous Pandas
    implementation using in-memory chunks), it can be useful to know how
    large the database file was.
    
    However, I'll have to think more about how useful this is once database
    files are passed in by the user. Ideas:
    
    - Mark this as an experimental feature for `augur filter`, to be changed
      or removed in any version.
    - Add it to the `augur db` interface, e.g. output of `augur db inspect`.
      However, it can still be useful to know the sizes of "intermediate"
      tables.
    
    It'd also be useful to add runtime information here. Ideas:
    
    - Print run times of each "major" function in real-time. This can
      probably be achieved by some sort of decorator function.
    victorlin committed Sep 6, 2024
    0cd748e
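
The decorator idea floated in commit 12 for printing run times could be sketched like this. `print_runtime` and `count_strains` are hypothetical names, and writing to stderr is an assumption; nothing here is taken from the PR's `--debug` implementation.

```python
import functools
import sys
import time


def print_runtime(function):
    """Print how long each call to `function` takes (hypothetical debug helper)."""

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return function(*args, **kwargs)
        finally:
            # Runs even if the wrapped function raises, so failures are timed too.
            elapsed = time.perf_counter() - start
            print(f"{function.__name__} took {elapsed:.3f}s", file=sys.stderr)

    return wrapper


@print_runtime
def count_strains(strains):
    """Stand-in for a "major" function whose run time is worth reporting."""
    return len(set(strains))


result = count_strains(["A", "B", "A"])
```

Since the message is printed as soon as each call returns, decorating the major functions yields a real-time trace without changing their return values.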
  13. 260343d
  14. 🚧 filter: Add --query-sqlite option

    This adds a new flag to query the SQLite database natively.
    `--query`/`--query-pandas` will still behave as expected.
    
    All Pandas-based query functions are renamed to be Pandas-specific.
    
    To avoid breaking changes, alias `--query` to `--query-pandas`.
    victorlin committed Sep 6, 2024
    a0a8e81
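
Aliasing `--query` to `--query-pandas` alongside a separate `--query-sqlite` flag can be expressed in `argparse` by giving one option two spellings. This is a sketch under assumptions: the `dest` names, help text, and the choice to make the two query styles mutually exclusive are illustrative and may not match the PR.

```python
import argparse

parser = argparse.ArgumentParser(prog="filter-sketch")
query_group = parser.add_mutually_exclusive_group()

# `--query` and `--query-pandas` are two spellings of the same option, so
# existing invocations using `--query` keep working.
query_group.add_argument(
    "--query", "--query-pandas",
    dest="query_pandas",
    help="Filter using a Pandas query expression.",
)
query_group.add_argument(
    "--query-sqlite",
    dest="query_sqlite",
    help="Filter using a SQLite WHERE expression.",
)

old_style = parser.parse_args(["--query", "country == 'USA'"])
new_style = parser.parse_args(["--query-sqlite", "country = 'USA'"])
```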
  15. 99c5a0d