filter: Rewrite using SQLite3 #1242
base: master
Commits on Sep 6, 2024
-
io: Add class interfaces for working with SQLite3
This will be used in future commits by a new implementation of augur filter.
Commit 068a216
-
io: Split Metadata class into TabularFile and File
These will be used in future commits by a new implementation of augur filter.
Commit 28f49a9
-
Allow custom delimiter, header, columns to TabularFile
To be used in future commits.
Commit acc4913
-
io/sqlite3: Add methods to manage primary indexes
This serves multiple purposes:
1. Detect duplicates.
2. Speed up strain-based queries.
3. Provide a marker for a single primary index column (akin to a pandas DataFrame index column).

Also adds DuplicateError, a new exception class to expose duplicates for custom error messages.

This will be used in a future commit by a new implementation of augur filter.
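A minimal sketch of how a primary index can both detect duplicates and speed up lookups, using only Python's standard `sqlite3` module. The `DuplicateError` class defined here is a stand-in, and `add_primary_index`, the `metadata` table, and the `strain` column are hypothetical names for illustration; this is not the code added in the commit.

```python
import sqlite3


class DuplicateError(Exception):
    """Stand-in for the exception named in the commit message (hypothetical definition)."""

    def __init__(self, duplicates):
        self.duplicates = duplicates
        super().__init__(f"Duplicate values found: {', '.join(duplicates)}")


def add_primary_index(con, table, column):
    """Create a unique index on `column`, reporting duplicates first if any exist."""
    # Identifiers cannot be bound as parameters, so they are quoted inline.
    # This assumes the table and column names come from trusted code.
    dup_rows = con.execute(
        f'SELECT "{column}" FROM "{table}" '
        f'GROUP BY "{column}" HAVING COUNT(*) > 1 ORDER BY "{column}"'
    ).fetchall()
    if dup_rows:
        raise DuplicateError([str(row[0]) for row in dup_rows])
    # The unique index speeds up lookups by this column and rejects later duplicates.
    con.execute(f'CREATE UNIQUE INDEX "{table}_{column}_idx" ON "{table}" ("{column}")')


if __name__ == "__main__":
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE metadata (strain TEXT, region TEXT)")
    con.executemany(
        "INSERT INTO metadata VALUES (?, ?)",
        [("A", "Asia"), ("B", "Europe"), ("A", "Africa")],
    )
    try:
        add_primary_index(con, "metadata", "strain")
    except DuplicateError as error:
        print(error)
```

Checking for duplicates before creating the index lets the caller build a custom error message from the list of offending values, rather than relying on the text of a database constraint error.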
Commit e5348e8
-
io/sqlite3: Add debugging function
This is unused but can come in handy when debugging a query.
Commit 0c71334
-
dates: Allow ambiguity resolution method in get_numerical_date_from_value()
This will be used in a future commit by a new implementation of augur filter.
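As an illustration of what an ambiguity resolution method can mean for dates, the sketch below resolves a date string with `XX` placeholders to either the earliest or the latest date it could represent. The function name `resolve_ambiguous_date`, the `how` parameter, and the `XX` placeholder format are assumptions for this example; they are not the signature of `get_numerical_date_from_value()`.

```python
import calendar
import datetime


def resolve_ambiguous_date(date_string, how):
    """Resolve 'YYYY-MM-DD' with optional 'XX' month/day to a concrete date.

    `how` is 'min' for the earliest possible date or 'max' for the latest.
    Assumes the year is always fully specified.
    """
    if how not in ("min", "max"):
        raise ValueError(f"Unknown resolution method: {how!r}")

    year_part, month_part, day_part = date_string.split("-")
    year = int(year_part)

    if month_part == "XX":
        month = 1 if how == "min" else 12
    else:
        month = int(month_part)

    if day_part == "XX":
        # monthrange() returns (weekday of first day, number of days in month).
        day = 1 if how == "min" else calendar.monthrange(year, month)[1]
    else:
        day = int(day_part)

    return datetime.date(year, month, day)


if __name__ == "__main__":
    print(resolve_ambiguous_date("2020-02-XX", "min"))  # 2020-02-01
    print(resolve_ambiguous_date("2020-02-XX", "max"))  # 2020-02-29
```

Resolving to both bounds gives a date range, so a filter such as a minimum date can be checked against the upper bound and a maximum date against the lower bound.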
Commit 22b8054
-
filter/run: Split into smaller functions by using shared variables in constants.py
These variables are temporary to bridge the transition from using shared variables to a shared database. They will be removed in a future commit.
Commit 9f08218
-
(1/3) filter: Rewrite using SQLite3
WARNING: This change won't work as-is. Broken out as a separate commit for easier review.

Remove metadata input and examples in include_exclude_rules. These are not applicable in the SQLite-based implementation.
Commit 2e46b3a
-
🚧 (2/3) filter: Rewrite using SQLite3
WARNING: This change won't work as-is. Broken out as a separate commit for easier review.

Remove the sequence index parameter in include_exclude_rules. It is not applicable in the SQLite-based implementation. This also removes the need to check for DataFrame instances in kwargs.
Commit 237273b
-
🚧 (3/3) filter: Rewrite using SQLite3
This is the final change that replaces the Pandas-based implementation of augur filter with a SQLite3-based implementation.

Breaking changes in behavior (see changes under tests/):
1. `--subsample-seed` is still deterministic but differs from the previous implementation.
2. `--include*`: Previously, both exclusion and force-inclusion would be shown in console output and `--output-log`. Now, only the force-inclusion is reported.

Implementation differences with no functional changes:
1. Tabular files are loaded into tables in a temporary SQLite3 database file on disk rather than Pandas DataFrames. This generally means less memory usage and more disk usage. Tables are indexed on strain.
2. Chunked loading of metadata was introduced to avoid high memory usage¹. It is no longer necessary, and all operations now run on the entire metadata (except for `--query`/`--query-pandas`).
3. For large datasets, the SQLite3 implementation is much faster than the Pandas implementation.
4. Instead of relying on continually updated variables (e.g. `valid_strains`), new tables in the database are created at various stages in the process. The "filter reason table" is used as a source of truth for all outputs (and is conveniently a ~direct representation of `--output-log`). This also allows the function `augur.filter._run.run()` to be further broken into smaller parts.
5. Exclusion/inclusion is done using WHERE expressions.
6. For subsampling, priority queues are no longer necessary, as the highest priority strains can be determined using an ORDER BY across all strains.
7. Date parsing has improved with caching and a min/max approach to resolving date ranges.

Note that sequence I/O remains unchanged.

¹ 87ca73c
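The table-based approach described in points 4 to 6 can be sketched with Python's standard `sqlite3` module. Everything named here is hypothetical and chosen for illustration: the `filter_reason` and `subsampled` tables, their columns, the reason strings, and the `run_filter` function are assumptions, not the schema or functions used by augur filter.

```python
import sqlite3


def run_filter(max_sequences):
    """Return (kept strains, exclusion log) for a small in-memory example."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE metadata (strain TEXT PRIMARY KEY, region TEXT, priority REAL);
        INSERT INTO metadata VALUES
            ('A', 'Asia', 0.9), ('B', 'Asia', 0.1), ('C', 'Europe', 0.5),
            ('D', 'Europe', 0.7), ('E', 'Africa', 0.3);

        -- One row per strain; acts as the single source of truth for outputs.
        CREATE TABLE filter_reason (
            strain TEXT PRIMARY KEY,
            excluded INTEGER NOT NULL DEFAULT 0,
            reason TEXT
        );
        INSERT INTO filter_reason (strain) SELECT strain FROM metadata;

        CREATE TABLE subsampled (strain TEXT PRIMARY KEY);
        """
    )

    # Exclusion expressed as a WHERE expression on the metadata table.
    con.execute(
        """
        UPDATE filter_reason SET excluded = 1, reason = 'exclude_where'
        WHERE strain IN (SELECT strain FROM metadata WHERE region = ?)
        """,
        ("Africa",),
    )

    # Subsampling: an ORDER BY across all remaining strains replaces a priority queue.
    con.execute(
        """
        INSERT INTO subsampled
        SELECT m.strain FROM metadata AS m
        JOIN filter_reason AS f ON f.strain = m.strain
        WHERE f.excluded = 0
        ORDER BY m.priority DESC
        LIMIT ?
        """,
        (max_sequences,),
    )
    con.execute(
        """
        UPDATE filter_reason SET excluded = 1, reason = 'subsampling'
        WHERE excluded = 0 AND strain NOT IN (SELECT strain FROM subsampled)
        """
    )

    kept = [row[0] for row in con.execute(
        "SELECT strain FROM filter_reason WHERE excluded = 0 ORDER BY strain")]
    log = con.execute(
        "SELECT strain, reason FROM filter_reason WHERE excluded = 1 ORDER BY strain"
    ).fetchall()
    con.close()
    return kept, log


if __name__ == "__main__":
    kept, log = run_filter(max_sequences=2)
    print(kept)  # ['A', 'D']
    print(log)   # [('B', 'subsampling'), ('C', 'subsampling'), ('E', 'exclude_where')]
```

Because every stage writes its decision into one table, the kept strains and the exclusion log are both simple queries on that table rather than the result of variables updated along the way.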
Commit fdf131b
-
dates: Remove unused is_date_ambiguous()
The re-implementation of augur filter accounts for this directly in filter_by_ambiguous_date(). Remove old tests and add equivalent tests. Note that some date strings have been changed since the previous values would fail date format validation.
Commit deb00d6
-
🚧 filter: Add a --debug option
Now that disk space usage is uncapped (compared to the previous Pandas implementation using in-memory chunks), it can be useful to know how large the database file was.

However, I'll have to think more about how useful this is once database files are passed in by the user. Ideas:
- Mark this as an experimental feature for `augur filter`, to be changed or removed with any version.
- Add it to the `augur db` interface, e.g. output of `augur db inspect`. However, it can still be useful to know the sizes of "intermediate" tables.

It'd also be useful to add runtime information here. Ideas:
- Print run times of each "major" function in real-time. This can probably be achieved by some sort of decorator function.
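One possible shape for the two ideas above, sketched with the standard library only: a decorator that prints the run time of a function as soon as it finishes, and a helper that reports the size of a database file on disk. The names `timed`, `database_size_bytes`, and `load_metadata` are hypothetical and not part of augur.

```python
import functools
import os
import sqlite3
import sys
import tempfile
import time


def timed(func):
    """Print the wall-clock run time of `func` to stderr after each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"[debug] {func.__name__} took {elapsed:.3f}s", file=sys.stderr)
    return wrapper


def database_size_bytes(path):
    """Return the size of the database file on disk, in bytes."""
    return os.path.getsize(path)


@timed
def load_metadata(path, rows):
    """Hypothetical 'major' function: write rows into a metadata table."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE metadata (strain TEXT PRIMARY KEY, region TEXT)")
    con.executemany("INSERT INTO metadata VALUES (?, ?)", rows)
    con.commit()
    con.close()
    return len(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        db_path = os.path.join(directory, "filter.sqlite3")
        load_metadata(db_path, [("A", "Asia"), ("B", "Europe")])
        print(f"[debug] database size: {database_size_bytes(db_path)} bytes", file=sys.stderr)
```

Using `try`/`finally` in the wrapper means the run time is still printed when the wrapped function raises, which is often when debugging output matters most.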
Commit 0cd748e
-
Commit 260343d
-
🚧 filter: Add --query-sqlite option
This adds a new flag to query the SQLite database natively. `--query`/`--query-pandas` will still behave as expected. All Pandas-based query functions are renamed to be Pandas-specific. To avoid breaking changes, alias `--query` to `--query-pandas`.
Commit a0a8e81
-
Commit 99c5a0d