Support databases #866
Comments
Thank you for writing this up, @victorlin! For the CLI, we should consider continuing to support separate
This kind of interface could be extended to support databases (and TSVs, etc.) like so:
Obviously the issue is complicated if we expect sequences and metadata to be in the same database to maintain internal consistency. The interface above could become frustrating, confusing, and/or inconsistent if users had to specify database paths multiple times like:
Whatever interface we choose, we should consider how/whether to abstract the type of input data from how it is being stored. For example, @victorlin, @tsibley, and I discussed this a bit previously, but we should also consider what internal interface we want for We currently pass around pandas DataFrames internally, which strongly couples our internal use of metadata to how they are stored. It is easy to imagine changing all of this code to work with database query sets instead, such that all of our code is strongly coupled to the database implementation. We might consider using factories that implement methods to generate
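To make the factory idea above concrete, here is a minimal sketch using only the standard library. Every name in it (`MetadataSource`, `TsvMetadataSource`, `SqliteMetadataSource`, `open_metadata`) is hypothetical and does not exist in Augur; the `metadata` table name and the extension-based dispatch are also assumptions. The point is only that callers iterate over plain records and never see whether they came from a TSV or a database.

```python
import csv
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class MetadataSource(ABC):
    """Hypothetical interface: callers see records, not the storage format."""

    @abstractmethod
    def records(self) -> Iterator[Dict[str, str]]:
        """Yield one dict per metadata row, keyed by column name."""


class TsvMetadataSource(MetadataSource):
    def __init__(self, path: str):
        self.path = path

    def records(self) -> Iterator[Dict[str, str]]:
        with open(self.path, newline="") as handle:
            yield from csv.DictReader(handle, delimiter="\t")


class SqliteMetadataSource(MetadataSource):
    def __init__(self, path: str, table: str = "metadata"):
        self.path = path
        self.table = table  # assumed table name, not an Augur convention

    def records(self) -> Iterator[Dict[str, str]]:
        connection = sqlite3.connect(self.path)
        connection.row_factory = sqlite3.Row  # rows become name-addressable
        try:
            for row in connection.execute(f'SELECT * FROM "{self.table}"'):
                yield dict(row)
        finally:
            connection.close()


def open_metadata(path: str) -> MetadataSource:
    """Factory: pick a backend from the file extension (an assumption)."""
    if path.endswith((".db", ".sqlite", ".sqlite3")):
        return SqliteMetadataSource(path)
    return TsvMetadataSource(path)
```

Code written against `MetadataSource.records()` would not need to change if the storage backend did, which is the decoupling the comment is asking for.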
Update: after experimenting with loading sequence data into the database (06e5698), it turns out to be unnecessary overhead for |
I like the idea of expanding This sequence of features might work:
The work in #854 replaces the internal logic in augur/augur/filter_support/db/sqlite.py Lines 18 to 25 in df83e1b
With this approach, the record ID column (e.g. Lines 1400 to 1404 in b9e0c79
I'm not familiar with other I also thought about dataclasses/sqlalchemy in #854 but decided not to pursue the additional complexity. Would be happy to reconsider during review or as a next step. This would also help in supporting non-SQLite databases, since #854 is using raw SQL queries/commands with SQLite-specific syntax. |
A couple thoughts, re-reading thru this now. I generally concur with most things above. +1 for supporting remote URLs ( +1 for a first step of detecting local SQLite files given to |
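One way the "detect a local SQLite file" step could work is to check the file header rather than the extension. This is a sketch, not Augur code: `is_sqlite_file` is a hypothetical helper. It relies on the documented fact that a SQLite 3 database file begins with the 16-byte string `SQLite format 3` followed by a NUL byte.

```python
SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of a SQLite 3 database file


def is_sqlite_file(path: str) -> bool:
    """Return True if the file at `path` starts with the SQLite 3 header.

    Hypothetical helper. A zero-byte file (a database that was opened but
    never written to) is reported as not SQLite.
    """
    with open(path, "rb") as handle:
        return handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
```

A command could call this on the path it receives and fall back to the existing TSV/CSV reader when it returns False.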
I just wrote a comment summarizing the current state of things on Thinking more broadly, it might not be a bad idea to begin work on the database loading command in parallel to refining #854. This allows people to get a feel for how SQL queries work on metadata/sequences. This could look something like:
augur db load \
  --metadata metadata.tsv \
  --sequences sequences.fasta \
  --to-sqlite3 data.db
sqlite3 data.db "SELECT * FROM metadata WHERE region == 'europe'" > metadata-subset.tsv
EDIT: Started on this in #1094.
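For readers who want to try the load-then-query workflow before such a command exists, the metadata half can be approximated with the Python standard library. This is only a sketch under stated assumptions: `load_metadata` and `query` are hypothetical helpers, every column is stored as TEXT, the TSV is assumed to be well formed (no blank lines, same number of fields per row), and sequences are not handled.

```python
import csv
import sqlite3


def load_metadata(tsv_path, db_path, table="metadata"):
    """Copy a TSV file into a SQLite table, one TEXT column per header field."""
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)
        connection = sqlite3.connect(db_path)
        try:
            with connection:  # commit on success, roll back on error
                connection.execute(f'DROP TABLE IF EXISTS "{table}"')
                connection.execute(f'CREATE TABLE "{table}" ({columns})')
                connection.executemany(
                    f'INSERT INTO "{table}" VALUES ({placeholders})', reader
                )
        finally:
            connection.close()


def query(db_path, sql, parameters=()):
    """Run a read-only query and return all rows as tuples."""
    connection = sqlite3.connect(db_path)
    try:
        return connection.execute(sql, parameters).fetchall()
    finally:
        connection.close()
```

After `load_metadata("metadata.tsv", "data.db")`, a call such as `query("data.db", "SELECT * FROM metadata WHERE region = ?", ("europe",))` returns the same subset as the `sqlite3` command line query above.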
@tsibley and I just discussed this in our 1:1 chat. Next steps:
Context
The increasing amount of data has shed light on limitations of the current file-based approach (#789). There have been solutions implemented, mostly in augur filter, to improve memory usage and optimize IO speed. There have also been some discussions of using a database to better handle large datasets, but no clear solution yet. More info on this Nextstrain team google doc.
Assessment
Comparing a database approach against the current usage of pandas and Bio.SeqIO:
metadata.tsv and sequences.fasta.
.tsv and .fasta files (no regression in current interface).
--engine #854 requires data to be loaded into tables with certain names.
Proposed iterative solution
1. Use a database file within augur filter (#854)
augur filter.
2. Support populating/loading an existing database file
Desired usage:
augur filter --metadata to accept a SQLite database file. If present, it will skip the metadata/sequence loading step.
augur index for sequences.
3. Use sqlite3 in other Augur commands that read tabular data
augur filter.
4. Pass database files between Augur commands (?)
far out, likely to change
--output-data to augur filter.
--data and/or --output-data to other Augur commands.
5. Support remote databases (?)
far out, likely to change
augur filter --database "postgresql://user:pass@some.postgresql.endpoint:5432/database" ...
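If a single option ever had to accept both local files and remote URLs like the one above, the value could be classified with `urllib.parse` before choosing a backend. The sketch below is illustrative only: `describe_database` is a hypothetical helper, the `--database` option itself is just a proposal in this issue, and treating anything without a `postgresql`/`postgres` scheme as a local SQLite path is an assumption.

```python
from urllib.parse import urlsplit


def describe_database(target: str) -> dict:
    """Classify a database target as a remote PostgreSQL URL or a local path."""
    parts = urlsplit(target)
    if parts.scheme in ("postgresql", "postgres"):
        return {
            "backend": "postgresql",
            "host": parts.hostname,
            "port": parts.port,
            "database": parts.path.lstrip("/"),
            "user": parts.username,
        }
    # Assumption: everything else is a path to a local SQLite file.
    return {"backend": "sqlite", "path": target}
```

The password is deliberately left out of the returned description so that it is not echoed in logs.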
Steps with unclear placement in solution
Relevant past and present work
--engine #854
augur db with import/export of metadata #1094