Maintain an in-memory SQLite table of connected databases and their tables #1150
Pages that need a list of all databases - the index page and /-/databases for example - could trigger a "check for new database files in the configured directories" scan. That scan would run at most once every 5 (n) seconds - if the check is triggered but the scan has run more recently than that, it doesn't run again. Hopefully this means it could be done as a blocking operation, rather than trying to run it in a thread. When it runs it scans for *.db or *.sqlite files (maybe one or two other extensions) that it hasn't seen before. It also checks that the files in the existing list of known databases still exist. If it finds any new ones it connects to them once to run …
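The throttled scan described above could be sketched roughly like this. All names here (`DirectoryScanner`, `maybe_scan`, the 5-second constant) are illustrative assumptions, not Datasette's actual implementation:

```python
import pathlib
import time

SCAN_INTERVAL = 5  # seconds; the "n" in the comment above (assumed value)
DB_EXTENSIONS = {".db", ".sqlite", ".sqlite3"}

class DirectoryScanner:
    def __init__(self, directory):
        self.directory = pathlib.Path(directory)
        self.last_scan = 0.0
        self.known = set()

    def maybe_scan(self):
        # Runs at most once every SCAN_INTERVAL seconds; otherwise it is a
        # no-op, so it should be cheap enough to call as a blocking
        # operation on each request that needs the database list.
        now = time.monotonic()
        if now - self.last_scan < SCAN_INTERVAL:
            return [], []
        self.last_scan = now
        current = {
            p for p in self.directory.iterdir()
            if p.suffix in DB_EXTENSIONS and p.is_file()
        }
        new = current - self.known
        gone = self.known - current  # known files that no longer exist
        self.known = current
        return sorted(new), sorted(gone)
```

Calling `maybe_scan()` twice in quick succession returns new files on the first call and nothing on the second, which is the throttling behaviour the comment describes.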
Open question: will this work for hundreds of database files, or is the overhead of connecting to each of 100 databases in turn to run … prohibitive?
Quick micro-benchmark, run against a folder with 46 database files adding up to 1.4GB total:

```python
import pathlib, sqlite3, time

paths = list(pathlib.Path(".").glob("*.db"))

def schema_version(path):
    db = sqlite3.connect(path)
    # fetchone() returns a one-element tuple; [0] unpacks the integer
    version = db.execute("PRAGMA schema_version").fetchone()[0]
    db.close()
    return version

def all_versions():
    return {path.name: schema_version(path) for path in paths}

start = time.time(); all_versions(); print(time.time() - start)
# 0.012346982955932617
```

So that's 12ms.
I tried against my entire … folder as well. So it looks like connecting to a SQLite database file and getting the schema version is extremely fast. Scanning directories is slower.
It's just recursion that's expensive. I created 380 empty SQLite databases in a folder and timed … So maybe I tell users that all SQLite databases have to be in the root folder.
Grabbing the schema version of 380 files in the root directory takes 70ms.
I'm going to assume that even the heaviest user will have trouble going beyond a few hundred database files, so this is fine.
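Since `PRAGMA schema_version` is this cheap, it can also serve as a change detector: record the version seen at the last introspection and only re-introspect a database when it differs. A minimal sketch, assuming a plain dict as the cache (the function and dict names are hypothetical, not Datasette's):

```python
import sqlite3

def schema_changed(path, seen_versions):
    # Compare the file's current PRAGMA schema_version against the value
    # recorded the last time we introspected it. SQLite increments
    # schema_version on every schema change (CREATE/ALTER/DROP).
    conn = sqlite3.connect(path)
    try:
        (version,) = conn.execute("PRAGMA schema_version").fetchone()
    finally:
        conn.close()
    changed = seen_versions.get(path) != version
    seen_versions[path] = version
    return changed
```

The first call for a new file always reports a change (nothing cached yet); subsequent calls only report one after DDL has run against the file.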
Next challenge: figure out how to use the …
Next step: design a schema for the in-memory database table that exposes all of the tables. I want to support things like: …
Maybe a starting point would be to build concrete tables using the results of things like datasette/datasette/utils/__init__.py lines 563 to 579 (commit 5e9895c).
```sql
select 'facetable' as 'table', * from pragma_table_xinfo('facetable')
union
select 'searchable' as 'table', * from pragma_table_xinfo('searchable')
union
select 'compound_three_primary_keys' as 'table', * from pragma_table_xinfo('compound_three_primary_keys')
```
This query uses a join to pull foreign key information for every table: https://latest.datasette.io/fixtures?sql=with+tables+as+%28%0D%0A++select%0D%0A++++name%0D%0A++from%0D%0A++++sqlite_master%0D%0A++where%0D%0A++++type+%3D+%27table%27%0D%0A%29%0D%0Aselect%0D%0A++tables.name+as+%27table%27%2C%0D%0A++foo.*%0D%0Afrom%0D%0A++tables%0D%0A++join+pragma_foreign_key_list%28tables.name%29+foo

```sql
with tables as (
  select
    name
  from
    sqlite_master
  where
    type = 'table'
)
select
  tables.name as 'table',
  foo.*
from
  tables
  join pragma_foreign_key_list(tables.name) foo
```

Same query for …
Here's a simpler query pattern (not using CTEs, so it should work on older versions of SQLite) - this one lists all indexes for all tables:

```sql
select
  sqlite_master.name as 'table',
  indexes.*
from
  sqlite_master
  join pragma_index_list(sqlite_master.name) indexes
where
  sqlite_master.type = 'table'
```
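The same join pattern works from Python for populating a catalog: run the pragma-function join against each connected database and collect the rows. A sketch using `pragma_table_xinfo` to gather (table, column) pairs - the function name `catalog_columns` is illustrative, and table-valued pragma functions require SQLite 3.16+:

```python
import sqlite3

def catalog_columns(path):
    # Apply the join pattern from the queries above to one database file,
    # returning (table_name, column_name) pairs for every table.
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            """
            select sqlite_master.name, info.name
            from sqlite_master
            join pragma_table_xinfo(sqlite_master.name) info
            where sqlite_master.type = 'table'
            """
        ).fetchall()
    finally:
        conn.close()
    return rows
```

One query per database, regardless of how many tables it contains - which is what makes the join pattern attractive compared to running a separate PRAGMA per table.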
https://www.sqlite.org/pragma.html#pragfunc says: …
I've been rediscovering the pattern I already documented in this TIL: https://github.com/simonw/til/blob/main/sqlite/list-all-columns-in-a-database.md#better-alternative-using-a-join
I'm going to use five tables to start off with: …
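The specific five tables aren't preserved in this thread, but one plausible shape for such a catalog - purely an illustrative assumption, keyed by database name so tables from different files don't collide - might look like this:

```python
import sqlite3

# Hypothetical catalog schema, NOT necessarily the five tables chosen in
# the actual implementation. Reserved-ish words are double-quoted.
CATALOG_DDL = """
create table databases (name text primary key, path text, schema_version integer);
create table tables (database_name text, name text, primary key (database_name, name));
create table columns (database_name text, table_name text, name text, type text);
create table indexes (database_name text, table_name text, name text, "unique" integer);
create table foreign_keys (database_name text, table_name text, "table" text, "from" text, "to" text);
"""

def create_catalog():
    # Build the catalog in an in-memory SQLite database.
    conn = sqlite3.connect(":memory:")
    conn.executescript(CATALOG_DDL)
    return conn
```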
SQLite uses …
Maintaining this database will be the responsibility of a subclass of …
I think I'm going to have to build this without using the …
Simpler implementation idea: a Datasette method …
I could have another table that stores the combined rows from …
I need to figure out how this will interact with Datasette permissions. If some tables are private, but others are public, should users be able to see the private tables listed in the schema metadata? If not, how can that mechanism work?
One solution on permissions: if Datasette had an efficient way of saying "list the tables that this user has access to" I could use that as a filter any time the user views the schema information. The implementation could be tricky though.
I do need to solve the permissions problem properly though, because one of the goals of this system is to provide a paginated, searchable list of databases and tables for the homepage of the instance - #991. As such, the homepage will need to be able to display only the tables and databases that the user has permission to view.
I may be overthinking that problem. Many queries are fast in SQLite. If a Datasette instance has 1,000 connected tables will even that be a performance problem for permission checks? I should benchmark to find out.
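A rough way to test that intuition: populate an in-memory table with 1,000 rows and time one indexed lookup per row. This is only a crude stand-in for real per-table permission checks (the function name and numbers are illustrative), but it gives an order-of-magnitude feel:

```python
import sqlite3
import time

def time_lookups(n=1000):
    # Populate an in-memory table with n rows, then time one indexed
    # lookup per row - a stand-in for n per-table permission checks.
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tables (name text primary key)")
    conn.executemany(
        "insert into tables values (?)", ((f"table_{i}",) for i in range(n))
    )
    start = time.perf_counter()
    for i in range(n):
        conn.execute(
            "select 1 from tables where name = ?", (f"table_{i}",)
        ).fetchone()
    return time.perf_counter() - start
```

On typical hardware this kind of loop finishes in single-digit milliseconds, which is consistent with the "many queries are fast in SQLite" hunch - though real permission checks involving plugin hooks would cost more than a bare primary-key lookup.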
The homepage currently performs a massive flurry of permission checks - one for each database, table and view: https://github.com/simonw/datasette/blob/0.53/datasette/views/index.py#L21-L75 A paginated version of this is a little daunting, as the permission checks would have to be carried out on every single table just to calculate the count that will be paginated.
I'm not going to block this issue on permissions - I will tackle the efficient bulk permissions problem in #1152.
I'm going to tidy this up and land it. A couple of additional decisions: …
I'm going to move the code into a …
Getting all the tests to pass is tricky because this adds a whole extra database to Datasette - and there's various code that loops through …
Needs documentation, but I can wait to write that until I've tested out the feature a bit more.
For a demo, visit https://latest.datasette.io/login-as-root and then hit https://latest.datasette.io/_schemas |
I like the idea of _internal, it's a nice way to get a data catalog quickly. I wonder if this trick applies to databases other than SQLite.
I want Datasette to have its own internal metadata about connected tables, to power features like a paginated searchable homepage in #461. I want this to be a SQLite table.
This could also be part of the directory scanning mechanism prototyped in #672 - where Datasette can be set to continually scan a directory for new database files that it can serve.
Also relevant to the Datasette Library concept in #417.