Maintain an in-memory SQLite table of connected databases and their tables #1150

Closed
simonw opened this issue Dec 17, 2020 · 32 comments

Comments

@simonw
Owner

simonw commented Dec 17, 2020

I want Datasette to have its own internal metadata about connected tables, to power features like a paginated searchable homepage in #461. I want this to be a SQLite table.

This could also be part of the directory scanning mechanism prototyped in #672 - where Datasette can be set to continually scan a directory for new database files that it can serve.

Also relevant to the Datasette Library concept in #417.

@simonw simonw added this to the Datasette 1.0 milestone Dec 17, 2020
@simonw
Owner Author

simonw commented Dec 17, 2020

Pages that need a list of all databases - the index page and /-/databases for example - could trigger a "check for new database files in the configured directories" scan.

That scan would run at most once every 5 (n) seconds - if the check is triggered but the scan has already run more recently than that, it is skipped.

Hopefully this means it could be done as a blocking operation, rather than trying to run it in a thread.

When it runs it scans for *.db or *.sqlite files (maybe one or two other extensions) that it hasn't seen before. It also checks that each database file it already knows about still exists.

If it finds any new ones it connects to them once to run .schema. It also runs PRAGMA schema_version on each known database so that it can compare the schema version number to the last one it saw. That's how it detects if there are new tables or if the cached schema needs to be updated.
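
Roughly, the throttled check could look something like this (a sketch only - SCAN_INTERVAL, known_versions and the refresh step are illustrative names, not existing Datasette code):

import pathlib, sqlite3, time

SCAN_INTERVAL = 5  # the "n" seconds above
_last_scan = 0
known_versions = {}  # path -> schema_version last seen

def maybe_scan(directory):
    global _last_scan
    if time.time() - _last_scan < SCAN_INTERVAL:
        return  # scanned too recently, skip this check
    _last_scan = time.time()
    seen = set()
    for path in pathlib.Path(directory).glob("*.db"):
        seen.add(path)
        conn = sqlite3.connect(str(path))
        version = conn.execute("PRAGMA schema_version").fetchone()[0]
        conn.close()
        if known_versions.get(path) != version:
            known_versions[path] = version
            # new database, or its schema changed: refresh the cached schema here
    # forget databases whose files no longer exist
    for path in list(known_versions):
        if path not in seen:
            del known_versions[path]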

@simonw
Owner Author

simonw commented Dec 17, 2020

Open question: will this work for hundreds of database files, or is the overhead of connecting to each of 100 databases in turn to run PRAGMA schema_version too high?

@simonw
Owner Author

simonw commented Dec 17, 2020

Quick micro-benchmark, run against a folder with 46 database files adding up to 1.4GB total:

import pathlib, sqlite3, time

paths = list(pathlib.Path(".").glob('*.db'))

def schema_version(path):
    db = sqlite3.connect(path)
    version = db.execute("PRAGMA schema_version").fetchall()[0]
    db.close()
    return version

def all():
    versions = {}
    for path in paths:
        versions[path.name] = schema_version(path)
    return versions

start = time.time(); all(); print(time.time() - start)
# 0.012346982955932617

So that's 12ms.

@simonw
Owner Author

simonw commented Dec 17, 2020

I tried against my entire ~/Development/Dropbox folder - deeply nested with 381 SQLite database files in sub-folders - and it took 25s! But it turned out 23.9s of that was the call to pathlib.Path("/Users/simon/Dropbox/Development").glob('**/*.db').

So it looks like connecting to a SQLite database file and getting the schema version is extremely fast. Scanning directories is slower.

@simonw
Owner Author

simonw commented Dec 17, 2020

It's the recursive scan that's expensive. I created 380 empty SQLite databases in a folder and timed list(pathlib.Path("/tmp").glob("*.db")), and it took 0.002s.

So maybe I'll tell users that all SQLite databases have to be in the root folder.

@simonw
Owner Author

simonw commented Dec 17, 2020

Grabbing the schema version of 380 files in the root directory takes 70ms.

@simonw
Owner Author

simonw commented Dec 17, 2020

I'm going to assume that even the heaviest user will have trouble going beyond a few hundred database files, so this is fine.

@simonw
Owner Author

simonw commented Dec 17, 2020

Next challenge: figure out how to use the Database class from https://github.com/simonw/datasette/blob/0.53/datasette/database.py for an in-memory database which persists data for the lifetime of the server, and allows access to that in-memory database from multiple threads in a way that lets them see each other's changes.
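
One candidate approach (a sketch, not necessarily what the Database class will end up doing): SQLite's named in-memory databases with shared cache, where every connection opened with the same URI - including connections from different threads - sees the same data:

import sqlite3

# "datasette_internal" is an arbitrary name for this example; the database
# lives for as long as at least one connection to it stays open.
uri = "file:datasette_internal?mode=memory&cache=shared"
conn1 = sqlite3.connect(uri, uri=True)
conn2 = sqlite3.connect(uri, uri=True)

conn1.execute("create table databases (name text, path text)")
conn1.execute("insert into databases values ('fixtures', '/data/fixtures.db')")
conn1.commit()

# The second connection sees the first connection's committed writes
print(conn2.execute("select * from databases").fetchall())
# [('fixtures', '/data/fixtures.db')]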

@simonw
Owner Author

simonw commented Dec 18, 2020

Next step: design a schema for the in-memory database table that exposes all of the tables.

I want to support things like:

  • Show me all of the tables
  • Show me the columns in a table
  • Show me all tables that contain a tags column
  • Show me the indexes
  • Show me every table configured for full-text search

Maybe a starting point would be to build concrete tables using the results of things like PRAGMA foreign_key_list(table) and PRAGMA table_xinfo(table) - note though that table_xinfo requires SQLite 3.26.0 or higher, as shown here:

def table_column_details(conn, table):
    if supports_table_xinfo():
        # table_xinfo was added in 3.26.0
        return [
            Column(*r)
            for r in conn.execute(
                f"PRAGMA table_xinfo({escape_sqlite(table)});"
            ).fetchall()
        ]
    else:
        # Treat hidden as 0 for all columns
        return [
            Column(*(list(r) + [0]))
            for r in conn.execute(
                f"PRAGMA table_info({escape_sqlite(table)});"
            ).fetchall()
        ]

@simonw
Owner Author

simonw commented Dec 18, 2020

Prototype: https://latest.datasette.io/fixtures?sql=select+%27facetable%27+as+%27table%27%2C+*+from+pragma_table_xinfo%28%27facetable%27%29%0D%0Aunion%0D%0Aselect+%27searchable%27+as+%27table%27%2C+*+from+pragma_table_xinfo%28%27searchable%27%29%0D%0Aunion%0D%0Aselect+%27compound_three_primary_keys%27+as+%27table%27%2C+*+from+pragma_table_xinfo%28%27compound_three_primary_keys%27%29

select 'facetable' as 'table', * from pragma_table_xinfo('facetable')
union
select 'searchable' as 'table', * from pragma_table_xinfo('searchable')
union
select 'compound_three_primary_keys' as 'table', * from pragma_table_xinfo('compound_three_primary_keys')
table                       | cid | name                   | type        | notnull | dflt_value | pk | hidden
compound_three_primary_keys | 0   | pk1                    | varchar(30) | 0       |            | 1  | 0
compound_three_primary_keys | 1   | pk2                    | varchar(30) | 0       |            | 2  | 0
compound_three_primary_keys | 2   | pk3                    | varchar(30) | 0       |            | 3  | 0
compound_three_primary_keys | 3   | content                | text        | 0       |            | 0  | 0
facetable                   | 0   | pk                     | integer     | 0       |            | 1  | 0
facetable                   | 1   | created                | text        | 0       |            | 0  | 0
facetable                   | 2   | planet_int             | integer     | 0       |            | 0  | 0
facetable                   | 3   | on_earth               | integer     | 0       |            | 0  | 0
facetable                   | 4   | state                  | text        | 0       |            | 0  | 0
facetable                   | 5   | city_id                | integer     | 0       |            | 0  | 0
facetable                   | 6   | neighborhood           | text        | 0       |            | 0  | 0
facetable                   | 7   | tags                   | text        | 0       |            | 0  | 0
facetable                   | 8   | complex_array          | text        | 0       |            | 0  | 0
facetable                   | 9   | distinct_some_null     |             | 0       |            | 0  | 0
searchable                  | 0   | pk                     | integer     | 0       |            | 1  | 0
searchable                  | 1   | text1                  | text        | 0       |            | 0  | 0
searchable                  | 2   | text2                  | text        | 0       |            | 0  | 0
searchable                  | 3   | name with . and spaces | text        | 0       |            | 0  | 0

@simonw
Owner Author

simonw commented Dec 18, 2020

Here's a simpler query pattern (not using CTEs so should work on older versions of SQLite) - this one lists all indexes for all tables:

select
  sqlite_master.name as 'table',
  indexes.*
from
  sqlite_master
  join pragma_index_list(sqlite_master.name) indexes
where
  sqlite_master.type = 'table'

https://latest.datasette.io/fixtures?sql=select%0D%0A++sqlite_master.name+as+%27table%27%2C%0D%0A++indexes.*%0D%0Afrom%0D%0A++sqlite_master%0D%0A++join+pragma_index_list%28sqlite_master.name%29+indexes%0D%0Awhere%0D%0A++sqlite_master.type+%3D+%27table%27

@simonw
Owner Author

simonw commented Dec 18, 2020

https://www.sqlite.org/pragma.html#pragfunc says:

  • This feature is experimental and is subject to change. Further documentation will become available if and when the table-valued functions for PRAGMAs feature becomes officially supported.
  • The table-valued functions for PRAGMA feature was added in SQLite version 3.16.0 (2017-01-02). Prior versions of SQLite cannot use this feature.

@simonw
Owner Author

simonw commented Dec 18, 2020

I've been rediscovering the pattern I already documented in this TIL: https://github.com/simonw/til/blob/main/sqlite/list-all-columns-in-a-database.md#better-alternative-using-a-join

@simonw
Owner Author

simonw commented Dec 18, 2020

I'm going to use five tables to start off with:

  • databases - a list of databases. Each one has a name, path (if it's on disk), is_memory, schema_version
  • tables - a list of tables. Each row is database_name, table_name, sql (the create table statement) - may add more columns in the future, in particular maybe a last_row_count to cache results of counting the rows.
  • columns - a list of columns. It's the output of pragma_table_xinfo with the database_name and table_name columns added at the beginning.
  • foreign_keys - a list of foreign keys - pragma_foreign_key_list output plus database_name and table_name.
  • indexes - a list of indexes - pragma_index_list output plus database_name and table_name.
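
Sketched as SQL, that schema could look something like this (illustrative only - the pragma-derived tables would just carry whatever columns the corresponding pragmas return, prefixed by database_name and table_name):

create table databases (
    database_name text primary key,
    path text,
    is_memory integer,
    schema_version integer
);

create table tables (
    database_name text,
    table_name text,
    sql text,
    primary key (database_name, table_name)
);

create table columns (
    database_name text,
    table_name text,
    cid integer,
    name text,
    type text,
    "notnull" integer,  -- quoted because notnull is an SQL keyword
    dflt_value text,
    pk integer,
    hidden integer
);

-- foreign_keys and indexes follow the same pattern: the pragma output
-- columns plus database_name and table_name at the start.

Which would make queries like "show me all tables that contain a tags column" from the earlier list straightforward:

select database_name, table_name from columns where name = 'tags'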

@simonw
Owner Author

simonw commented Dec 18, 2020

SQLite uses indexes rather than indices as the plural, so I'll go with that: https://sqlite.org/lang_createindex.html

@simonw
Owner Author

simonw commented Dec 18, 2020

Maintaining this database will be the responsibility of a subclass of Database called _SchemaDatabase which will be managed by the Datasette instance.

@simonw
Owner Author

simonw commented Dec 18, 2020

I think I'm going to have to build this without using the pragma_x() SQL functions, as they were only added in SQLite 3.16.0 (released 2017-01-02) and I've seen plenty of Datasette instances running on older versions of SQLite.

@simonw
Owner Author

simonw commented Dec 18, 2020

Simpler implementation idea: a Datasette method .refresh_schemas() which loops through all known databases, checks their schema version and updates the in-memory schemas database if they have changed.
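
Something along these lines, using the Database class's execute() / execute_write() methods as they look in 0.53 (a sketch: the "_schemas" name anticipates the naming decision further down this thread, and populate_schema_tables() is a hypothetical helper that would run the pragmas and write the rows):

async def refresh_schemas(self):
    internal_db = self.databases["_schemas"]
    current = {
        row["database_name"]: row["schema_version"]
        for row in (
            await internal_db.execute(
                "select database_name, schema_version from databases"
            )
        ).rows
    }
    for name, database in self.databases.items():
        if name == "_schemas":
            continue
        schema_version = (
            await database.execute("PRAGMA schema_version")
        ).rows[0][0]
        if current.get(name) == schema_version:
            continue  # unchanged since the last refresh, skip it
        await internal_db.execute_write(
            "insert or replace into databases "
            "(database_name, path, is_memory, schema_version) "
            "values (?, ?, ?, ?)",
            [name, str(database.path) if database.path else None,
             database.is_memory, schema_version],
            block=True,
        )
        # re-populate the tables / columns / foreign_keys / indexes rows
        # for this database (hypothetical helper)
        await populate_schema_tables(internal_db, database)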

@simonw
Owner Author

simonw commented Dec 18, 2020

I could have another table that stores the combined rows from sqlite_master on every connected database so I have a copy of the schema SQL.
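
Running something like this against each connected database and inserting the results (plus the database name) into that table would capture the schema SQL:

select type, name, tbl_name, rootpage, sql
from sqlite_master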

@simonw
Owner Author

simonw commented Dec 18, 2020

I need to figure out how this will interact with Datasette permissions.

If some tables are private, but others are public, should users be able to see the private tables listed in the schema metadata?

If not, how can that mechanism work?

@simonw
Owner Author

simonw commented Dec 18, 2020

One solution on permissions: if Datasette had an efficient way of saying "list the tables that this user has access to" I could use that as a filter any time the user views the schema information. The implementation could be tricky though.

@simonw
Owner Author

simonw commented Dec 18, 2020

I do need to solve the permissions problem properly though, because one of the goals of this system is to provide a paginated, searchable list of databases and tables for the homepage of the instance - #991.

As such, the homepage will need to be able to display only the tables and databases that the user has permission to view.

@simonw
Owner Author

simonw commented Dec 18, 2020

I may be overthinking that problem. Many queries are fast in SQLite. If a Datasette instance has 1,000 connected tables will even that be a performance problem for permission checks? I should benchmark to find out.

@simonw
Owner Author

simonw commented Dec 18, 2020

The homepage currently performs a massive flurry of permission checks - one for each database, table and view: https://github.com/simonw/datasette/blob/0.53/datasette/views/index.py#L21-L75

A paginated version of this is a little daunting, as the permission checks would have to be carried out for every single table just to calculate the count that will be paginated.

@simonw
Owner Author

simonw commented Dec 18, 2020

I'm not going to block this issue on permissions - I will tackle the efficient bulk permissions problem in #1152.

@simonw
Owner Author

simonw commented Dec 18, 2020

I'm going to tidy this up and land it. A couple of additional decisions:

  • The database will be called /_schemas
  • By default it will only be visible to root - thus avoiding having to solve the permissions problem with regard to users seeing schemas for tables that are otherwise invisible to them.

@simonw
Owner Author

simonw commented Dec 18, 2020

I'm going to move the code into a utils/schemas.py module, to avoid further extending the Datasette class definition and to make it more easily testable.

@simonw
Owner Author

simonw commented Dec 18, 2020

Getting all the tests to pass is tricky because this adds a whole extra database to Datasette - and there's various code that loops through ds.databases as part of the tests.

@simonw simonw closed this as completed in ebc7aa2 Dec 18, 2020
@simonw
Owner Author

simonw commented Dec 18, 2020

Needs documentation, but I can wait to write that until I've tested out the feature a bit more.

@simonw
Owner Author

simonw commented Dec 18, 2020

@noklam

noklam commented Dec 27, 2020

I like the idea of _internal, it's a nice way to get a data catalog quickly. I wonder if this trick applies to databases other than SQLite.
