Support cross-database joins #283

simonw · 2018-05-24T04:18:39Z

SQLite has the ability to attach multiple databases to a single connection and then run joins across multiple databases.

Since Datasette supports more than one database, this would make a pretty neat feature.

simonw · 2018-05-24T04:21:49Z

The challenge here is which database should be the "default" database. The first database attached to SQLite is treated as the default - if no database is specified in a query, that's the database that queries will be executed against.

Currently, each database URL in Datasette (e.g. https://san-francisco.datasettes.com/sf-film-locations-84594a7 vs. https://san-francisco.datasettes.com/sf-trees-ebc2ad9 ) gets its own independent connection, and all queries within that base URL run against that database.

If we're going to attach multiple databases to the same connection, how do we set which database gets to be the default?

The easiest thing to do here will be to have a special database (maybe which is turned off by default and can be enabled using datasette serve --enable-cross-database-joins or similar) which attaches to ALL the databases. Perhaps it starts as an in-memory database, maybe at /memory?
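A minimal sketch of the "special in-memory database that attaches to everything" idea, using only Python's standard sqlite3 module. The file names, table names, rows, and the helper functions make_db() and connect_all() are made up for illustration; none of this is Datasette code.

```python
import os
import sqlite3
import tempfile


def make_db(path, table, rows):
    """Create a small SQLite file with one two-column table (demo data only)."""
    conn = sqlite3.connect(path)
    conn.execute(f"create table [{table}] (id integer primary key, name text)")
    conn.executemany(f"insert into [{table}] values (?, ?)", rows)
    conn.commit()
    conn.close()


def connect_all(paths):
    """Open an in-memory connection and ATTACH every file to it.

    The alias is the file name without its extension (an assumption of this
    sketch). The in-memory database stays the default "main" database.
    """
    conn = sqlite3.connect(":memory:")
    for path in paths:
        alias = os.path.splitext(os.path.basename(path))[0]
        # The file name is bound as a parameter; the alias is bracket-quoted.
        conn.execute(f"attach database ? as [{alias}]", (path,))
    return conn


tmp = tempfile.mkdtemp()
films = os.path.join(tmp, "films.db")
trees = os.path.join(tmp, "trees.db")
make_db(films, "locations", [(1, "Alcatraz"), (2, "City Hall")])
make_db(trees, "species", [(1, "Oak"), (2, "Elm")])

conn = connect_all([films, trees])
rows = conn.execute(
    "select films.locations.name, trees.species.name "
    "from films.locations join trees.species "
    "on films.locations.id = trees.species.id "
    "order by films.locations.id"
).fetchall()
print(rows)
```

Because no file is opened as the primary database, the question of which attached file is the "default" goes away: every table has to be referenced with its database alias.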

This is a quick-and-dirty proof of concept.

simonw · 2018-05-24T04:26:29Z

I built a very rough prototype of this to prove it could work. It's deployed here - and here's an example of a query that joins across two different databases:

https://datasette-cross-database-joins-prototype.now.sh/memory?sql=select+fivethirtyeight.%5Blove-actually%2Flove_actually_adjacencies%5D.rowid%2C%0D%0Afivethirtyeight.%5Blove-actually%2Flove_actually_adjacencies%5D.actors%2C%0D%0A%5Bgoogle-trends%5D.%5B20150430_UKDebate%5D.city%0D%0Afrom+fivethirtyeight.%5Blove-actually%2Flove_actually_adjacencies%5D%0D%0Ajoin+%5Bgoogle-trends%5D.%5B20150430_UKDebate%5D%0D%0A++on+%5Bgoogle-trends%5D.%5B20150430_UKDebate%5D.rowid+%3D+fivethirtyeight.%5Blove-actually%2Flove_actually_adjacencies%5D.rowid

select fivethirtyeight.[love-actually/love_actually_adjacencies].rowid,
fivethirtyeight.[love-actually/love_actually_adjacencies].actors,
[google-trends].[20150430_UKDebate].city
from fivethirtyeight.[love-actually/love_actually_adjacencies]
join [google-trends].[20150430_UKDebate]
  on [google-trends].[20150430_UKDebate].rowid = fivethirtyeight.[love-actually/love_actually_adjacencies].rowid
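For reference, a link like the one above can be built by URL-encoding the SQL into the ?sql= parameter. This is a sketch using Python's urllib; query_url() is a hypothetical helper, and the CRLF line endings are an assumption based on the %0D%0A sequences in the prototype URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://datasette-cross-database-joins-prototype.now.sh/memory"

# The query from this comment, joined with CRLF line endings.
sql = "\r\n".join([
    "select fivethirtyeight.[love-actually/love_actually_adjacencies].rowid,",
    "fivethirtyeight.[love-actually/love_actually_adjacencies].actors,",
    "[google-trends].[20150430_UKDebate].city",
    "from fivethirtyeight.[love-actually/love_actually_adjacencies]",
    "join [google-trends].[20150430_UKDebate]",
    "  on [google-trends].[20150430_UKDebate].rowid = "
    "fivethirtyeight.[love-actually/love_actually_adjacencies].rowid",
])


def query_url(base_url, sql):
    """Return base_url with the SQL encoded as a ?sql= query string."""
    return base_url + "?" + urlencode({"sql": sql})


url = query_url(BASE_URL, sql)

# Decoding the query string again should give back the original SQL.
decoded = parse_qs(urlsplit(url).query)["sql"][0]
print(url)
```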

I deployed it like this:

datasette publish now --branch=cross-database-joins fivethirtyeight.db google-trends.db --name=datasette-cross-database-joins-prototype

simonw · 2018-05-24T04:28:20Z

I used some pretty ugly hacks, like faking an entire .inspect() block for the :memory: database just to get past the errors I was seeing. To ship this as a feature it will need quite a bit of code refactoring to make those hacks unnecessary.

datasette/datasette/views/database.py

Lines 18 to 26 in 7a3040f

    
    if name == "memory":
        info = {"tables": {
            "sqlite_master": {
                "name": "sqlite_master",
                "hidden": False,
                "columns": [],
                "count": 0,
            }
        }, "views": []}

simonw · 2018-05-24T04:29:40Z

Rather than stealing the /memory namespace for this it would be nicer if these cross-database joins could be executed at the very top-level URL of the Datasette instance - https://example.com/?sql=...

simonw · 2018-05-24T15:15:19Z

Most of the time Datasette is used with just a single database file. So maybe it makes sense for this option to be turned on by default and to ALWAYS be available on the Datasette instance homepage unless the user has explicitly disabled it.

simonw · 2018-05-24T15:15:51Z

This would make Datasette's SQL features a lot more instantly obvious to people who land on a homepage, which is probably a good thing.

simonw · 2018-05-24T15:16:25Z

Should this support canned queries too? I think it should, though that raises interesting questions regarding their URL structure.

simonw · 2018-05-24T15:17:10Z

Another option: give this the /-/all URL namespace.

simonw · 2018-05-24T15:21:37Z

Giving it /all/ would be easier since that way the existing URL routes (including canned queries) would all work... but I would have to teach it NOT to expect a database content hash on that URL.

Or maybe it should still have a content hash (to enable far-future cache expiry headers on query results) but the hash should be constructed out of all of the other database hashes concatenated together.

That way the URLs would be /all-5de27e3 and /all-5de27e3/canned-query-name

Only downside: this would make it impossible to have a database file with the name all.db. I think that's probably an OK trade-off. You could turn the feature off with a config flag if you really want to use that filename (for whatever reason).

How about /-all-5de27e3/ instead to avoid collisions?
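A rough sketch of the combined content hash described above: concatenate the per-database hashes and hash the result. The choice of SHA-256, the sort by database name, and the 7-character truncation are assumptions of this sketch, and the google-trends hash is a made-up placeholder.

```python
import hashlib


def combined_hash(database_hashes, length=7):
    """Hash the concatenation of the per-database hashes.

    Hashes are joined in database-name order so the result does not depend
    on the order the files were passed on the command line.
    """
    joined = "".join(database_hashes[name] for name in sorted(database_hashes))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:length]


hashes = {
    "fivethirtyeight": "5de27e3",  # hash that appears in the URLs in this thread
    "google-trends": "0000000",    # placeholder, not a real hash
}
combined = combined_hash(hashes)
url_path = f"/-all-{combined}"
print(url_path)
```

Any change to any one database changes its own hash and therefore the combined hash, so far-future cache headers on the combined URL stay safe.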

simonw · 2018-05-24T15:23:37Z

On the /-all-5de27e3 page we can show the regular https://fivethirtyeight.datasettes.com/fivethirtyeight-5de27e3 interface but instead of the list of tables we can show a list of attached databases plus some help text showing how to construct a cross-database join.

simonw · 2018-05-24T15:27:42Z

For an example query that pre-populates that textarea... maybe a UNION that pulls the first 10 rows from the first table of each of the first two databases?

select * from (select rowid, actors from fivethirtyeight.[love-actually/love_actually_adjacencies] limit 10)
   union all
select * from (select rowid, city from [google-trends].[20150430_UKDebate] limit 10)

https://datasette-cross-database-joins-prototype.now.sh/memory?sql=select+*+from+%28select+rowid%2C+actors+from+fivethirtyeight.%5Blove-actually%2Flove_actually_adjacencies%5D+limit+10%29%0D%0A+++union+all%0D%0Aselect+*+from+%28select+rowid%2C+city+from+%5Bgoogle-trends%5D.%5B20150430_UKDebate%5D+limit+10%29

simonw · 2018-05-24T16:00:05Z

I like /-/all-5de27e3 for this (with /-/all redirecting to the correct hash)

simonw · 2019-10-02T23:02:15Z

I've been thinking pretty hard about this as part of #569. My big concerns are:

* If I'm caching and reusing connections I need to worry about the different combinations - if I have four databases do I cache separate connections for the ("one", "two") AND ("two", "three") AND ("one", "three") and so on pairs?
* How does the API and interface deal with instances where you have a database connected as the primary and you want to ATTACH another database and talk to that as well?
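To put a number on the first concern, here is a quick sketch that counts the distinct sets of databases a connection cache might have to hold. The database names "one" to "four" follow the example above; attachment_sets() is a hypothetical helper.

```python
from itertools import combinations


def attachment_sets(names, min_size=2):
    """Yield every distinct set of at least min_size databases."""
    for size in range(min_size, len(names) + 1):
        yield from combinations(names, size)


names = ["one", "two", "three", "four"]
pairs = list(combinations(names, 2))
all_sets = list(attachment_sets(names))
print(len(pairs), len(all_sets))
```

With four databases that is already 6 pairs and 11 sets of two or more, and the count grows exponentially as databases are added.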

I think the best way to do this is to say that cross-database joins will only be available against the :memory: database. Maybe with an optional mode you can run like datasette --crossdb which causes every database to be ATTACHed to that connection with an alias so you can start running queries.

If this proves to be a problem when hundreds of files are attached to a Datasette Library instance (#417) then maybe cross database joins are handled (in that case) by the authenticated user selecting which ones to ?_attach= and detaching them at the end of the request. Also perhaps limit to joining across a maximum of 3 databases at once in this case.

I can probably avoid the scariest negative consequences of cross-database joins by having them turned off by default for signed-out users. The datasette-on-my-laptop or authenticated Datasette Library cases can be opt-in and can be a little less locked down.

simonw · 2019-11-09T21:49:51Z

Better idea: if you run Datasette in cross-database joining mode, all connections start out as memory connections and then have new databases attached to them on-demand.

All table view queries will be automatically rewritten to start SELECT db.table.one, db.table.two FROM db.table ...
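A sketch of what that rewriting might produce. quote() and qualified_select() are hypothetical helpers written for this example; the table and column names come from the queries earlier in this thread, and the inserted row is demo data.

```python
import sqlite3


def quote(identifier):
    """Bracket-quote an identifier, as the queries in this thread do."""
    if "]" in identifier:
        raise ValueError("identifier cannot contain ']'")
    return f"[{identifier}]"


def qualified_select(database, table, columns, limit=None):
    """Build a SELECT in which every column and the table carry the database alias."""
    target = f"{quote(database)}.{quote(table)}"
    cols = ", ".join(f"{target}.{quote(column)}" for column in columns)
    sql = f"select {cols} from {target}"
    if limit is not None:
        sql += f" limit {int(limit)}"
    return sql


sql = qualified_select("google-trends", "20150430_UKDebate", ["rowid", "city"], limit=10)
print(sql)

# Check that SQLite accepts the generated SQL against an attached database.
conn = sqlite3.connect(":memory:")
conn.execute("attach database ':memory:' as [google-trends]")
conn.execute("create table [google-trends].[20150430_UKDebate] (city text)")
conn.execute("insert into [google-trends].[20150430_UKDebate] (city) values ('London')")
result = conn.execute(sql).fetchall()
```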

simonw · 2019-11-09T21:51:41Z

It may turn out that we have to recommend NOT exposing a Datasette instance with dozens of database files to the public if it has multi-db queries enabled - I will need to load test to find out whether this recommendation is needed or not.

rayvoelker · 2021-02-18T02:13:56Z

I was going to ask you about this issue when we talk during your office-hours schedule this Friday, but was there any support ever added for doing this cross-database joining?

I have a use-case where it could be pretty neat to do analysis using this tool on time-specific databases from snapshots

https://ilsweb.cincinnatilibrary.org/collection-analysis/

and thanks again for such an amazing tool!

simonw · 2021-02-18T05:56:30Z

I'm going to try prototyping the --crossdb option that causes /_memory to connect to all databases as a starting point and see how well that works.

simonw · 2021-02-18T19:13:30Z

It turns out SQLite defaults to a maximum of 10 attached databases. This can be increased using a compile-time constant, but even with that it cannot be more than 62: https://stackoverflow.com/questions/9845448/attach-limit-10
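The limit for a particular SQLite build can be checked empirically by attaching in-memory databases until SQLite refuses. A small sketch; probe_attach_limit() is a hypothetical helper and the ceiling of 200 is just a safety stop for the loop.

```python
import sqlite3


def probe_attach_limit(ceiling=200):
    """Attach in-memory databases until SQLite refuses; return how many succeeded."""
    conn = sqlite3.connect(":memory:")
    attached = 0
    try:
        for i in range(ceiling):
            try:
                conn.execute(f"attach database ':memory:' as probe_{i}")
            except sqlite3.OperationalError:
                # Raised with "too many attached databases" once the limit is hit.
                break
            attached += 1
    finally:
        conn.close()
    return attached


limit = probe_attach_limit()
print(limit)
```

On a build with the default compile-time setting this should report 10, but the number depends on how the SQLite library behind Python's sqlite3 module was compiled.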

simonw · 2021-02-18T19:15:37Z

select * from pragma_database_list(); is useful - shows all attached databases for the current connection.
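For example, from Python's sqlite3 module (a sketch; the alias "extra" is made up, and the table-valued form needs a reasonably recent SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("attach database ':memory:' as extra")

# Each row is (seq, name, file); file is empty for in-memory databases.
attached = list(conn.execute("select * from pragma_database_list()"))
names = [row[1] for row in attached]
print(names)
```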

simonw · 2021-02-18T19:44:02Z

For the moment I'm going to hard-code a SQLITE_LIMIT_ATTACHED=10 constant and only attach the first 10 databases to the _memory connection.

simonw · 2021-02-18T19:47:34Z

I have a working version now, moving development to a pull request.

simonw · 2021-02-18T22:06:14Z

The implementation in #1232 is ready to land. It's the simplest-thing-that-could-possibly-work: you can run datasette one.db two.db three.db --crossdb and then use the /_memory page to run joins across tables from multiple databases.

It only works on the first 10 databases that were passed to the command-line. This means that if you have a Datasette instance with hundreds of attached databases (see Datasette Library) this won't be particularly useful for you.

So... a better, future version of this feature would be one that lets you join across databases on command - maybe by hitting /_memory?attach=db1&attach=db2 to get a special connection.

Also worth noting: plugins that implement the prepare_connection() hook can attach additional databases - so if you need better, customized support for this, one way to handle that would be with a custom plugin.

* Test for cross-database join, refs #283
* Warn if --crossdb used with more than 10 DBs, refs #283
* latest.datasette.io demo of --crossdb joins, refs #283
* Show attached databases on /_memory page, refs #283
* Documentation for cross-database queries, refs #283

simonw · 2021-02-18T22:16:46Z

Demo is now live here: https://latest.datasette.io/_memory

The documentation is at https://docs.datasette.io/en/latest/sql_queries.html#cross-database-queries - it links to this example query: https://latest.datasette.io/_memory?sql=select%0D%0A++%27fixtures%27+as+database%2C+*%0D%0Afrom%0D%0A++%5Bfixtures%5D.sqlite_master%0D%0Aunion%0D%0Aselect%0D%0A++%27extra_database%27+as+database%2C+*%0D%0Afrom%0D%0A++%5Bextra_database%5D.sqlite_master

simonw · 2021-02-19T02:10:21Z

This feature is now released! https://docs.datasette.io/en/stable/changelog.html#v0-55

justinpinkney · 2021-03-03T12:28:42Z

One note on using this pragma: I got an error on starting Datasette: no such table: pragma_database_list.

I traced this to an older version of sqlite3 (3.14.2); upgrading to a newer version (3.34.2) fixed the issue.

simonw · 2021-06-06T09:40:18Z

> One note on using this pragma: I got an error on starting Datasette: no such table: pragma_database_list.
>
> I traced this to an older version of sqlite3 (3.14.2); upgrading to a newer version (3.34.2) fixed the issue.

That issue is fixed in #1276.
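For anyone hitting the same error in their own code, one way to guard against it is to fall back to the plain PRAGMA statement on older SQLite versions. This is a sketch and not necessarily the approach taken in #1276; list_databases() is a hypothetical helper. The table-valued pragma_ functions were added in SQLite 3.16.0, which is consistent with 3.14.2 failing.

```python
import sqlite3

# Table-valued PRAGMA functions were added in SQLite 3.16.0.
PRAGMA_FUNCTIONS_MIN_VERSION = (3, 16, 0)


def list_databases(conn, sqlite_version=sqlite3.sqlite_version_info):
    """Return the names of all databases attached to conn.

    Uses pragma_database_list() when the SQLite version supports it and
    falls back to the plain PRAGMA statement otherwise.
    """
    if sqlite_version >= PRAGMA_FUNCTIONS_MIN_VERSION:
        sql = "select seq, name, file from pragma_database_list()"
    else:
        sql = "PRAGMA database_list"
    # Both forms return the database name in the second column.
    return [row[1] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.execute("attach database ':memory:' as extra")

current = list_databases(conn)
# Pretend to be the old version from the report to exercise the fallback.
fallback = list_databases(conn, sqlite_version=(3, 14, 2))
print(current, fallback)
```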

simonw added the large label May 24, 2018

simonw added a commit that referenced this issue May 24, 2018

Rough prototype of cross-database-joins, refs #283

7a3040f

This is a quick-and-dirty proof of concept.

simonw mentioned this issue May 24, 2018

Ability to enable/disable specific features via --config #284

Closed

4 tasks

simonw added the feature label May 28, 2018

psychemedia mentioned this issue Mar 19, 2019

Linked Data(sette) #412

Open

simonw mentioned this issue Aug 17, 2019

More advanced connection pooling #569

Open

simonw added a commit that referenced this issue Feb 18, 2021

--crossdb option for joining across databases, refs #283

1c5d340

simonw mentioned this issue Feb 18, 2021

--crossdb option for joining across databases #1232

Merged

5 tasks

simonw added a commit that referenced this issue Feb 18, 2021

Test for cross-database join, refs #283

535b3bb

simonw added a commit that referenced this issue Feb 18, 2021

Warn if --crossdb used with more than 10 DBs, refs #283

b2c3201

simonw added a commit that referenced this issue Feb 18, 2021

latest.datasette.io demo of --crossdb joins, refs #283

5ea2d60

simonw added a commit that referenced this issue Feb 18, 2021

Show attached databases on /_memory page, refs #283

fa029ed

simonw added a commit that referenced this issue Feb 18, 2021

Documentation for cross-database queries, refs #283

8876499

simonw mentioned this issue Feb 18, 2021

Runtime support for ATTACHing multiple databases #1234

Open

simonw closed this as completed Feb 18, 2021

simonw mentioned this issue Feb 19, 2021

--attach command line option for attaching extra databases simonw/sqlite-utils#236

Closed
