-
-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support cross-database joins #283
Comments
The challenge here is which database should be the "default" database. The first database attached to SQLite is treated as the default - if no database is specified in a query, that's the database that queries will be executed against. Currently, each database URL in Datasette (e.g. https://san-francisco.datasettes.com/sf-film-locations-84594a7 v.s. https://san-francisco.datasettes.com/sf-trees-ebc2ad9 ) gets its own independent connection, and all queries within that base URL run against that database. If we're going to attach multiple databases to the same connection, how do we set which database gets to be the default? The easiest thing to do here will be to have a special database (maybe which is turned off by default and can be enabled using |
This is a quick-and-dirty proof of concept.
I built a very rough prototype of this to prove it could work. It's deployed here - and here's an example of a query that joins across two different databases:
I deployed it like this:
|
I used some pretty ugly hacks, like faking an entire datasette/datasette/views/database.py Lines 18 to 26 in 7a3040f
|
Rather than stealing the |
Most of the time Datasette is used with just a single database file. So maybe it makes sense for this option to be turned on by default and to ALWAYS be available on the Datasette instance homepage unless the user has explicitly disabled it. |
This would make Datasett's SQL features a lot more instantly obvious to people who land on a homepage, which is probably a good thing. |
Should this support canned queries too? I think it should, though that raises interesting questions regarding their URL structure. |
Another option: give this the |
Giving it Or maybe it should still have a content hash (to enable far-future cache expiry headers on query results) but the hash should be constructed out of all of the other database hashes concatenated together. That way the URLs would be Only downside: this would make it impossible to have a database file with the name How about |
On the |
For an example query that pre-populates that textarea... maybe a UNION that pulls the first 10 rows from the first table of each of the first two databases?
|
I like |
I've been thinking pretty hard about this as part of #569. My big concerns are:
I think the best way to do this is to say that cross-database joins will only be available against the If this proves to be a problem when hundreds of files are attached to a Datasette Library instance (#417) then maybe cross database joins are handled (in that case) by the authenticated user selecting which ones to ?_attach= and detaching them at the end of the request. Also perhaps limit to joining across a maximum of 3 databases at once in this case. I can probably avoid the scariest negative consequences of cross-database joins by having them turned off by default for signed-out users. The datasette-on-my-laptop or authenticated Datasette Library cases can be opt-in and can be a little less locked down. |
Better idea: if you run Datasette in cross-database joining mode, all connections start out as memory connections and then have new databases attached to them on-demand. All table view queries will be automatically rewritten to start |
It may turn out that we have to recommend NOT exposing a Datasette instance to the public with dozens of database files that has multi-db queries enabled - will need to load test to understand if this recommendation is needed or not. |
I was going ask you about this issue when we talk during your office-hours schedule this Friday, but was there any support ever added for doing this cross-database joining? I have a use-case where could be pretty neat to do analysis using this tool on time-specific databases from snapshots https://ilsweb.cincinnatilibrary.org/collection-analysis/ and thanks again for such an amazing tool! |
I'm going to to try prototyping the |
It turns out SQLite defaults to a maximum of 10 attached databases. This can be increased using a compile-time constant, but even with that it cannot be more than 62: https://stackoverflow.com/questions/9845448/attach-limit-10 |
|
For the moment I'm going to hard-code a |
I have a working version now, moving development to a pull request. |
The implementation in #1232 is ready to land. It's the simplest-thing-that-could-possibly-work: you can run It only works on the first 10 databases that were passed to the command-line. This means that if you have a Datasette instance with hundreds of attached databases (see Datasette Library) this won't be particularly useful for you. So... a better, future version of this feature would be one that lets you join across databases on command - maybe by hitting Also worth noting: plugins that implement the prepare_connection() hook can attach additional databases - so if you need better, customized support for this one way to handle that would be with a custom plugin. |
This feature is now released! https://docs.datasette.io/en/stable/changelog.html#v0-55 |
One note on using this pragma I got an error on starting datasette I diagnosed this to an older version of sqlite3 (3.14.2) and upgrading to a newer version (3.34.2) fixed the issue. |
That issue is fixed in #1276. |
SQLite has the ability to attach multiple databases to a single connection and then run joins across multiple databases.
Since Datasette supports more than one database, this would make a pretty neat feature.
The text was updated successfully, but these errors were encountered: