Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototoype for Datasette on PostgreSQL #670

Open
simonw opened this issue Feb 13, 2020 · 15 comments
Open

Prototoype for Datasette on PostgreSQL #670

simonw opened this issue Feb 13, 2020 · 15 comments
Labels

Comments

@simonw
Copy link
Owner

simonw commented Feb 13, 2020

I thought this would never happen, but now that I'm deep in the weeds of running SQLite in production for Datasette Cloud I'm starting to reconsider my policy of only supporting SQLite.

Some of the factors making me think PostgreSQL support could be worth the effort:

  • Serverless. I'm getting increasingly excited about writable-database use-cases for Datasette. If it could talk to PostgreSQL then users could easily deploy it on Heroku or other serverless providers that can talk to a managed RDS-style PostgreSQL.
  • Existing databases. Plenty of organizations have PostgreSQL databases. They can export to SQLite using db-to-sqlite but that's a pretty big barrier to getting started - being able to run datasette postgresql://connection-string and start trying it out would be a massively better experience.
  • Data size. I keep running into use-cases where I want to run Datasette against many GBs of data. SQLite can do this but PostgreSQL is much more optimized for large data, especially given the existence of tools like Citus.
  • Marketing. Convincing people to trust their data to SQLite is potentially a big barrier to adoption. Even if I've convinced myself it's trustworthy I still have to convince everyone else.
  • It might not be that hard? If this required a ground-up rewrite it wouldn't be worth the effort, but I have a hunch that it may not be too hard - most of the SQL in Datasette should work on both databases since it's almost all portable SELECT statements. If Datasette did DML this would be a lot harder, but it doesn't.
  • Plugins! This feels like a natural surface for a plugin - at which point people could add MySQL support and suchlike in the future.

The above reasons feel strong enough to justify a prototype.

@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

I'm excited about https://github.com/MagicStack/asyncpg for this - it's a true async PostgreSQL library (my SQLite queries run in a threadpool right now) with extremely impressive performance benchmarks, from the team behind uvloop.

@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

A couple of things I'd like to support:

  • The same datasette instance can have both PostgreSQL and SQLite databases attached to it, and both types will be listed on the homepage.
  • The full test suite runs against both SQLite and PostgreSQL, with as few changes as possible (maybe a few skipIf() decorators for features not available on both). Probably easiest do do this with some kind of flag - e.g. pytest --database=sqlite v.s. pytest --database=postgresql

I can implement that with this in conftest.py:

def pytest_addoption(parser):
    parser.addoption("--database", action="store", default="sqlite")

See https://stackoverflow.com/questions/40880259/how-to-pass-arguments-in-pytest-by-command-line

@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

The biggest difference between the two will be around introspection. I searched the codebase for potential introspection queries, defined as select ... from sqlite_master and pragma ... statements.

  • 7 results for sqlite_master in database.py
  • 3 results in utils/init.py
  • (Ignoring the 4 in inspect.py)
  • 1 result for “pragma table_info” in app.py
  • 2 in utils/init.py
  • 2 results for “pragma foreign_key_list” in utils/init.py

@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

First page to get working: the database view page, which shows a list of tables.

In the code this is entirely implemented with calls to the Database class, which is a good starting point - I can drop in a new version of that class that talks to PostgreSQL instead:

db = self.ds.databases[database]
table_counts = await db.table_counts(5)
views = await db.view_names()
hidden_table_names = set(await db.hidden_table_names())
all_foreign_keys = await db.get_all_foreign_keys()
tables = []
for table in table_counts:
table_columns = await db.table_columns(table)
tables.append(
{
"name": table,
"columns": table_columns,
"primary_keys": await db.primary_keys(table),
"count": table_counts[table],
"hidden": table in hidden_table_names,
"fts_table": await db.fts_table(table),
"foreign_keys": all_foreign_keys[table],
}
)
tables.sort(key=lambda t: (t["hidden"], t["name"]))
return (
{
"database": database,
"size": db.size,
"tables": tables,
"hidden_count": len([t for t in tables if t["hidden"]]),
"views": views,
"queries": self.ds.get_canned_queries(database),
},
{
"show_hidden": request.args.get("_show_hidden"),
"editable": True,
"metadata": metadata,
"allow_download": self.ds.config("allow_download")
and not db.is_mutable
and database != ":memory:",
},
("database-{}.html".format(to_css_class(database)), "database.html"),
)

@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

Introspecting columns in PostgreSQL:

In [3]: sql = """SELECT * 
   ...:   FROM information_schema.columns 
   ...:  WHERE table_schema = 'public' 
   ...:    AND table_name   = 'blog_blogmark' 
   ...:      ;"""                                                                                                                                                     

Each column looks like this:

{'table_catalog': 'simonwillisonblog',
 'table_schema': 'public',
 'table_name': 'blog_blogmark',
 'column_name': 'id',
 'ordinal_position': 1,
 'column_default': "nextval('blog_blogmark_id_seq'::regclass)",
 'is_nullable': 'NO',
 'data_type': 'integer',
 'character_maximum_length': None,
 'character_octet_length': None,
 'numeric_precision': 32,
 'numeric_precision_radix': 2,
 'numeric_scale': 0,
 'datetime_precision': None,
 'interval_type': None,
 'interval_precision': None,
 'character_set_catalog': None,
 'character_set_schema': None,
 'character_set_name': None,
 'collation_catalog': None,
 'collation_schema': None,
 'collation_name': None,
 'domain_catalog': None,
 'domain_schema': None,
 'domain_name': None,
 'udt_catalog': 'simonwillisonblog',
 'udt_schema': 'pg_catalog',
 'udt_name': 'int4',
 'scope_catalog': None,
 'scope_schema': None,
 'scope_name': None,
 'maximum_cardinality': None,
 'dtd_identifier': '1',
 'is_self_referencing': 'NO',
 'is_identity': 'NO',
 'identity_generation': None,
 'identity_start': None,
 'identity_increment': None,
 'identity_maximum': None,
 'identity_minimum': None,
 'identity_cycle': 'NO',
 'is_generated': 'NEVER',
 'generation_expression': None,
 'is_updatable': 'YES'}

@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

Finding out the primary keys for a table: https://wiki.postgresql.org/wiki/Retrieve_primary_key_columns

@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

In [12]: await conn.fetch("""SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS data_type 
    ...: FROM   pg_index i 
    ...: JOIN   pg_attribute a ON a.attrelid = i.indrelid 
    ...:                      AND a.attnum = ANY(i.indkey) 
    ...: WHERE  i.indrelid = 'blog_blogmark'::regclass 
    ...: AND    i.indisprimary;""")                                                                                                                                   
Out[12]: [<Record attname='id' data_type='integer'>]

@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

Challenge: what's the equivalent for PostgreSQL of opening a database in read only mode? Will I have to talk users through creating read only credentials? Can I do this at runtime somehow? Can I detect if the connection has write permission and disable the arbitrary query feature?

simonw added a commit that referenced this issue Feb 13, 2020
This prototype demonstrates the database page working against a
hard-coded connection string to a PostgreSQL database. It lists
tables and their columns and their row count.,
@simonw
Copy link
Owner Author

simonw commented Feb 13, 2020

Got the database page working! It lists tables, their columns and their row count.

Got the table page partially working! It can list rows.

It can't apply filters yet (because PostgreSQL $1 parameters differ from SQLite :blah parameters) and faceting doesn't work because PostgreSQL requires that subqueries have an alias:

subquery in FROM must have an alias HINT: For example, FROM (SELECT ...) [AS] foo.

Still a pretty promising start though!

@snth
Copy link

snth commented Sep 21, 2020

I'm currently using PostgREST to serve OpenAPI APIs off Postgresql databases. I would like to try out datasette once this becomes available on Postgres.

@simonw
Copy link
Owner Author

simonw commented Mar 12, 2021

Challenge: what's the equivalent for PostgreSQL of opening a database in read only mode? Will I have to talk users through creating read only credentials?

It looks like the answer to this is yes - I'll need users to setup read-only credentials. Here's a TIL about that: https://til.simonwillison.net/postgresql/read-only-postgresql-user

@publicmatt
Copy link

I've also been investigating serving postgresql databases over postgrest. I like the idea of hosting some static html + js on github, but having it backed by datasets I can update and control on the database server. I started from SQLite + datasette but would like to host larger datasets (with smaller materialized views exposed publicly). I think the postgrest model where all the authorization and ownership is defined in database role grants is really powerful. But I really miss being able to define an ad-hoc query in sql, then instantly link to a json representation of it like datasette does.

P.S.: I've been sort of following along as you pop up in hacker news here and there. It's been great! Thanks for doing this all out in the open!

@simonw
Copy link
Owner Author

simonw commented May 26, 2021

Worth mentioning here: I've been doing a tun of research around running Datasette-like functionality against PostgreSQL in my https://github.com/simonw/django-sql-dashboard project - which will definitely inform the Datasette implementation.

@yairlenga
Copy link

Is there any working version of datasette/postgresql ?

@tf13
Copy link

tf13 commented Nov 17, 2023

Any progress on this? It would be very helpful on my end as well. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants