Add insert --truncate option #118

tsibley · 2020-07-06T21:58:40Z

Deletes all rows in the table (if it exists) before inserting new rows.
SQLite doesn't implement a TRUNCATE TABLE statement but does optimize an
unqualified DELETE FROM.

This can be handy if you want to refresh the entire contents of a table
but a) don't have a PK (so can't use --replace), b) don't want the table
to disappear (even briefly) for other connections, and c) have to handle
records that used to exist being deleted.

Ideally the replacement of rows would appear instantaneous to other
connections by putting the DELETE + INSERT in a transaction, but this is
very difficult without breaking other code as the current transaction
handling is inconsistent and non-systematic. There exists the
possibility for the DELETE to succeed but the INSERT to fail, leaving an
empty table. This is not much worse, however, than the current
possibility of one chunked INSERT succeeding and being committed while
the next chunked INSERT fails, leaving a partially complete operation.

tsibley · 2020-07-07T17:41:06Z

Hmm, while tests pass, this may not work as intended on larger datasets. Looking into it.

tsibley · 2020-07-07T18:45:23Z

Ah, I see the problem. The truncate is inside a loop I didn't realize was there.

tsibley · 2020-07-08T02:16:42Z

I fixed my original oops by moving the DELETE FROM $table out of the chunking loop and repushed. I think this change can be considered in isolation from issues around transactions, which I discuss next.

I wanted to make the DELETE + INSERT happen all in the same transaction so it was robust, but that was more complicated than I expected. The transaction handling in the Database/Table classes isn't systematic, and this poses big hurdles to making Table.insert_all (or other operations) consistent and robust in the face of errors.

For example, I wanted to do this (whitespace ignored in diff, so indentation change not highlighted):

diff --git a/sqlite_utils/db.py b/sqlite_utils/db.py
index d6b9ecf..4107ceb 100644
--- a/sqlite_utils/db.py
+++ b/sqlite_utils/db.py
@@ -1028,6 +1028,11 @@ class Table(Queryable):
         batch_size = max(1, min(batch_size, SQLITE_MAX_VARS // num_columns))
         self.last_rowid = None
         self.last_pk = None
+        with self.db.conn:
+            # Explicit BEGIN is necessary because Python's sqlite3 doesn't
+            # issue implicit BEGINs for DDL, only DML.  We mix DDL and DML
+            # below and might execute DDL first, e.g. for table creation.
+            self.db.conn.execute("BEGIN")
             if truncate and self.exists():
                 self.db.conn.execute("DELETE FROM [{}];".format(self.name))
             for chunk in chunks(itertools.chain([first_record], records), batch_size):
@@ -1038,7 +1043,11 @@ class Table(Queryable):
                         # Use the first batch to derive the table names
                         column_types = suggest_column_types(chunk)
                         column_types.update(columns or {})
-                    self.create(
+                        # Not self.create() because that is wrapped in its own
+                        # transaction and Python's sqlite3 doesn't support
+                        # nested transactions.
+                        self.db.create_table(
+                            self.name,
                             column_types,
                             pk,
                             foreign_keys,
@@ -1139,7 +1148,6 @@ class Table(Queryable):
                     flat_values = list(itertools.chain(*values))
                     queries_and_params = [(sql, flat_values)]
 
-            with self.db.conn:
                 for query, params in queries_and_params:
                     try:
                         result = self.db.conn.execute(query, params)

but that fails in tests because other methods call insert/upsert/insert_all/upsert_all in the middle of their transactions, so the BEGIN statement throws an error (no nested transactions allowed).

Stepping back, it would be nice to make the transaction handling systematic and predictable. One way to do this is to make the sqlite_utils/db.py code generally not begin or commit any transactions, and require the caller to do that instead. This lets the caller mix and match the Python API calls into transactions as appropriate (which is impossible for the API methods themselves to fully determine). Then, make sqlite_utils/cli.py begin and commit a transaction in each @cli.command function, making each command robust and consistent in the face of errors. The big change here, and why I didn't just submit a patch, is that it dramatically changes the Python API to require callers to begin a transaction rather than just immediately calling methods.

There is also the caveat that for each transaction, an explicit BEGIN is also necessary so that DDL as well as DML (as well as SELECTs) are consistent and rolled back on error. There are several bugs.python.org discussions around this particular problem of DDL and some plans to make it better and consistent with DBAPI2, eventually. In the meantime, the sqlite-utils Database class could be a context manager which supports the incantations necessary to do proper transactions. This would still be a Python API change for callers but wouldn't expose them to the weirdness of the sqlite3's default transaction handling.

…file This moves the update from the filesystem layer into the SQL layer, thus allowing multiple processes to coordinate. Datasette holds open SQLite connections and was keeping references to the deleted files. The new --truncate option to `sqlite-utils insert` is added in a PR I submitted <simonw/sqlite-utils#118>. For now, sqlite-utils is installed from our fork. When/if --truncate is released officially, we can switch back to installing from PyPI. Resolves #10.

simonw · 2020-07-08T04:55:18Z

This is a really good idea - and thank you for the detailed discussion in the pull request.

I'm keen to discuss how transactions can work better. I tend to use this pattern in my own code:

with db.conn:
    db["table"].insert(...)

But it's not documented and I've not though very hard about it!

I like having inserts that handle 10,000+ rows commit on every chunk so I can watch their progress from another process, but the library should absolutely support people who want to commit all of the rows in a single transaction - or combine changes with DML.

Lots to discuss here. I'll start a new issue.

simonw · 2020-07-08T04:57:38Z

Thoughts on transactions would be much appreciated in #121

simonw · 2020-07-08T04:58:00Z

Oops didn't mean to click "close" there.

simonw · 2020-07-08T05:05:27Z

The only thing missing from this PR is updates to the documentation. Those need to go in two places:

In the Python API docs. I suggest adding a note to this section about bulk inserts: https://github.com/simonw/sqlite-utils/blob/d0cdaaaf00249230e847be3a3b393ee2689fbfe4/docs/python-api.rst#bulk-inserts
In the CLI docs, in this section: https://github.com/simonw/sqlite-utils/blob/d0cdaaaf00249230e847be3a3b393ee2689fbfe4/docs/cli.rst#inserting-json-data

Here's an example of a previous commit that includes updates to both CLI and API documentation: f9473ac#diff-e3e2a9bfd88566b05001b02a3f51d286

Deletes all rows in the table (if it exists) before inserting new rows. SQLite doesn't implement a TRUNCATE TABLE statement but does optimize an unqualified DELETE FROM. This can be handy if you want to refresh the entire contents of a table but a) don't have a PK (so can't use --replace), b) don't want the table to disappear (even briefly) for other connections, and c) have to handle records that used to exist being deleted. Ideally the replacement of rows would appear instantaneous to other connections by putting the DELETE + INSERT in a transaction, but this is very difficult without breaking other code as the current transaction handling is inconsistent and non-systematic. There exists the possibility for the DELETE to succeed but the INSERT to fail, leaving an empty table. This is not much worse, however, than the current possibility of one chunked INSERT succeeding and being committed while the next chunked INSERT fails, leaving a partially complete operation.

tsibley · 2020-07-08T17:05:59Z

The only thing missing from this PR is updates to the documentation.

Ah, yes, thanks for this reminder! I've repushed with doc bits added.

simonw · 2020-07-08T17:26:02Z

Awesome, thank you very much.

Refs #118, #120

…file This moves the update from the filesystem layer into the SQL layer, thus allowing multiple processes to coordinate. Datasette holds open SQLite connections and was keeping references to the deleted files. The new --truncate option to `sqlite-utils insert` is added in sqlite-utils 2.11, from a PR I submitted.¹ Resolves #10. ¹ simonw/sqlite-utils#118

tsibley force-pushed the insert-truncate branch from 892daf8 to 5ef8546 Compare July 7, 2020 17:06

tsibley mentioned this pull request Jul 7, 2020

Deployed front-end doesn't reload to include latest data seattleflu/switchboard#10

Closed

tsibley marked this pull request as draft July 7, 2020 18:46

tsibley force-pushed the insert-truncate branch 3 times, most recently from 595acc3 to 3cf1e8e Compare July 8, 2020 02:01

tsibley marked this pull request as ready for review July 8, 2020 02:17

tsibley force-pushed the insert-truncate branch from 3cf1e8e to a41ebcb Compare July 8, 2020 02:34

simonw mentioned this pull request Jul 8, 2020

Improved (and better documented) support for transactions #121

Open

simonw closed this Jul 8, 2020

simonw reopened this Jul 8, 2020

tsibley force-pushed the insert-truncate branch from a41ebcb to 332f7d7 Compare July 8, 2020 17:05

simonw merged commit ae45933 into simonw:master Jul 8, 2020

simonw added a commit that referenced this pull request Jul 8, 2020

Release 2.11

0f8b042

Refs #118, #120

simonw added a commit that referenced this pull request Jul 8, 2020

Release notes for 2.11

a236a6b

Refs #118, #120

tsibley mentioned this pull request Jul 8, 2020

Update records in SQLite via SQL instead of re-creating the database/file seattleflu/switchboard#12

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add insert --truncate option #118

Add insert --truncate option #118

tsibley commented Jul 6, 2020 •

edited

Loading

tsibley commented Jul 7, 2020

tsibley commented Jul 7, 2020

tsibley commented Jul 8, 2020

simonw commented Jul 8, 2020

simonw commented Jul 8, 2020

simonw commented Jul 8, 2020

simonw commented Jul 8, 2020 •

edited

Loading

tsibley commented Jul 8, 2020

simonw commented Jul 8, 2020

Add insert --truncate option #118

Add insert --truncate option #118

Conversation

tsibley commented Jul 6, 2020 • edited Loading

tsibley commented Jul 7, 2020

tsibley commented Jul 7, 2020

tsibley commented Jul 8, 2020

simonw commented Jul 8, 2020

simonw commented Jul 8, 2020

simonw commented Jul 8, 2020

simonw commented Jul 8, 2020 • edited Loading

tsibley commented Jul 8, 2020

simonw commented Jul 8, 2020

tsibley commented Jul 6, 2020 •

edited

Loading

simonw commented Jul 8, 2020 •

edited

Loading