Description
https://github.com/simonw/sqlite-diffable can be used to dump the content of a sqlite db as text files and reloading them, this is useful to keep track of data/schemas using git.
I tried to use it for dumping and re-loading the llm
logs db and faced some difficulties. Not sure if these are actual problems that need fixing so I am opening this to ask the questions and provide some context.
$ mkdir dump
$ cp "$(llm logs path)" .
$ sqlite-utils tables logs.db --columns --table --counts # show contents of original db
table count columns
--------------------- ------- -------------------------------------------------------------------------------------------------------------------------------------------------
_llm_migrations 14 ['name', 'applied_at']
conversations 12 ['id', 'name', 'model']
responses 15 ['id', 'model', 'prompt', 'system', 'prompt_json', 'options_json', 'response', 'response_json', 'conversation_id', 'duration_ms', 'datetime_utc']
responses_fts 15 ['prompt', 'response']
responses_fts_data 21 ['id', 'block']
responses_fts_idx 19 ['segid', 'term', 'pgno']
responses_fts_docsize 15 ['id', 'sz']
responses_fts_config 1 ['k', 'v']
attachments 0 ['id', 'type', 'path', 'url', 'content']
prompt_attachments 0 ['response_id', 'attachment_id', 'order']
$ sqlite-diffable dump logs.db dump/ --all # dump all tables
when trying to reload, it hits an error
$ sqlite-diffable load logs_recovered.db dump/
Error: table 'responses_fts_docsize' already exists
Use the --replace option to over-write existing tables
$ sqlite-utils tables logs_recovered.db --columns --table --counts
table count columns
--------------------- ------- -------------------------------------------------------------------------------------------------------------------------------------------------
conversations 12 ['id', 'name', 'model']
responses 15 ['id', 'model', 'prompt', 'system', 'prompt_json', 'options_json', 'response', 'response_json', 'conversation_id', 'duration_ms', 'datetime_utc']
responses_fts 15 ['prompt', 'response']
responses_fts_data 8 ['id', 'block']
responses_fts_idx 6 ['segid', 'term', 'pgno']
responses_fts_docsize 15 ['id', 'sz']
responses_fts_config 1 ['k', 'v']
AFAIU this happens because the database uses the FTS module for full-text search and this means that, upon creation and population of the responses_fts
table, 4 other tables are automatically created and populated (responses_fts_data
, responses_fts_idx
, responses_fts_docsize
, responses_fts_config
), this is explained here https://www.sqlite.org/fts5.html#fts5_data_structures
So I tired to using the suggested --replace
option and even though it leads to successful completion of the sqlite-diffable load
command, it then errors when trying to query the database (redacted content for brevity)
$ sqlite-diffable load logs_recovered.db dump/ --replace
$ sqlite-utils tables logs_recovered.db --columns --table --counts
Traceback (most recent call last):
File "/Users/giuli077/.local/bin/sqlite-utils", line 8, in <module>
sys.exit(cli())
^^^^^
...
File "/Users/giuli077/.local/pipx/venvs/sqlite-utils/lib/python3.11/site-packages/sqlite_utils/db.py", line 1315, in count_where
return self.db.execute(sql, where_args or []).fetchone()[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/giuli077/.local/pipx/venvs/sqlite-utils/lib/python3.11/site-packages/sqlite_utils/db.py", line 533, in execute
return self.conn.execute(sql, parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.DatabaseError: vtable constructor failed: responses_fts
I haven't dug into this specific error but I assume something goes wrong because the values of the response_fts_* tables that are automatically created when responses_fts is first loaded are replaced by values of the contents of those tables from the dump and that leads to some inconsistency in that table data.
I then decided to exclude from the load the host FTS-specific tables and only load the rest
$ rm dump/responses_fts_*
$ sqlite-diffable load logs_recovered_partial.db dump/
$ sqlite-utils tables logs_recovered_partial.db --columns --table --counts
table count columns
--------------------- ------- -------------------------------------------------------------------------------------------------------------------------------------------------
conversations 12 ['id', 'name', 'model']
responses 15 ['id', 'model', 'prompt', 'system', 'prompt_json', 'options_json', 'response', 'response_json', 'conversation_id', 'duration_ms', 'datetime_utc']
responses_fts 15 ['prompt', 'response']
responses_fts_data 8 ['id', 'block']
responses_fts_idx 6 ['segid', 'term', 'pgno']
responses_fts_docsize 15 ['id', 'sz']
responses_fts_config 1 ['k', 'v']
prompt_attachments 0 ['response_id', 'attachment_id', 'order']
_llm_migrations 14 ['name', 'applied_at']
attachments 0 ['id', 'type', 'path', 'url', 'content']
this works but I noticed there is some inconsistency with the number of rows in the responses_fts_data
(8 rows) and responses_fts_idx
(6 rows) which in the original db had more rows (21 and 19 respectively)
I am not an expert on how these tables work, I had a look at the changes in responses_fts_idx.ndjson
(which are vaguely human-readable compared to those in responses_fts_data.ndjson
), it seems there are a bunch of removed rows with very similar content "b''"
and also some others
diff --git a/dump/responses_fts_idx.ndjson b/dump/responses_fts_idx.ndjson
index f2446b4..285c8e9 100644
--- a/dump/responses_fts_idx.ndjson
+++ b/dump/responses_fts_idx.ndjson
@@ -1,19 +1,6 @@
[1, "b''", 2]
-[2, "b''", 2]
-[3, "b''", 2]
-[4, "b''", 2]
-[5, "b''", 2]
-[5, "b'0paths'", 4]
-[6, "b''", 2]
-[6, "b'0rou'", 4]
-[7, "b''", 2]
-[7, "b'0returni'", 4]
-[8, "b''", 2]
-[9, "b''", 2]
-[10, "b''", 2]
-[11, "b''", 2]
-[12, "b''", 2]
-[13, "b''", 2]
-[14, "b''", 2]
-[15, "b''", 2]
-[15, "b'0readl'", 4]
+[1, "b'0configured'", 4]
+[1, "b'0fr'", 6]
+[1, "b'0logg'", 8]
+[1, "b'0repr'", 10]
+[1, "b'0timed'", 12]
full-text search still seems to work on the re-loaded logs_recovered_partial.db
so I am not sure this is an actual problem.
My question(s): is this behaviour ok? Would the missing data in responses_fts_data
and responses_fts_idx
cause any potential issue that is not immediately obvious to me? or can I happlily rely on sqlite doing the right thing when it recreates and repopulates the responses_fts_* tables upon re-loading of responses_fts
?
Note: I do realise this question might not be closely related to the llm
cli tool and it is quite likely just a clarification about FTS and using sqlite-diffable
with databases that use it.
Because this is the use-case I encountered I decided to ask in this project anyway :).
It is perhaps an opportunity to consider explicitly pointing people at sqlite-diffable
as a way to version-control the conversations they have via llm
(maybe to add somewhere in the docs).