Option for importing CSV data using the SQLite .import mechanism #297

Open · simonw opened this issue Jul 14, 2021 · 23 comments
Labels: cli-tool, enhancement

simonw commented Jul 14, 2021

As seen in https://til.simonwillison.net/sqlite/import-csv, running .mode csv and then .import school.csv schools is hugely faster than importing via sqlite-utils insert and doing the work in Python - but it can only be used by shelling out to the sqlite3 CLI tool; the functionality is not exposed to the Python sqlite3 module.

An option to use this would be useful - maybe something like this:

sqlite-utils insert blah.db blah blah.csv --fast
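
A minimal sketch of what that shell-out could look like, assuming the sqlite3 binary is on the PATH (the function name and error handling here are hypothetical, and paths containing spaces or quotes would need escaping):

import shutil
import subprocess


def fast_csv_import(db_path, table, csv_path):
    # Fail early if the sqlite3 CLI is unavailable
    if not shutil.which("sqlite3"):
        raise RuntimeError("sqlite3 binary not found on PATH")
    # -cmd runs a dot-command before the trailing argument is executed
    subprocess.run(
        [
            "sqlite3",
            db_path,
            "-cmd",
            ".mode csv",
            ".import {} {}".format(csv_path, table),
        ],
        check=True,
    )
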
simonw added the enhancement and cli-tool labels on Jul 14, 2021
simonw commented Jul 14, 2021

Full docs here: https://www.sqlite.org/draft/cli.html#csv

One catch: how this works has changed in recent SQLite versions: https://www.sqlite.org/changes.html

  • 2020-12-01 (3.34.0) - "Table name quoting works correctly for the .import dot-command"
  • 2020-05-22 (3.32.0) - "Add options to the .import command: --csv, --ascii, --skip"
  • 2017-08-01 (3.20.0) - "The .import command ignores an initial UTF-8 BOM."

The "skip" feature is particularly important to understand. https://www.sqlite.org/draft/cli.html#csv says:

There are two cases to consider: (1) Table "tab1" does not previously exist and (2) table "tab1" does already exist.

In the first case, when the table does not previously exist, the table is automatically created and the content of the first row of the input CSV file is used to determine the name of all the columns in the table. In other words, if the table does not previously exist, the first row of the CSV file is interpreted to be column names and the actual data starts on the second row of the CSV file.

For the second case, when the table already exists, every row of the CSV file, including the first row, is assumed to be actual content. If the CSV file contains an initial row of column labels, you can cause the .import command to skip that initial row using the "--skip 1" option.

But the --skip 1 option is only available in 3.32.0 and higher.
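
If an implementation needs to branch on that, one approach (helper name hypothetical) is to probe the CLI's version string before deciding whether --skip can be used:

import subprocess


def cli_supports_skip():
    # "sqlite3 --version" prints e.g. "3.39.0 2022-06-25 ..." -
    # .import --skip arrived in 3.32.0
    output = subprocess.run(
        ["sqlite3", "--version"], capture_output=True, text=True, check=True
    ).stdout
    version = tuple(int(part) for part in output.split()[0].split(".")[:3])
    return version >= (3, 32, 0)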

simonw commented Jul 14, 2021

Potential workaround for the missing --skip implementation is that the filename can be a command instead, so maybe it could shell out to tail -n +2 filename (which outputs the file starting at its second line, dropping the header row):

The source argument is the name of a file to be read or, if it begins with a "|" character, specifies a command which will be run to produce the input CSV data.
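
A sketch of that workaround (helper name hypothetical; assumes a Unix tail on the PATH, and that the target table already exists, since the header row is discarded):

import subprocess


def import_skipping_header(db_path, table, csv_path):
    # tail -n +2 outputs the file starting from its second line
    subprocess.run(
        [
            "sqlite3",
            db_path,
            "-cmd",
            ".mode csv",
            ".import '|tail -n +2 {}' {}".format(csv_path, table),
        ],
        check=True,
    )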

simonw commented Jul 14, 2021

https://unix.stackexchange.com/a/642364 suggests you can also use this to import from stdin, like so:

sqlite3 -csv $database_file_name ".import '|cat -' $table_name"

Here the sqlite3 -csv is an alternative to using .mode csv.
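
The same trick works from Python by handing the CSV to the subprocess on stdin - a sketch with placeholder file and table names:

import subprocess

with open("data.csv", "rb") as fp:
    # "|cat -" makes .import read the CSV from the inherited stdin
    subprocess.run(
        ["sqlite3", "-csv", "data.db", ".import '|cat -' mytable"],
        stdin=fp,
        check=True,
    )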

simonw commented Jul 14, 2021

Should also take advantage of .mode tabs to support sqlite-utils insert blah.db blah blah.csv --tsv --fast
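
A minimal sketch with placeholder names - the shell-out is identical to the CSV case except for the mode:

import subprocess

subprocess.run(
    ["sqlite3", "data.db", "-cmd", ".mode tabs", ".import data.tsv mytable"],
    check=True,
)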

simonw commented Jul 18, 2021

Another implementation option would be to use the CSV virtual table mechanism. This could avoid shelling out to the sqlite3 binary, but requires solving the harder problem of compiling and distributing a loadable SQLite module: https://www.sqlite.org/csv.html

(Would be neat to produce a Python wheel of this, see https://simonwillison.net/2022/May/23/bundling-binary-tools-in-python-wheels/)

This would also help solve the challenge of making this optimization available to the sqlite-utils memory command. That command operates against an in-memory database so it's not obvious how it could shell out to a binary.
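
A minimal sketch of the virtual table approach, assuming csv.c from that page has been compiled to ./csv.so (file and table names are placeholders, and some Python builds disable extension loading entirely):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
# e.g. built with: gcc -fPIC -shared csv.c -o csv.so
conn.load_extension("./csv.so")
# header is documented as a boolean argument to the csv module
conn.execute(
    "CREATE VIRTUAL TABLE temp.t1 USING csv(filename='data.csv', header=true)"
)
for row in conn.execute("SELECT * FROM temp.t1 LIMIT 5"):
    print(row)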

simonw commented Jul 18, 2021

I'm not too worried about sqlite-utils memory because if your data is large enough that you can benefit from this optimization you probably should use a real file as opposed to a disposable memory database when analyzing it.

simonw commented Jun 21, 2022

Relevant TIL: https://til.simonwillison.net/sqlite/one-line-csv-operations

simonw commented Jun 21, 2022

Since there are all sorts of existing options for sqlite-utils insert that won't work with this, maybe it would be better to have an entirely separate command - for example:

sqlite-utils fast-insert data.db mytable data.csv 

simonw commented Jun 21, 2022

The thing I like about that --fast option is that it could selectively use this alternative mechanism just for the files for which it can work (CSV and TSV files). I could also add a --fast option to sqlite-utils memory which could then kick in only for operations that involve just TSV and CSV files.
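
Something like this hypothetical predicate could gate that dispatch (deciding by file extension alone is a simplifying assumption):

def can_use_fast_path(paths):
    # Only CSV and TSV files can go through the sqlite3 .import shell-out
    return all(path.lower().endswith((".csv", ".tsv")) for path in paths)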

simonw commented Jun 21, 2022

Built a prototype of --fast for the sqlite-utils memory command:

% time sqlite-utils memory taxi.csv 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count' --fast
passenger_count  COUNT(*)  AVG(total_amount)
---------------  --------  -----------------
                 128020    32.2371511482553 
0                42228     17.0214016766151 
1                1533197   17.6418833067999 
2                286461    18.0975870711456 
3                72852     17.9153958710923 
4                25510     18.452774990196  
5                50291     17.2709248175672 
6                32623     17.6002964166367 
7                2         87.17            
8                2         95.705           
9                1         113.6            
sqlite-utils memory taxi.csv  --fast  12.71s user 0.48s system 104% cpu 12.627 total

Takes 13s - about the same time as calling sqlite3 :memory: ... directly as seen in https://til.simonwillison.net/sqlite/one-line-csv-operations

Without the --fast option that takes several minutes (262s = 4m22s)!

Here's the prototype so far:

diff --git a/sqlite_utils/cli.py b/sqlite_utils/cli.py
index 86eddfb..1c83ef6 100644
--- a/sqlite_utils/cli.py
+++ b/sqlite_utils/cli.py
@@ -14,6 +14,8 @@ import io
 import itertools
 import json
 import os
+import shutil
+import subprocess
 import sys
 import csv as csv_std
 import tabulate
@@ -1669,6 +1671,7 @@ def query(
     is_flag=True,
     help="Analyze resulting tables and output results",
 )
+@click.option("--fast", is_flag=True, help="Fast mode, only works with CSV and TSV")
 @load_extension_option
 def memory(
     paths,
@@ -1692,6 +1695,7 @@ def memory(
     save,
     analyze,
     load_extension,
+    fast,
 ):
     """Execute SQL query against an in-memory database, optionally populated by imported data
 
@@ -1719,6 +1723,22 @@ def memory(
     \b
         sqlite-utils memory animals.csv --schema
     """
+    if fast:
+        if (
+            attach
+            or flatten
+            or param
+            or encoding
+            or no_detect_types
+            or analyze
+            or load_extension
+        ):
+            raise click.ClickException(
+                "--fast mode does not support any of the following options: --attach, --flatten, --param, --encoding, --no-detect-types, --analyze, --load-extension"
+            )
+        # TODO: Figure out and pass other supported options
+        memory_fast(paths, sql)
+        return
     db = sqlite_utils.Database(memory=True)
     # If --dump or --save or --analyze used but no paths detected, assume SQL query is a path:
     if (dump or save or schema or analyze) and not paths:
@@ -1791,6 +1811,33 @@ def memory(
     )
 
 
+def memory_fast(paths, sql):
+    if not shutil.which("sqlite3"):
+        raise click.ClickException("sqlite3 not found in PATH")
+    args = ["sqlite3", ":memory:", "-cmd", ".mode csv"]
+    table_names = []
+
+    def name(path):
+        base_name = pathlib.Path(path).stem or "t"
+        table_name = base_name
+        prefix = 1
+        while table_name in table_names:
+            prefix += 1
+            table_name = "{}_{}".format(base_name, prefix)
+        return table_name
+
+    for path in paths:
+        table_name = name(path)
+        table_names.append(table_name)
+        args.extend(
+            ["-cmd", ".import {} {}".format(pathlib.Path(path).resolve(), table_name)]
+        )
+
+    args.extend(["-cmd", ".mode column"])
+    args.append(sql)
+    subprocess.run(args)
+
+
 def _execute_query(
     db, sql, param, raw, table, csv, tsv, no_headers, fmt, nl, arrays, json_cols
 ):

simonw commented Jun 21, 2022

Pushed that prototype to a branch.

simonw commented Sep 8, 2022

The more I think about this the more I like it - particularly for sqlite-utils fast-insert where differences in features aren't a problem.

I used a variant of this trick with parquet files here: https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/

simonw commented Sep 14, 2022

Should consider how this could best handle creating columns that are integer and float as opposed to just text.

https://discord.com/channels/823971286308356157/823971286941302908/1019630014544748584 is a relevant discussion on Discord. Even if you create the schema in advance with the correct column types, this import mechanism can put empty strings in blank float/integer columns when ideally you would want to have nulls.

Related feature idea for sqlite-utils transform:

Not sure how best to handle this for sqlite3 .import imports.
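
One possible post-import fixup, sketched here with placeholder database and table names, walks the schema and nulls out empty strings in numeric columns:

import sqlite3

db = sqlite3.connect("data.db")
columns = db.execute(
    "SELECT name, type FROM pragma_table_info('mytable')"
).fetchall()
for name, coltype in columns:
    if coltype.upper() in ("INTEGER", "FLOAT", "REAL"):
        # .import leaves '' in blank numeric cells; convert them to NULL
        db.execute('UPDATE "mytable" SET "{0}" = NULL WHERE "{0}" = \'\''.format(name))
db.commit()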

simonw commented Sep 14, 2022

One solution suggested on Discord:

wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
CREATE=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/create.sql`
INDEX=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/index.sql`
time sqlite3 products_new.db <<EOS
/* Optimisations. See: https://avi.im/blag/2021/fast-sqlite-inserts/ */;
PRAGMA page_size = 32768;
$CREATE
.mode ascii
.separator "\t" "\n"
.import --skip 1 en.openfoodfacts.org.products.csv all
$INDEX
EOS
# Converting empty to NULL for columns which are either FLOAT or INTEGER
time sqlite3 products.db ".schema all" | sed -nr 's/.*\[(.*)\] (INTEGER|FLOAT).*/\1/gp' | xargs -I % sqlite3 products.db -cmd "PRAGMA journal_mode=OFF;" "UPDATE [all] SET [%] = NULL WHERE [%] = '';"

simonw commented Sep 14, 2022

As an aside, https://avi.im/blag/2021/fast-sqlite-inserts/ inspired me to try PyPy, since that article claimed a 2.5x speedup using PyPy compared to regular Python for a CSV import script.

Setup:

brew install pypy3
cd /tmp
pypy3 -m venv venv
source venv/bin/activate
pip install sqlite-utils

I grabbed the first 760M of that https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv file (didn't wait for the whole thing to download).

Then:

time sqlite-utils insert pypy.db t en.openfoodfacts.org.products.csv --csv
  [------------------------------------]    0%
  [###################################-]   99% 
11.76s user 2.26s system 93% cpu 14.981 total

Compared to regular Python sqlite-utils doing the same thing:

time sqlite-utils insert py.db t en.openfoodfacts.org.products.csv --csv
  [------------------------------------]    0%
  [###################################-]   99%  
11.36s user 2.06s system 93% cpu 14.341 total

So: no perceptible performance difference.

simonw commented Sep 14, 2022

Wrote that up as a TIL: https://til.simonwillison.net/python/pypy-macos

simonw commented Sep 29, 2022

After reviewing sqlite-utils insert --help I'm confident that MOST of these options wouldn't make sense for a "fast" mode that just supports CSV and works by piping directly to the sqlite3 binary:

Options:
  --pk TEXT                 Columns to use as the primary key, e.g. id
  --flatten                 Flatten nested JSON objects, so {"a": {"b": 1}}
                            becomes {"a_b": 1}
  --nl                      Expect newline-delimited JSON
  -c, --csv                 Expect CSV input
  --tsv                     Expect TSV input
  --lines                   Treat each line as a single value called 'line'
  --text                    Treat input as a single value called 'text'
  --convert TEXT            Python code to convert each item
  --import TEXT             Python modules to import
  --delimiter TEXT          Delimiter to use for CSV files
  --quotechar TEXT          Quote character to use for CSV/TSV
  --sniff                   Detect delimiter and quote character
  --no-headers              CSV file has no header row
  --encoding TEXT           Character encoding for input, defaults to utf-8
  --batch-size INTEGER      Commit every X records
  --alter                   Alter existing table to add any missing columns
  --not-null TEXT           Columns that should be created as NOT NULL
  --default <TEXT TEXT>...  Default value that should be set for a column
  -d, --detect-types        Detect types for columns in CSV/TSV data
  --analyze                 Run ANALYZE at the end of this operation
  --load-extension TEXT     Path to SQLite extension, with optional :entrypoint
  --silent                  Do not show progress bar
  --ignore                  Ignore records if pk already exists
  --replace                 Replace records if pk already exists
  --truncate                Truncate table before inserting records, if table
                            already exists
  -h, --help                Show this message and exit.

I'm going to implement a separate command instead.

simonw commented Sep 29, 2022

Potential names/designs:

  • sqlite-utils fast data.db rows rows.csv
  • sqlite-utils insert-fast data.db rows rows.csv
  • sqlite-utils fast-csv data.db rows rows.csv

Or more interestingly... what if it could accept multiple CSV files to create multiple tables?

  • sqlite-utils fast data.db rows.csv other.csv

Would still need to support creating tables with different names though. Maybe like this:

  • sqlite-utils fast data.db -t mytable rows.csv -t othertable other.csv

I seem to be leaning towards fast as the command name, but as a standalone command name it's a bit meaningless - how do we know that's about CSV import and not about fast querying or similar?

simonw commented Sep 29, 2022

Maybe sqlite-utils fast-csv is right? It's not entirely clear that's an insert, though, as opposed to a faster version of in-memory querying in the style of sqlite-utils memory.

simonw commented Sep 29, 2022

I quite like sqlite-utils fast-csv - I think it's clear enough what it does, and running --help can clarify if needed.

simonw commented Sep 29, 2022

The other nice thing about having this as a separate command is that I can implement a tiny subset of the overall sqlite-utils insert features at first, and then add additional features in subsequent releases.

simonw commented Sep 29, 2022

Currently the only other use of -t is for this:

  -t, --table            Output as a formatted table

So I think it's OK to use it to mean something slightly different for this command, since sqlite-utils insert doesn't do any output of data in any format.
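
A hypothetical sketch of what that interface could look like in Click - pairing -t values with files by position is an assumption, since the interleaved -t parsing shown earlier isn't something Click supports directly:

import pathlib

import click


@click.command()
@click.argument("database")
@click.argument("paths", nargs=-1, required=True)
@click.option(
    "-t", "--table", "tables", multiple=True,
    help="Table name for each file, in order"
)
def fast_csv(database, paths, tables):
    "Fast import of CSV files using the sqlite3 CLI .import mechanism"
    # Pair explicit -t names with files by position; fall back to file stems
    names = list(tables) + [pathlib.Path(p).stem for p in paths[len(tables):]]
    for path, name in zip(paths, names):
        click.echo("would import {} into table {}".format(path, name))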

radusuciu commented:

This would be awesome to have for multi-gig TSV and CSV files! I'm currently looking at a 10 hour countdown for one such import. Not a problem because I'm lazy and happy to let it run and check on it tomorrow.
