Option for importing CSV data using the SQLite .import mechanism #297
Full docs here: https://www.sqlite.org/draft/cli.html#csv. One catch: how this works has changed in recent SQLite versions: https://www.sqlite.org/changes.html
The "skip" feature is particularly important to understand. https://www.sqlite.org/draft/cli.html#csv says:
But the …
Potential workaround for missing …
https://unix.stackexchange.com/a/642364 suggests you can also use this to import from stdin, like so:
Here the …
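The trick from that answer, sketched here as an assumption in Python (driving the `sqlite3` CLI via `subprocess` and telling `.import` to read from `/dev/stdin`):

```python
import shutil
import subprocess


def import_csv_via_stdin(csv_text, table, query):
    """Sketch of the Stack Exchange trick: feed CSV to the sqlite3 CLI
    over stdin by pointing .import at /dev/stdin.

    Returns the query output, or None if no sqlite3 binary is on PATH."""
    if not shutil.which("sqlite3"):
        return None
    proc = subprocess.run(
        [
            "sqlite3", ":memory:",
            "-cmd", ".mode csv",
            "-cmd", ".import /dev/stdin {}".format(table),
            query,
        ],
        input=csv_text,
        capture_output=True,
        text=True,
    )
    return proc.stdout.strip()


print(import_csv_via_stdin(
    "id,name\n1,Cleo\n2,Pancakes\n",
    "animals",
    "select count(*) from animals",
))
```

This relies on `/dev/stdin` existing, so it should work on Linux and macOS but not on Windows.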
Should also take advantage of …
Another implementation option would be to use the CSV virtual table mechanism. This could avoid shelling out to the `sqlite3` CLI entirely. (Would be neat to produce a Python wheel of this, see https://simonwillison.net/2022/May/23/bundling-binary-tools-in-python-wheels/)

This would also help solve the challenge of making this optimization available to the …
I'm not too worried about …
Since there are all sorts of existing options for …

The thing I like about that …
Built a prototype of `--fast` for `sqlite-utils memory`.

Takes 13s, about the same time as calling … Without the …

Here's the prototype so far:

diff --git a/sqlite_utils/cli.py b/sqlite_utils/cli.py
index 86eddfb..1c83ef6 100644
--- a/sqlite_utils/cli.py
+++ b/sqlite_utils/cli.py
@@ -14,6 +14,8 @@ import io
import itertools
import json
import os
+import shutil
+import subprocess
import sys
import csv as csv_std
import tabulate
@@ -1669,6 +1671,7 @@ def query(
is_flag=True,
help="Analyze resulting tables and output results",
)
+@click.option("--fast", is_flag=True, help="Fast mode, only works with CSV and TSV")
@load_extension_option
def memory(
paths,
@@ -1692,6 +1695,7 @@ def memory(
save,
analyze,
load_extension,
+ fast,
):
"""Execute SQL query against an in-memory database, optionally populated by imported data
@@ -1719,6 +1723,22 @@ def memory(
\b
sqlite-utils memory animals.csv --schema
"""
+ if fast:
+ if (
+ attach
+ or flatten
+ or param
+ or encoding
+ or no_detect_types
+ or analyze
+ or load_extension
+ ):
+ raise click.ClickException(
+ "--fast mode does not support any of the following options: --attach, --flatten, --param, --encoding, --no-detect-types, --analyze, --load-extension"
+ )
+ # TODO: Figure out and pass other supported options
+ memory_fast(paths, sql)
+ return
db = sqlite_utils.Database(memory=True)
# If --dump or --save or --analyze used but no paths detected, assume SQL query is a path:
if (dump or save or schema or analyze) and not paths:
@@ -1791,6 +1811,33 @@ def memory(
)
+def memory_fast(paths, sql):
+ if not shutil.which("sqlite3"):
+ raise click.ClickException("sqlite3 not found in PATH")
+ args = ["sqlite3", ":memory:", "-cmd", ".mode csv"]
+ table_names = []
+
+ def name(path):
+ base_name = pathlib.Path(path).stem or "t"
+ table_name = base_name
+ prefix = 1
+ while table_name in table_names:
+ prefix += 1
+ table_name = "{}_{}".format(base_name, prefix)
+ return table_name
+
+ for path in paths:
+ table_name = name(path)
+ table_names.append(table_name)
+ args.extend(
+ ["-cmd", ".import {} {}".format(pathlib.Path(path).resolve(), table_name)]
+ )
+
+ args.extend(["-cmd", ".mode column"])
+ args.append(sql)
+ subprocess.run(args)
+
+
def _execute_query(
    db, sql, param, raw, table, csv, tsv, no_headers, fmt, nl, arrays, json_cols
):
Pushed that prototype to a branch.
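The shell-out approach in `memory_fast()` above can be sketched as a standalone function. This is my own minimal reconstruction under the same assumptions (a `sqlite3` binary on `PATH`, one `-cmd .import` per file), not the branch's code:

```python
import pathlib
import shutil
import subprocess
import tempfile


def query_csvs_with_sqlite3(csv_contents, sql):
    """csv_contents maps table name -> CSV text. Writes each to disk,
    imports them all into a single in-memory database via the sqlite3
    CLI and returns the SQL result, or None if sqlite3 is unavailable."""
    if not shutil.which("sqlite3"):
        return None
    with tempfile.TemporaryDirectory() as tmp:
        args = ["sqlite3", ":memory:", "-cmd", ".mode csv"]
        for table, text in csv_contents.items():
            path = pathlib.Path(tmp) / "{}.csv".format(table)
            path.write_text(text)
            # One .import per file, just like the prototype builds its args
            args.extend(["-cmd", ".import {} {}".format(path, table)])
        args.append(sql)
        proc = subprocess.run(args, capture_output=True, text=True)
        return proc.stdout.strip()


result = query_csvs_with_sqlite3(
    {"animals": "id,name\n1,Cleo\n2,Pancakes\n"},
    "select count(*) from animals",
)
print(result)
```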
The more I think about this the more I like it, particularly for …

I used a variant of this trick with parquet files here: https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/
Should consider how this could best handle creating columns that are integer and float, as opposed to just text. https://discord.com/channels/823971286308356157/823971286941302908/1019630014544748584 is a relevant discussion on Discord.

Even if you create the schema in advance with the correct column types, this import mechanism can put empty strings in blank float/integer columns, when ideally you would want nulls.

Related feature idea for …

Not sure how best to handle this for …
One solution suggested on Discord: …
As an aside, https://avi.im/blag/2021/fast-sqlite-inserts/ inspired me to try pypy, since that article claimed a 2.5x speedup using pypy compared to regular Python for a CSV import script. Setup:
I grabbed the first 760M of that … Then:
Compared to regular Python: …

So no perceivable performance difference.
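For context, the kind of pure-Python import script being benchmarked presumably looks something like this (my reconstruction, not the exact script): `csv.reader` feeding `executemany()`:

```python
import csv
import io
import sqlite3


def import_csv_python(csv_text, db, table):
    """Conventional pure-Python CSV import: csv.reader + executemany.
    This is the slow path that the sqlite3 CLI's .import beats."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)  # first row becomes the column names
    db.execute("CREATE TABLE {} ({})".format(table, ", ".join(headers)))
    placeholders = ", ".join("?" for _ in headers)
    db.executemany(
        "INSERT INTO {} VALUES ({})".format(table, placeholders), reader
    )
    return db.execute("select count(*) from {}".format(table)).fetchone()[0]


db = sqlite3.connect(":memory:")
count = import_csv_python("id,name\n1,Cleo\n2,Pancakes\n", db, "animals")
print(count)  # 2
```

Since this spends its time in C (`sqlite3` module) rather than interpreted Python, it is plausible that pypy has little to optimize, which would explain the flat result.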
Wrote that up as a TIL: https://til.simonwillison.net/python/pypy-macos
After reviewing sqlite-utils/docs/cli-reference.rst lines 251 to 279 (at d792dad), I'm going to implement a separate command instead.
Potential names/designs: …

Or more interestingly... what if it could accept multiple CSV files to create multiple tables?

Would still need to support creating tables with different names though. Maybe like this: …

I seem to be leaning towards …
Maybe …

I quite like …
The other nice thing about having this as a separate command is that I can implement a tiny subset of the overall …
Currently the only other use of …

So I think it's OK to use it to mean something slightly different for this command, since …
This would be awesome to have for multi-gig TSV and CSV files! I'm currently looking at a 10 hour countdown for one such import. Not a problem because I'm lazy and happy to let it run and check on it tomorrow.
As seen in https://til.simonwillison.net/sqlite/import-csv - `.mode csv` and then `.import school.csv schools` is hugely faster than importing via `sqlite-utils insert` and doing the work in Python - but it can only be implemented by shelling out to the `sqlite3` CLI tool; it's not functionality that is exposed to the Python `sqlite3` module.

An option to use this would be useful - maybe something like this: …