Table of Contents | |
---|---|
![]() |
qsv is a command line program for indexing, slicing, analyzing, splitting, enriching, validating & joining CSV files. Commands are simple, fast & composable. * Available Commands * Installation * Whirlwind Tour * Cookbook * FAQ * Changelog * Benchmarks * NYC School of Data 2022 presentation * Sponsor |
NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's May 2018 release. It also has numerous new features & 53 additional commands/subcommands/operations (for a total of 73). See FAQ for more details.
Command | Description |
---|---|
apply1 | Apply series of string, date, currency & geocoding transformations to a CSV column. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex & language detection). |
behead | Drop headers from a CSV. |
cat | Concatenate CSV files by row or by column. |
count2 | Count the rows in a CSV file. (Instantaneous with an index.) |
dedup34 | Remove redundant rows. |
enum | Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value. |
excel | Exports a specified Excel/ODS sheet to a CSV file. |
exclude2 | Removes a set of CSV data from another set based on the specified columns. |
explode | Explode rows into multiple ones by splitting a column value based on the given separator. |
extsort4 | Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm. |
fetch | Fetches HTML/data from web pages or web services for every row in a URL column. Comes with jql JSON query language support and optional Redis response caching. |
fill | Fill empty values. |
fixlengths | Force a CSV to have same-length records by either padding or truncating them. |
flatten | A flattened view of CSV records. Useful for viewing one record at a time. e.g. qsv slice -i 5 data.csv | qsv flatten . |
fmt | Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.) |
foreach1 | Loop over a CSV to execute bash commands. (not available on Windows) |
frequency25 | Build frequency tables of each column. (Uses multithreading to go faster if an index is present.) |
generate1 | Generate test data by profiling a CSV using Markov decision process machine learning. |
headers | Show the headers of a CSV. Or show the intersection of all headers between many CSV files. |
index | Create an index for a CSV. This is very quick & provides constant time indexing into the CSV file. Also enables multithreading for frequency , split , stats and schema commands. |
input | Read CSV data with special quoting, trimming, line-skipping and UTF-8 transcoding rules. |
join2 | Inner, outer, cross, anti & semi joins. Uses a simple hash index to make it fast. |
jsonl | Convert newline-delimited JSON (JSONL/NDJSON) to CSV. |
lua1 | Execute a Lua script over CSV lines to transform, aggregate or filter them. Embeds Lua 5.4.4. |
partition | Partition a CSV based on a column value. |
pseudo | Pseudonymise the value of the given column by replacing them with an incremental identifier. |
py1 | Evaluate a Python expression over CSV lines to transform, aggregate or filter them. Python's f-strings is particularly useful for extended formatting (Python 3.8+ required). |
rename | Rename the columns of a CSV efficiently. |
replace | Replace CSV data using a regex. |
reverse3 | Reverse order of rows in a CSV. Unlike the sort --reverse command, it preserves the order of rows with the same key. |
sample2 | Randomly draw rows (with optional seed) from a CSV using reservoir sampling (i.e., use memory proportional to the size of the sample). |
schema5 | Infer schema from CSV data and output in JSON Schema format. Uses multithreading to go faster if an index is present. See validate command. |
search | Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows. |
searchset | Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows. |
select | Select, re-order, duplicate or drop columns. |
slice23 | Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice). |
sniff | Quickly sniffs CSV details (delimiter, quote character, number of columns, data types, header row, preamble rows). |
sort4 | Sorts CSV data in alphabetical, numerical, reverse or random (with optional seed) order. |
split25 | Split one CSV file into many CSV files of N chunks. (Uses multithreading to go faster if an index is present.) |
stats235 | Infer data type & compute descriptive statistics for each column in a CSV (sum, min/max, min/max length, mean, stddev, variance, quartiles, IQR, lower/upper fences, skew, median, mode, cardinality & nullcount). Uses multithreading to go faster if an index is present. |
table3 | Show aligned output of a CSV using elastic tabstops. |
transpose3 | Transpose rows/columns of a CSV. |
validate4 | Validate CSV data with JSON Schema (See schema command). If no jsonschema file is provided, validates if a CSV conforms to the RFC 4180 standard. |
Pre-built binaries for Windows, Linux and macOS are available from GitHub.
There are three versions of qsv. qsv
supports features, with the pre-built binaries enabling all valid platform features6;
qsvlite
has all features disabled (half the size of qsv
); qsvdp
is optimized for use with DataPusher+, with only DataPusher+ relevant commands and the self-update engine removed (a sixth of the size of qsv
).
Alternatively, you can compile from source by
installing Cargo
(Rust's package manager)
and installing qsv
using Cargo:
cargo install qsv --features full
If you encounter compilation errors, ensure you're using the exact version of the dependencies qsv was built with by issuing:
cargo install qsv --locked --features full
Compiling from this repository also works similarly:
git clone git@github.com:jqnatividad/qsv.git
cd qsv
cargo build --release --features full
# or if you encounter compilation errors
cargo build --release --locked --features full
The compiled binary will end up in ./target/release/
.
To enable optional features, use cargo --features
(see Feature Flags for more info):
cargo install qsv --features apply,generate,lua,fetch,foreach,python,full
# or to build qsvlite
cargo install qsv --features lite
# or to build qsvdp
cargo install qsv --features datapusher_plus
# or when compiling from a local repo
cargo build --release --features apply,generate,lua,fetch,foreach,python,full
# for qsvlite
cargo build --release --features lite
# for qsvdp
cargo build --release --features datapusher_plus
Building qsv requires Rust stable - currently version 1.60.0.
qsv's command-line options are quite extensive. Thankfully, since it uses docopt for CLI processing, we can take advantage of docopt.rs' tab completion support to make it easier to use qsv at the command-line (currently, only bash shell is supported):
# install docopt-wordlist
cargo install docopt
# IMPORTANT: run these commands from the root directory of your qsv git repository
# to setup bash qsv tab completion
echo "DOCOPT_WORDLIST_BIN=\"$(which docopt-wordlist)"\" >> $HOME/.bash_completion
echo "source \"$(pwd)/scripts/docopt-wordlist.bash\"" >> $HOME/.bash_completion
echo "complete -F _docopt_wordlist_commands qsv" >> $HOME/.bash_completion
qsv recognizes UTF-8/ASCII encoded, CSV (.csv
) and TSV files (.tsv
and .tab
). CSV files are assummed to have "," (comma) as a delimiter,
and TSV files, "\t" (tab) as a delimiter. The delimiter is a single ascii character that can be set either by the --delimiter
command-line option or
with the QSV_DEFAULT_DELIMITER
environment variable or automatically detected when QSV_SNIFF_DELIMITER
is set.
When using the --output
option, note that qsv will UTF-8 encode the file and automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv
, tab for .tsv
and .tab
files.
JSONL/NDJSON files are also recognized and converted to CSV with the jsonl
command.
The fetch
command also produces JSONL files when its invoked without the --new-column
option.
The excel
command recognizes Excel and Open Document Spreadsheet(ODS) files (.xls
, .xlsx
, .xlsm
, .xlsb
and .ods
files).
qsv requires UTF-8 encoded (of which ASCII is a subset) input files. On startup, it scans the input if it's UTF-8 encoded (for files, the first 8k; for stdin, the entire buffer), and will abort if its not unless QSV_SKIPUTF8_CHECK
is set. On Linux and macOS, UTF-8 encoding is the default.
Should you need to reencode CSV/TSV files, you can use the input
command to transcode to UTF-8. It will replace all invalid UTF-8 sequences with �
. Alternatively, there are several utilities you can use to do so on Linux/macOS and Windows.
Unlike other modern operating systems, Windows' default encoding is UTF16-LE. This will cause problems when redirecting qsv's output to a CSV file and trying to open it with Excel (which ignores the comma delimiter, with everything in the first column):
qsv stats wcp.csv > wcpstats.csv
Which is weird, since you would think Microsoft Excel would properly recognize UTF16-LE encoded CSV files. Regardless, to create a properly UTF-8 encoded file, use the --output
option instead:
qsv stats wcp.csv --output wcpstats.csv
QSV_DEFAULT_DELIMITER
- single ascii character to use as delimiter. Overrides--delimeter
option. Defaults to "," (comma) for CSV files and "\t" (tab) for TSV files, when not set. Note that this will also set the delimiter for qsv's output to stdout. However, using the--output
option, regardless of this environment variable, will automatically change the delimiter used in the generated file based on the file extension - i.e. comma for.csv
, tab for.tsv
and.tab
files.QSV_SNIFF_DELIMITER
- when set, the delimiter is automatically detected. OverridesQSV_DEFAULT_DELIMITER
and--delimiter
option.QSV_NO_HEADERS
- when set, the first row will NOT be interpreted as headers. SupersedesQSV_TOGGLE_HEADERS
.QSV_TOGGLE_HEADERS
- if set to1
, toggles header setting - i.e. inverts qsv header behavior, with no headers being the default, and setting--no-headers
will actually mean headers will not be ignored.QSV_AUTOINDEX
- when set, automatically create an index when none is detected. Also automatically updates stale indices.QSV_SKIPUTF8_CHECK
- when set, skip UTF-8 encoding check. Otherwise, qsv scans the first 8k of files. For stdin, it scans the entire buffer.QSV_MAX_JOBS
- number of jobs to use for multithreaded commands (currentlydedup
,extsort
,frequency
,schema
,sort
,split
,stats
andvalidate
). If not set, max_jobs is set to the detected number of logical processors. See Multithreading for more info.QSV_REGEX_UNICODE
- if set, makessearch
,searchset
andreplace
commands unicode-aware. For increased performance, these commands are not unicode-aware and will ignore unicode values when matching and will panic when unicode characters are used in the regex.QSV_RDR_BUFFER_CAPACITY
- set to change reader buffer size (bytes - default when not set: 16384)QSV_WTR_BUFFER_CAPACITY
- set to change writer buffer size (bytes - default when not set: 65536)QSV_COMMENT_CHAR
- set to a comment character which will ignore any lines (including the header) that start with this character (default: comments disabled).QSV_LOG_LEVEL
- set to desired level (default - off, error, warn, info, trace, debug).QSV_LOG_DIR
- when logging is enabled, the directory where the log files will be stored. If the specified directory does not exist, qsv will attempt to create it. If not set, the log files are created in the directory where qsv was started. See Logging for more info.QSV_NO_UPDATE
- prohibit self-update version check for the latest qsv release published on GitHub.QSV_REDIS_CONNECTION_STRING
- thefetch
command can use Redis to cache responses. By default it connects toredis:127.0.0.1:6379
. Set to connect to another Redis instance.QSV_REDIS_TTL_SECONDS
- by default, Redis cached values have a time-to-live of 2,419,200 seconds (28 days).QSV_REDIS_TTL_REFRESH
- set to enable cache hits to refresh TTL of cached values.
Several dependencies also have environment variables that influence qsv's performance & behavior:
- Memory Management (mimalloc) When incorporating qsv into a data pipeline that runs in batch mode, particularly with very large CSV files using qsv commands that load entire CSV files into memory, you can fine-tune Mimalloc's behavior using its environment variables.
- Network Access (reqwest)
qsv uses reqwest for its
fetch
,validate
and--update
functions and will honor proxy settings set throughHTTP_PROXY
,HTTPS_PROXY
andNO_PROXY
.
NOTE: To get a list of all active qsv-relevant environment variables, run
qsv --envlist
.
qsv
has several features:
mimalloc
(default) - use the mimalloc allocator (see Memory Allocator for more info).apply
- enableapply
command. This swiss-army knife of CSV transformations is very powerful, but it has a lot of dependencies that increases both compile time and binary size.fetch
- enablefetch
command.generate
- enablegenerate
command.full
- enable to build qsv.lite
- enable to build qsvlite.datapusher_plus
- enable to build qsvdp.nightly
- enable to turn on nightly/unstable features in therand
andregex
creates when building with Rust nightly/unstable.
The following "power-user" commands can be abused and present "foot-shooting" scenarios.
lua
- enablelua
command.foreach
- enableforeach
command (not valid for Windows).python
- enablepy
command (requires Python 3.8+). Note that qsv will automatically use the currently activated python version when run in a virtual environment.
NOTE:
qsvlite
, as the name implies, always has non-default features disabled.qsv
can be built with any combination of the above features using the cargo--features
&--no-default-features
flags. The pre-builtqsv
binaries has all applicable features enabled for the target platform6.
Modern CPUs have various features that the Rust compiler can take advantage of to increase performance. If you want the compiler to take advantage of these CPU-specific speed-ups, set this environment variable BEFORE installing/compiling qsv:
On Linux and macOS:
export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'
On Windows Powershell:
$env:CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'
Do note though that the resulting binary will only run on machines with the
same architecture as the machine you installed/compiled from.
To find out your CPU architecture and other valid values for target-cpu
:
rustc --print target-cpus
# to find out what CPU features are used by the Rust compiler WITHOUT specifying target-cpu
rustc --print cfg | grep -i target_feature
# to find out what additional CPU features will be used by the Rust compiler when you specify target-cpu=native
rustc --print cfg -C target-cpu=native | grep -i target_feature
# to get a short explanation of each CPU target-feature
rustc --print target-features
By default, qsv uses an alternative allocator - mimalloc,
a performance-oriented allocator from Microsoft.
If you want to use the standard allocator, use the --no-default-features
flag
when installing/compiling qsv, e.g.:
cargo install qsv --path . --no-default-features
or
cargo build --release --no-default-features
To find out what memory allocator qsv is using, run qsv --version
. After the qsv version number, the allocator used is displayed ("standard
" or "mimalloc
"). Note that mimalloc is not supported on the x86_64-pc-windows-gnu
and arm
targets, and you'll need to use the "standard" allocator on those platforms.
Depending on your filesystem's configuration (e.g. block size, file system type, writing to remote file systems (e.g. sshfs, efs, nfs), SSD or rotating magnetic disks, etc.), you can also fine-tune qsv's read/write buffers.
By default, the read buffer size is set to 16k, you can change it by setting the environment
variable QSV_RDR_BUFFER_CAPACITY
in bytes.
The same is true with the write buffer (default: 64k) with the QSV_WTR_BUFFER_CAPACITY
environment variable.
Several commands support multithreading - stats
, frequency
, schema
and split
(when an index is available); dedup
, extsort
, sort
and validate
(no index required).
qsv will automatically spawn parallel jobs equal to the detected number of logical processors. Should you want to manually override this, use the --jobs
command-line option or the QSV_MAX_JOBS
environment variable.
To find out your jobs setting, call qsv --version
. The second to the last number is the number of jobs qsv will use for multithreaded commands. The last number is the number of logical processors detected by qsv. For example:
$ qsv --version
qsv 0.46.1-mimalloc-apply;fetch;generate;lua;python;-16-16
Shows that I'm running qsv version 0.46.1, with the mimalloc
allocator (instead of standard
), and I have the apply
, fetch
, generate
, lua
and python
features enabled, and qsv will be using 16 logical processors out of 16 detected when running multithreaded commands.
The apply geocode
command memoizes otherwise expensive geocoding operations and will report its cache hit rate. apply geocode
memoization, however, is not persistent across sessions.
The fetch
command also memoizes expensive REST API calls with its optional Redis support. It effectively has a persistent cache as the default time-to-live (TTL) before a Redis cache entry is expired is 28 days and Redis entries are persisted across restarts. Redis cache settings can be fine-tuned with the QSV_REDIS_CONNECTION_STRING
, QSV_REDIS_TTL_SECONDS
and QSV_REDIS_TTL_REFRESH
environment variables.
Rust strings are utf-8 encoded. As a result, qsv requires UTF-8 encoded files, primarily, for performance. It makes extensive use of str::from_utf8_unchecked
to skip utf-8 validation that str::from_utf8
will otherwise incur everytime raw bytes are converted to string.
For the most part, this shouldn't be a problem as UTF-8 is the de facto encoding standard. Should you need to process a CSV file with a different encoding, use the input
command to "transcode" to UTF-8.
Pre-built binaries compiled using Rust Nightly/Unstable are also available for download. These binaries are optimized for size and speed:
- compiled with the current Rust nightly/unstable at the time of release.
- stdlib is compiled from source, instead of using the pre-built stdlib. This ensures stdlib is compiled with all of qsv's release settings (link time optimization, opt-level, codegen-units, panic=abort, etc.). This is why we only have nightly release builds for select platforms (the platform of GitHub's action runners), as we need access to the "native hardware" and cannot cross-compile stdlib to other platforms.
- set
panic=abort
- removing panic-handling/formatting and backtrace code, making for smaller binaries. - enables unstable/nightly features on
regex
andrand
crates, that unlock performance/SIMD features on those crates.
Despite the 'unstable' label, these binaries are actually quite stable, given how Rust is made,
and the fact that qsv itself doesn't actually use any unstable feature flags, beyond activating the 'unstable' features in the rand
and regex
crates, which is really more about performance (that's why we can still compile with Rust stable). You only really loose the backtrace messages when qsv panics.
If you need to maximize speed/performance - use the nightly builds. If you prefer a "safer", rock-solid experience, use the stable builds.
If you want to really squeeze every little bit of performance from qsv, build it locally like how the Nightly Release Builds are built. Doing so will ensure CPU features are tailored to your hardware and you're using the latest Rust nightly. For example, on Ubuntu 22.04 LTS Linux:
rustup default nightly
rustup update
export RUSTFLAGS='-C target-cpu=native'
# to build qsv on nightly with all features. The binary will be in the target/release-nightly folder.
cargo build --profile release-nightly --bin qsv -Z build-std=std,panic_abort -Z build-std-features=panic_immediate_abort --features full,apply,generate,lua,fetch,foreach,python,nightly --target x86_64-unknown-linux-gnu
# to build qsvlite
cargo build --profile release-nightly --bin qsvlite -Z build-std=std,panic_abort -Z build-std-features=panic_immediate_abort --features lite,nightly --target x86_64-unknown-linux-gnu
# to build qsvdp
cargo build --profile release-nightly --bin qsvdp -Z build-std=std,panic_abort -Z build-std-features=panic_immediate_abort --features datapusher_plus,nightly --target x86_64-unknown-linux-gnu
Use and fine-tune the benchmark script when tweaking qsv's performance to your environment. Don't be afraid to change the benchmark data and the qsv commands to something that is more representative of your workloads.
Use the generated benchmark TSV files to meter and compare performance across platforms. You'd be surprised how performance varies
across environments - e.g. qsv's join
performs abysmally on Windows's WSL running Ubuntu 20.04 LTS, taking 172.44 seconds.
On the same machine, running in a VirtualBox VM at that with the same Ubuntu version, join
was done in 1.34 seconds -
two orders of magnitude faster!
However, stats
performs two times faster on WSL vs the VirtualBox VM - 2.80 seconds vs 5.33 seconds for the stats_index
benchmark.
Dual-licensed under MIT or the UNLICENSE.
qsv was made possible by |
---|
![]() |
Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used. |
This project is unrelated to Intel's Quick Sync Video.
Footnotes
-
enabled by optional feature flag. Not available on
qsvlite
. ↩ ↩2 ↩3 ↩4 ↩5 -
loads the entire CSV into memory. Note that
dedup
,stats
&transpose
have modes that do not load the entire CSV into memory. ↩ ↩2 ↩3 ↩4 ↩5 ↩6 -
The
foreach
feature is not available on Windows. Thepython
feature is not enabled on cross-compiled pre-built binaries as we don't have access to a native python interpreter for those platforms (aarch64, i686, and arm) on GitHub's action runners. Compile natively on those platforms with Python 3.8+ installed, if you want to enable thepython
feature. ↩ ↩2