qsv: Ultra-fast, data-wrangling CLI toolkit for CSVs

qsv is a command line program for indexing, slicing, analyzing, splitting, enriching, validating & joining CSV files. Commands are simple, fast and composable:

Simple tasks are easy.
Performance trade-offs are exposed in the CLI.
Composition does not come at the expense of performance.

NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's release, along with additional features & commands for data-wrangling. See FAQ for more details. (NEW and EXTENDED commands are marked accordingly).

Available commands

Command	Description
apply	Apply series of string, profanity, similarity, date, currency & geocoding transformations to a CSV column. (NEW)
behead	Drop headers from a CSV. (NEW)
cat	Concatenate CSV files by row or by column.
count¹	Count the rows in a CSV file. (Instantaneous with an index.)
dedup²	Remove redundant rows. (NEW)
enum	Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value. (NEW)
exclude¹	Remove a set of CSV data from another set based on the specified columns. (NEW)
explode	Explode rows into multiple ones by splitting a column value based on the given separator. (NEW)
fill	Fill empty values. (NEW)
fixlengths	Force a CSV to have same-length records by either padding or truncating them.
flatten	A flattened view of CSV records. Useful for viewing one record at a time, e.g. `qsv slice -i 5 data.csv | qsv flatten`.
fmt	Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.) (EXTENDED)
foreach	Loop over a CSV to execute bash commands. (*nix only) (NEW)
frequency¹³	Build frequency tables of each column. (Uses parallelism to go faster if an index is present.)
headers	Show the headers of a CSV. Or show the intersection of all headers between many CSV files.
index	Create an index for a CSV. This is very quick & provides constant time indexing into the CSV file.
input	Read a CSV with exotic quoting/escaping rules.
join¹	Inner, outer, cross, anti & semi joins. Uses a simple hash index to make it fast. (EXTENDED)
jsonl	Convert newline-delimited JSON to CSV. (NEW)
lua	Execute a Lua script over CSV lines to transform, aggregate or filter them. (NEW)
partition	Partition a CSV based on a column value.
pseudo	Pseudonymise the values of the given column by replacing them with incremental identifiers. (NEW)
rename	Rename the columns of a CSV efficiently. (NEW)
replace	Replace CSV data using a regex. (NEW)
reverse²	Reverse order of rows in a CSV. (NEW)
sample¹	Randomly draw rows from a CSV using reservoir sampling (i.e., use memory proportional to the size of the sample). (EXTENDED)
search	Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows. (EXTENDED)
searchset	Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows. (NEW)
select	Select or re-order columns. (EXTENDED)
slice¹²	Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
sort	Sort CSV data. (EXTENDED)
split¹³	Split one CSV file into many CSV files of N chunks.
stats¹²³	Show basic types & statistics of each column in a CSV. (i.e., sum, min/max, min/max length, mean, stddev, variance, quartiles, IQR, lower/upper fences, skew, median, mode, cardinality & nullcount) (EXTENDED)
table²	Show aligned output of a CSV using elastic tabstops. (EXTENDED)
transpose²	Transpose rows/columns of a CSV. (NEW)
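Because every command reads and writes CSV, commands can be chained with ordinary shell pipes. The sketch below is illustrative only: it writes a small, made-up sample file to a temporary directory (the file name and its contents are assumptions, not part of qsv), then indexes it, counts its rows, and views a single record using the `slice -i ... | flatten` pattern from the table above. The qsv steps are skipped when qsv is not on your PATH.

```shell
# Create a tiny sample CSV in a temporary directory (hypothetical data).
workdir="$(mktemp -d)"
csv="$workdir/sample.csv"
cat > "$csv" <<'EOF'
id,name,qty
1,apple,3
2,banana,5
3,cherry,7
EOF

# Only run the qsv steps when qsv is actually installed.
if command -v qsv >/dev/null 2>&1; then
  qsv index "$csv"                      # create an index for the CSV
  qsv count "$csv"                      # count rows; uses the index when present
  qsv slice -i 1 "$csv" | qsv flatten   # view one record at a time
fi
```

Each stage streams its output to the next, so the same pattern extends to longer pipelines such as selecting columns before computing statistics.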

Installation

Binaries for Windows, Linux and macOS are available from GitHub.

Alternatively, you can compile from source by installing Cargo (Rust's package manager) and installing qsv using Cargo:

cargo install qsv

Compiling from this repository also works similarly:

git clone https://github.com/jqnatividad/qsv.git
cd qsv
cargo build --release

The compiled binary will end up in ./target/release/qsv.

Tab Completion

qsv's command-line options are quite extensive. Thankfully, since it uses docopt for CLI processing, we can take advantage of docopt.rs' tab completion support to make qsv easier to use at the command line (currently, only the bash shell is supported):

# install docopt-wordlist
cargo install docopt

# IMPORTANT: run these commands from the root directory of your qsv git repository
# to setup bash qsv tab completion
echo "DOCOPT_WORDLIST_BIN=\"$(which docopt-wordlist)\"" >> "$HOME/.bash_completion"
echo "source \"$(pwd)/scripts/docopt-wordlist.bash\"" >> "$HOME/.bash_completion"
echo "complete -F _docopt_wordlist_commands qsv" >> "$HOME/.bash_completion"

Performance Tuning

CPU Optimization

Modern CPUs have various features that the Rust compiler can take advantage of to increase performance. If you want the compiler to take advantage of these CPU-specific speed-ups, set this environment variable BEFORE installing/compiling qsv:

On Linux and macOS:

export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'

On Windows Powershell:

$env:CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'

Note that the resulting binary is only guaranteed to run on machines whose CPU supports the same features as the machine you installed/compiled from.
To list the valid values for target-cpu, including the one detected for your own CPU:

rustc --print target-cpus
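To see what target-cpu=native actually changes on your machine, one option is to compare the target features rustc enables with and without the flag. This is a sketch rather than part of qsv's build process: it assumes only that rustc is installed, and `features` is a made-up helper name for this example.

```shell
# Opt in to CPU-specific code generation for subsequent cargo builds.
export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'

# List the target features rustc enables for a given set of codegen flags.
features() {
  rustc --print cfg "$@" | grep target_feature | sort
}

# Skip the comparison when rustc is not installed.
if command -v rustc >/dev/null 2>&1; then
  echo "baseline features:"
  features
  echo "features with target-cpu=native:"
  features -C target-cpu=native
fi
```

Features that appear only in the second list are the ones a native build may rely on, and the ones another machine might lack.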

Memory Allocator

By default, qsv uses an alternative allocator - mimalloc, a performance-oriented allocator from Microsoft. If you want to use the standard allocator, use the --no-default-features flag when installing/compiling qsv, e.g.:

cargo install qsv --no-default-features

cargo build --release --no-default-features

Buffer size

Depending on your filesystem's configuration (e.g. block size, SSD vs. spinning disk, file system type), you can also fine-tune qsv's read/write buffers.

By default, the read buffer size is set to 16k. You can change it by setting the QSV_RDR_BUFFER_CAPACITY environment variable to a size in bytes.

The same is true with the write buffer (default: 32k) with the QSV_WTR_BUFFER_CAPACITY environment variable.
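For example, to try larger buffers for the current shell session, a sketch like the following could be used. The specific sizes (64 KiB and 128 KiB) are arbitrary illustrations, not recommended values; benchmark on your own system to find what works.

```shell
# Both variables are specified in bytes.
export QSV_RDR_BUFFER_CAPACITY=$((64 * 1024))    # read buffer
export QSV_WTR_BUFFER_CAPACITY=$((128 * 1024))   # write buffer

echo "read buffer:  $QSV_RDR_BUFFER_CAPACITY bytes"
echo "write buffer: $QSV_WTR_BUFFER_CAPACITY bytes"
```

To return to the defaults, unset both variables.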

Benchmarking for Performance

Use and fine-tune the benchmark script when tweaking qsv's performance to your environment. Don't be afraid to change the benchmark data and the qsv commands to something that is more representative of your workloads.

Use the generated TSV files to measure and compare performance across platforms. You'd be surprised how much performance varies across environments. For example, qsv's join and scramble operations perform abysmally on Ubuntu running under WSL on Windows, with join taking 172.44 seconds and scramble 237.46 seconds. On the same machine, in a VirtualBox VM running the same Ubuntu version, join takes 1.34 seconds and scramble 2.14 seconds: two orders of magnitude faster!

However, stats runs about twice as fast on WSL as on the VirtualBox VM: 2.80 seconds vs 5.33 seconds for the stats_index benchmark.
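The ratios behind these comparisons can be checked with a little arithmetic. In the sketch below, `speedup` is a made-up helper name for this example, and the timings are the ones quoted above.

```shell
# Print how many times larger the first timing is than the second.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 172.44 1.34   # join: WSL vs VirtualBox
speedup 237.46 2.14   # scramble: WSL vs VirtualBox
speedup 5.33 2.80     # stats_index: VirtualBox vs WSL
```

The same helper can be pointed at timings from your own benchmark runs when comparing environments.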

License

Dual-licensed under MIT or the UNLICENSE.

Sponsor

qsv was made possible by datHere - Data Infrastructure Engineering.
Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used.

Naming Collision

This project is unrelated to Intel's Quick Sync Video.

¹ Uses an index when available. join always uses indices.
² Loads the entire CSV into memory. Note that stats & transpose have modes that do not load the entire CSV into memory.
³ Runs parallel jobs by default (use the --jobs option to adjust).
