All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- luajit: new command using LuaJIT, which is much faster than Lua dathere#500
- python: tweaks. Expanded usage text. Only show python version when logging is on. dathere#507
- fetch & fetchpost: apply clippy recommendation https://github.com/jqnatividad/qsv/commit/dd7220bce2811d9e8248c379af5d5c38da3b02d5
- excel: use winfo! macro https://github.com/jqnatividad/qsv/commit/7211ff214a58394d68c8c7484e8ef4505d75b482
- Removed anyhow dependency dathere#508
- Bump actions/stale from 5 to 6 by @dependabot in dathere#505
- Bump sysinfo from 0.26.3 to 0.26.4 by @dependabot in dathere#510
- Cargo update bump several indirect dependencies
- include Python 3.10 shared libraries when publishing for select platforms
- bump MSRV to Rust 1.64.0
- Pin Rust nightly to 2022-09-26
- python: corrected erroneous --helper example. Included hashhelper.py example.
- extsort: fixed --help bug (dathere#506)
- Simplify python support. For prebuilt binaries, Python 3.10 is now required and the python 3.10 shared libraries are bundled for select platforms. If you require an earlier version of Python (3.6 and up), you'll have to install/compile from source. dathere#492
- Smarter self update. --update can still be explicitly invoked even when self-update feature has been disabled. Further, if you compiled qsv from source, self-update will only notify you of new releases, instead of proceeding with self-update. dathere#490 and dathere#493
- lua: switch from Lua 5.4 to LuaJIT 2.1, primarily for performance dathere#495
- lua: when filtering using floats, "0.0" is false
- join: removed unneeded utf8 check
- search: simplify regex_unicode check
- fetch & fetchpost: optimize imports; remove unneeded utf8 check
- Bump anyhow from 1.0.64 to 1.0.65 by @dependabot in dathere#498
- Bump self_update from 0.31.0 to 0.32.0 by @dependabot in dathere#499
- add additional copyright holder to MIT License
- Improved publishing workflow for prebuilt binaries
- cargo update bumped several dependencies
- pin Rust nightly to 2022-09-14
- fix typos by @kianmeng in dathere#491
- python: better error handling. When mapping/filtering, python expression errors no longer cause a panic, but instead fail to map/filter as expected (when mapping, "<ERROR>" is returned, when filtering, the filter is not applied), and continue processing. Also, other errors are properly propagated instead of panicking. dathere#496
- lua: better error handling. When mapping/filtering, Lua errors no longer cause a panic, but instead fail to map/filter as expected (when mapping, "<ERROR>" is returned, when filtering, the filter is not applied), and continue processing. dathere#497
- added self_update feature, so users can build qsv without self-update engine dathere#483 and dathere#484
- search & searchset: --quick option returns first match row to stderr dathere#475
- python: make --batch size configurable dathere#485
- stats: added more implementation comments; standardize string creation
- replace: add conditional compilation to eliminate dead_code warning
- lua: when filtering, non-zero integers are true
- refactored workdir.rs test helpers
- refactored util:init_logger() to log command-line arguments
- Bump url from 2.3.0 to 2.3.1 by @dependabot in dathere#489
- Bump anyhow from 1.0.63 to 1.0.64 by @dependabot in dathere#478
- Bump sysinfo from 0.26.1 to 0.26.2 by @dependabot in dathere#477
- Bump robinraju/release-downloader from 1.4 to 1.5 by @dependabot in dathere#481
- cargo update bump indirect dependencies
- pin Rust nightly to 2022-09-07
- apply: added Multi-column subcommands by @udsamani in dathere#462
- stats: added --round option dathere#474
- created fail_format! macro for more concise error handling in dathere#471
- Move command usage text to beginning of cmd source code, so we don't need to move around deeplinks to usage texts from README dathere#467
- Optimize conditional compilation of various qsv binary variants, removing dead code dathere#473
- fetch & fetchpost: removed initial burst of requests, making the commands "friendlier" to rate-limited APIs
- search, searchset & replace: minor performance optimizations
- created dedicated rustfmt GitHub action workflow to ensure code is always rust formatted. Previously, rustfmt check was in Linux workflow.
- applied some clippy recommendations
- Bump actix-governor from 0.3.1 to 0.3.2 by @dependabot in dathere#461
- cargo update bumped several dependencies
- pin Rust nightly to 2022-08-31
- set RUSTFLAGS to emit=asm when compiling pre-built binaries for performance see http://likebike.com/posts/How_To_Write_Fast_Rust_Code.html#emit-asm
- extsort code was being compiled for qsvdp even if it was not enabled
- bump sysinfo from 0.25.2 to 0.26.0, fixing segfault on Apple Silicon
- fixed qsvnp on Windows so it doesn't look for python shared libraries even if python is not enabled
- fixed CliError::Other so it returns bad exitcode (exitcode 1) instead of incorrect_usage (exit code 2)
- @udsamani made their first contribution in dathere#462
- Major refactoring of main variants - removing redundant code and moving them to a new module - clitypes.rs. Added custom exit codes. Removed need to have --exitcode option in several commands as qsv now returns exit codes for ALL commands in a standard way. dathere#460
- Major refactoring of CI test helpers in workdir.rs
- py: use python interning to amortize allocs dathere#457
- search & searchset: return num of matches to stderr; add --quick option; remove --exitcode option dathere#458
- extsort: improved error handling
- fetch & fetchpost: better --report option handling dathere#451
- lua: faster number to string conversion using itoa and ryu
- replace: removed --exitcode option
- sortcheck: --json options now always cause full scan of CSV
- stats: expanded usage text, explicitly listing stats that require loading the entire CSV into memory. Mentioned data type inferences are guaranteed.
- cargo update bumped several dependencies
- pin Rust nightly to 2022-08-27
- py: batched python processing refactor. Instead of using one GILpool for one session, py now processes in batches of 30,000 rows, releasing memory after each batch. This resulted in memory consumption levelling out, instead of increasing to gigabytes of memory with very large files. As an added bonus, this made the py command ~30% faster in testing. 😄 dathere#456
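The batching idea in the py entry above can be sketched as follows. This is only an illustration in plain Python, not qsv's Rust/pyo3 code: `process_in_batches` and `transform` are hypothetical names, and the only figure taken from the entry is the 30,000-row batch size.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

BATCH_SIZE = 30_000  # batch size quoted in the changelog entry


def process_in_batches(
    rows: Iterable[str],
    transform: Callable[[str], str],
    batch_size: int = BATCH_SIZE,
) -> Iterator[str]:
    """Yield transformed rows while holding at most one batch in memory."""
    iterator = iter(rows)
    while True:
        # Pull the next batch; an empty list means the input is exhausted.
        batch: List[str] = list(islice(iterator, batch_size))
        if not batch:
            break
        for row in batch:
            yield transform(row)
        # The batch list is rebound on the next iteration, so memory used by
        # the previous batch can be reclaimed instead of accumulating.
```

Because each batch is dropped before the next one is read, peak memory depends on the batch size rather than on the size of the input file.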
- added sortcheck command dathere#445
- replace: added --exitcode and --progressbar options
- apply: improved usage text
- excel: replace --list-sheets option with expanded --metadata option dathere#448
- sortcheck improvements dathere#447
- extsort: improved error handling
- progressbar messages are now logged
- bump pyo3 from 0.16 to 0.17
- bump reqwest & redis "patches" further upstream
- cargo update bump several indirect dependencies
- pin Rust nightly to 2022-08-22
- extsort: fixed sysinfo segfault on Apple Silicon by pinning sysinfo to 0.25.2 dathere#446
- tojsonl: fixed panic with stdin input
- fetchpost: added formdata to report dathere#434
- search & searchset: added Custom exit codes; --exitcode option dathere#439
- search & searchset: added --progressbar option
- progressbars are now optional by default; added QSV_PROGRESSBAR env var to override setting
- search, searchset & replace: added mem-limit options for regex-powered commands dathere#440
- Bump jql from 4.0.7 to 5.0.0 by @dependabot in dathere#436
- progressbars are now off by default, and are disabled with stdin input dathere#438
- lua & py: improved error-handling when loading script files
- stats: changed to using AtomicBool instead of OnceCell, use with_capacity in hot compute loop to minimize allocs - hyperfine shows 18% perf increase with these changes
- self-update now gives a proper error message when GitHub is rate-limiting updates
- cargo update bump several dependencies
- document MSRV policy
- pin Rust Nightly to 2022-08-16
- fixed stdin input causing an error when progressbars are enabled dathere#438
- fetchpost: new command that uses HTTP POST, as opposed to fetch - which uses HTTP GET (difference between HTTP GET & POST methods) dathere#431
- Added qsvnp binary variant to prebuilt binaries - qsv with all the features EXCEPT python
- fetch: refactor report parameter processing dathere#426
- Bump serde from 1.0.142 to 1.0.143 by @dependabot in dathere#423
- Bump ahash from 0.7.6 to 0.8.0 by @dependabot in dathere#425
- Bump serial_test from 0.8.0 to 0.9.0 by @dependabot in dathere#428
- Bump anyhow from 1.0.60 to 1.0.61 by @dependabot in dathere#427
- Bump sysinfo from 0.25.1 to 0.25.2 by @dependabot in dathere#429
- Bump actix-governor from 0.3.0 to 0.3.1 by @dependabot in dathere#430
- cargo update bump various indirect dependencies
- pin Rust nightly to 2022-08-11
- change MSRV to 1.63
- excel: fixed empty sheet handling dathere#422
- py: qsv uses the present working directory to find python shared library
- py: show python version info on startup
- publish qsvnp - another binary variant with all features except python
- bumped once_cell from 1.12 to 1.13
- use reqwest upstream with MSRV from 1.49 to 1.56; lazy_static to once_cell
- update calamine fork with chrono time feature disabled
- BetterTOML reformat cargo.toml
- pin Rust nightly to 2022-08-06
- excel: remove unneeded checkutf8 for writer
- fetch: Reformatted report so response is the last column; do not allow --timeout to be zero; progressbar refresh set at 5 times/sec; show name of generated report at the end. dathere#404
- fetch: report improvements. Remove qsv_fetch_ column prefix in short report; change progressbar format to default characters dathere#406
- excel: make --sheet case-insensitive; better error-handling dathere#416
- py: add detected python version to --version option
- Only do input utf8-encoding check for commands that need it. dathere#419
- Bump cached from 0.37.0 to 0.38.0 by @dependabot in dathere#407
- Bump anyhow from 1.0.58 to 1.0.59 by @dependabot in dathere#408
- Bump serde from 1.0.140 to 1.0.141 by @dependabot in dathere#409
- Bump ryu from 1.0.10 to 1.0.11 by @dependabot in dathere#414
- Bump anyhow from 1.0.59 to 1.0.60 by @dependabot in dathere#413
- Bump mlua from 0.8.2 to 0.8.3 by @dependabot in dathere#412
- Bump actions/setup-python from 4.1.0 to 4.2.0 by @dependabot in dathere#411
- Bump flexi_logger from 0.22.5 to 0.22.6 by @dependabot in dathere#417
- Bump indicatif from 0.16.2 to 0.17.0
- Bump chrono from 0.4.19 to 0.4.20
- Bump qsv-dateparser from 0.4.2 to 0.4.3
- pin Rust nightly to 2022-08-03
- fixed double progressbars dathere#405
- fix utf8 encoding check to resolve #410 dathere#418
- fetch: add elapsed time, retries to reports; add --max-retries option dathere#395
- lua: better error messages dathere#399
- python: better error messages dathere#400
- fetch: improved error handling dathere#402
- stats: improve performance by using unwrap_unchecked in hot compute loop
- Bump indicatif from 0.16.2 to 0.17.0 dathere#403
- Bump mlua from 0.8.1 to 0.8.2 by @dependabot in dathere#394
- Bump console from 0.15.0 to 0.15.1 by @dependabot in dathere#398
- Bump grex from 1.3 to 1.4
- Cargo update bump various dependencies
- pin Rust nightly to 2022-07-29
- excel: fixed --sheet option bounds checking dathere#401
- fetch: add redis --flushdb option dathere#387
- fetch: add --report & --cache-error options. --report creates a separate report file, detailing the URL used, the response, the HTTP status code, and whether it's a cache hit. --cache-error is used to also cache errors - i.e. identical fetches will return the cached error. Otherwise, fetch will request the URL again. dathere#393
- fetch: fast defaults. Now tries to go as fast as possible, leveraging dynamic throttling (using RateLimit and Retry-After headers) but aborting after 100 errors. Also added a separate error progress bar. dathere#388
- Smarter tojsonl. Now scans CSV file and infers data types and uses the appropriate JSON data type dathere#389
- tojsonl is also multithreaded dathere#392
- stats: use unwrap_unchecked for even more performance dathere#390
- fetch: refactor dynamic throttling dathere#391
- Bump sysinfo from 0.24.6 to 0.24.7 by @dependabot in dathere#384
- cargo bump update several dependencies
- pin Rust nightly to 2022-07-23
- fetch: fix --http-header parsing bug dathere#386
- added tojsonl command - CSV to JSONL dathere#380
- excel: additional --date-whitelist modes dathere#368
- fetch: added Redis connection pooling dathere#373
- python: remove unneeded python3.dll generation dathere#379
- stats: minor performance tweaks
- fetch: minor performance tweaks - larger/faster in-mem cache
- Bump cached from 0.34.1 to 0.37.0 by @dependabot in dathere#367 and dathere#381
- Bump regex from 1.5.6 to 1.6.0 by @dependabot in dathere#369
- Bump reverse_geocoder from 3.0.0 to 3.0.1 by @dependabot in dathere#377
- Bump actions/setup-python from 4.0.0 to 4.1.0 by @dependabot in dathere#376
- Bump serde from 1.0.138 to 1.0.139 by @dependabot in dathere#374
- cargo update bump several dependencies
- larger logfiles (from 1mb to 10mb) before rotating
- apply select clippy recommendations
- pin Rust nightly to 2022-07-13
- Use option_env! macro to trap errors dathere#378
- Pin Rust nightly to 2022-07-02
- fixed redis dev-dependency which mistakenly added a non-existent ahash feature. This prevented publishing of qsv 0.58.1 to crates.io.
- Universal clippy handling. Added allow clippy hint section in main for clippy lints we allow/ignore, and added exceptions as needed throughout the codebase. This means clippy, even in pedantic/nursery/perf mode will have no warnings. dathere#365
- reqwest deflate compression support dathere#366
- fetch: expanded --http-header explanation/example
- fetch: refactored --timeout processing https://github.com/jqnatividad/qsv/commit/3454ed068f0f243473a0f66520f90f55ece4bf49
- fetch: prioritized ACCEPT-ENCODING to prioritize brotli first, gzip second, and deflate last for compression https://github.com/jqnatividad/qsv/commit/c540d22b630df424a8516bb07af9bbf80150d67b
- updated patched crates, particularly our rust-csv fork with more clippy recommendations applied
- cargo update bump actix-http from 3.2.0 to 3.2.1
- excel: fixed docopt usage text which prevents --help from working
- extsort: better parsing/error-handling, instead of generic panic when no input/output is specified. This also allows --help to be displayed.
- excel: add --list-sheets option dathere#364
- fetch: added 0 option to --rate-limit to go as fast as possible. CAUTION: Only use this with APIs that have RateLimit headers so qsv can automatically down-throttle as required. Otherwise, the fetch job will look like a Denial-of-Service attack. https://github.com/jqnatividad/qsv/commit/e4ece60aea3720b872119ca7a8ad3666dad033e7
- fetch: added --max-errors option. Maximum number of errors before aborting
- progress bars now display per_sec throughput while running jobs, not just at the end of a job
- fetch: for long-running fetch jobs, the progress bar will update at least every three seconds, so it doesn't look like the job is frozen/stuck.
- fetch: added additional verbiage to usage text on how to pass multiple key-value pairs to the HTTP header
- fetch: made RateLimit jitters (required to avoid thundering herd issues as per the RateLimit spec) shorter, as they were too long.
- pin Rust nightly to 2022-07-01
- applied various clippy recommendations
- bumped serde from 1.0.137 to 1.0.138
- added stale warning to benchmarks. The benchmarks have not been updated since qsv 0.20.0.
- cargo update bumped several other dependencies
- remove unneeded sleep pause before fetch ratelimit test
- fetch: higher default settings which makes fetch much faster
- excel: date support dathere#357
- added hardware survey reminiscent of Steam's Hardware Survey. Only sent when checking for updates with no personally identifiable information. dathere#358
- fetch: ensure URLs are properly encoded dathere#359
- Bump jql from 4.0.4 to 4.0.5 by @dependabot in dathere#356
- cargo bump update several dependencies
- change MSRV to Rust 1.62.0
- pin Rust Nightly to 2022-06-29
- fetch: is single-threaded again. It turns out it was more complicated than I hoped. Will revisit making it multi-threaded once I sort out the sync issues.
- fetch is now multithreaded! 🚀🚀🚀 - with threadsafe memoized caching, dynamic throttling & http2 adaptive flow control dathere#354
- fetch: do more expensive ops behind cache dathere#355
- applied BetterTOML formatting to Cargo.toml
- exclude, flatten & join: applied clippy recommendation for borrow_deref_ref https://github.com/jqnatividad/qsv/commit/bf1ac90185947a6d923613f17c4af616631dc149
- utils: minor cleanup of version fn https://github.com/jqnatividad/qsv/commit/217702b51785f51d6924608a5122c405ff384fef
- validate: perf tweak - use collect_into_vec to reduce allocations
- apply: perf tweak - use collect_into_vec to reduce allocations
- removed thiserror dependency
- pin Rust Nightly to 2022-06-19
- Bump robinraju/release-downloader from 1.3 to 1.4 by @dependabot in dathere#351
- Bump crossbeam-channel from 0.5.4 to 0.5.5 by @dependabot in dathere#352
- Bump redis patch
- cargo update bump several other dependencies
- fetch: better error handling dathere#353
- fetch: performance tweaks dathere#350
- Bump titlecase from 1.1.0 to 2.0.0 by @dependabot in dathere#349
- Bump sysinfo from 0.24.3 to 0.24.4
- fetch: convert non-persistent cache from an Unbound cache to a Sized LRU cache, so we don't run out of memory if the file being processed is very large and cache hits are low. https://github.com/jqnatividad/qsv/commit/4349fc9389a32c0d9544be824d1f42b1af65974d
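To illustrate why a sized LRU cache bounds memory where an unbounded cache does not, here is a minimal Python sketch. It is not the cache used by fetch; `SizedLruCache` and `get_or_fetch` are hypothetical names chosen for this example, and the capacity is arbitrary.

```python
from collections import OrderedDict
from typing import Callable


class SizedLruCache:
    """Keep at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get_or_fetch(self, key: str, fetch: Callable[[str], str]) -> str:
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: fetch, store, then evict the oldest entry if over capacity.
        value = fetch(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return value

    def __len__(self) -> int:
        return len(self._entries)
```

With a low hit rate on a very large input, the number of stored entries never exceeds the capacity, at the cost of re-fetching keys that were evicted.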
- fetch: preemptively throttle down before we hit the ratelimit quota
- fetch: add "dynamic throttling". If response header has rate-limit or retry-after fields, fetch will dynamically throttle itself as needed. dathere#348
- cargo update bump dependencies
- Pin Rust nightly to 2022-06-14
- fetch: more robust/consistent error handling dathere#347
- removed reqwest 0.11.10 patch and used reqwest 0.11.11
- Pin Rust nightly to 2022-06-13
- Pin Rust nightly to 2022-06-12
- fetch: fix invalid jsonl response dathere#346
- apply: now multithreaded with rayon (up to 10x 🚀🚀🚀 faster!) dathere#342
- apply: refactor hot loop to use enums instead of nested if dathere#343
- sniff: more idiomatic vec loop https://github.com/jqnatividad/qsv/commit/2a70134bf45f9485bcbb27579f92f89abb7b6bb1
- validate: optimizations (up to 20% 🚀 faster) https://github.com/jqnatividad/qsv/commit/0f0be0aba0a6d0cd10f5c96fd17ffd726d3231d1
- excel: optimize trimming https://github.com/jqnatividad/qsv/commit/780206a575d40cf759abd295aa91da640e5febed
- various whirlwind tour improvements (more timings, flows/reads better, removed non-sequiturs)
- improved progress bar prep (unstyled progress bar is not momentarily displayed, standardized across cmds)
- bumped reqwest patch to latest upstream https://github.com/jqnatividad/qsv/commit/cb0eb1717f07d8481211e289e6762d9b994fac18
- Bump actions/setup-python from 3.1.2 to 4.0.0 by @dependabot in dathere#339
- Bump mlua from 0.7.4 to 0.8.0 by @dependabot in dathere#340
- fixed error-handling in util::count_rows() dathere#341
- do not panic when index is stale https://github.com/jqnatividad/qsv/commit/36dbd79591e3ae1e9c271ec3c0272599cc8695de
- fetch: fixed docopt arg processing so --help text is displayed properly https://github.com/jqnatividad/qsv/commit/0cbf7017ebc7f28fa67951133e3bac7d2c7a1368
- excel: more robust error handling https://github.com/jqnatividad/qsv/commit/413c693320653d085b5cca48ca32b0d371ccd240
- stats: added outer fences to help identify extreme and mild outliers dathere#337
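As a rough illustration of how inner and outer fences separate mild from extreme outliers, the sketch below uses the conventional Tukey multipliers (1.5 and 3.0 times the interquartile range). Those multipliers and the function names are assumptions for this example; the entry above does not state which values stats uses.

```python
from typing import Tuple

Fence = Tuple[float, float]


def tukey_fences(q1: float, q3: float) -> Tuple[Fence, Fence]:
    """Return ((lower_inner, upper_inner), (lower_outer, upper_outer))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # assumed multiplier: 1.5
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # assumed multiplier: 3.0
    return inner, outer


def classify(value: float, q1: float, q3: float) -> str:
    """Label a value as 'extreme', 'mild', or 'none' relative to the fences."""
    inner, outer = tukey_fences(q1, q3)
    if value < outer[0] or value > outer[1]:
        return "extreme"
    if value < inner[0] or value > inner[1]:
        return "mild"
    return "none"
```

Values between an inner and an outer fence are treated as mild outliers, and values beyond an outer fence as extreme outliers.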
- stats: change skewness algorithm to use quantile-based measures
- whirlwind tour: added more stats about stats command; updated stats output with the additional columns
- pin nightly to 2022-06-07
- cargo update bump several dependencies
- fixed stats quartile tests, as the results were being prematurely truncated, causing false negative test results
- stats: changed --dates-whitelist option to use "all" instead of "<null>"; better usage text; more perf tweaks; more tests dathere#334
- stats: mem alloc tweaks & date-inferencing optimization dathere#333
- apply: improved usage text about --formatstr https://github.com/jqnatividad/qsv/commit/2f18565caec6c6e900f776c5f6f3e1adf4c9b6e1
- sample: added note about why we don't need crypto secure random number generators https://github.com/jqnatividad/qsv/commit/3384d1a9630bc1033ff67db5dcbf48c067e97728
- excel & slice: avoid panic by replacing abs with unsigned_abs https://github.com/jqnatividad/qsv/commit/7e2b14a5de67e70ee0b26ea0eff83462dbc77a0a
- turn on once_cell parking_lot feature for storage efficiency/performance https://github.com/jqnatividad/qsv/commit/849548cde8bc9c2d96ddf464f2578faf63d6e9cf
- applied various cargo +nightly clippy optimizations
- pin nightly build to Rust Nightly 2022-06-04
- made various optimizations to our csv fork https://github.com/BurntSushi/rust-csv/compare/master...jqnatividad:perf-tweaks
- cargo bump updated several dependencies
- added QSV_PREFER_DMY environment variable. dathere#331
- reorganized Environment Variables section in README https://github.com/jqnatividad/qsv/commit/f25bbf0361fcb7b960d45590ca35b2e676a4497d
- logging: longer END snippet to make it easier to match START/END pairs
- added Boston 311 sample data to tests
- Bump uuid from 1.1.0 to 1.1.1 by @dependabot in dathere#332
- cargo update bumped packed_simd_2 from 0.3.7 to 0.3.8
- Instead of panicking, do proper error-handling on IO errors when checking utf-8 encoding https://github.com/jqnatividad/qsv/pull/331/commits/37b4482aae77563995f13a15f73ca8849df6a27d
- added qsv GitHub social media image which
- stats: added sum integer overflow handling. If sum overflows, instead of panicking, the value 'OVERFLOW' is returned
- upgraded to faster qsv_dateparser 0.4.2, which parses the slash_dmy/slash_mdy date formats earlier in the parse tree, as those formats are more commonly used.
- nightly builds are now bundled into the main distribution zip archive.
- renamed qsv_rust_version_info.txt to qsv_nightly_rust_version.info.txt in the distribution zip archive to make it clearer that it only pertains to nightly builds
- cargo bump update several dependencies
- nightly distribution zip archives have been removed, now that the nightly builds are in the main zip archive.
- stats: prefer_dmy date-parsing preference was not used when computing date min/max
- stats: prefer_dmy setting was not initialized properly the first time it's called
- nightly build self-update now works properly, now that they are bundled into the main distribution zip archive
- apply: DATEFMT subcommand now has a --prefer-dmy option. dathere#328
- stats and schema: add --prefer-dmy option. dathere#329
- sniff: can now sniff Date and Datetime data types. dathere#330
- sniff: added to qsvdp - DataPusher+-optimized qsv binary
- added DevSkim security linter GitHub Action to CI
- applied various clippy pedantic and nursery recommendations
- cargo bump updated several dependencies, notably qsv-dateparser with its new DMY format parsing capability and qsv-sniffer with its Date and Datetime data type detection
- Closed all cargo-audit findings (dathere#167), as the latest qsv-dateparser eliminated qsv's chrono dependency.
- Properly create qsv_rust_version_info.txt in nightly builds
- Fixed multithreading link in Features Flag section
- sniff: sniff field names in addition to field data types in dathere#317
- sniff: intelligent sampling. In addition to specifying the number of first n rows to sample, when --sample is between 0 and 1 exclusive, it's treated as a percentage of the CSV to sample (e.g. 0.20 is 20 percent). If it's zero, the entire file is sampled. dathere#318
- schema: add --stdout option in dathere#321
- stats: smart date inferencing with field-name date whitelist. Also did some minor tweaks for a little more performance in dathere#327
- rename: added to qsvdp - DataPusher+-optimized qsv binary
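The --sample rule described in the sniff entry above can be read as the following decision. This is a sketch only, not sniff's implementation: `rows_to_sample` is a hypothetical helper, and the rounding of the percentage case and the capping at the total row count are assumptions.

```python
def rows_to_sample(sample: float, total_rows: int) -> int:
    """Translate a --sample value into a number of rows to examine."""
    if sample < 0:
        raise ValueError("sample must not be negative")
    if sample == 0:
        # Zero means the entire file is sampled.
        return total_rows
    if sample < 1:
        # Between 0 and 1 exclusive: a fraction of the file (0.20 is 20 percent).
        # Rounding to the nearest row is an assumption of this sketch.
        return round(total_rows * sample)
    # 1 or more: the first n rows, capped at the file size (assumed).
    return min(int(sample), total_rows)
```

For a 1,000-row file, `rows_to_sample(0.20, 1000)` gives 200 rows, `rows_to_sample(0, 1000)` gives all 1,000, and `rows_to_sample(100, 1000)` gives the first 100.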
- Switch to qsv_sniffer fork of csv_sniffer. qsv_sniffer has several optimizations (field name sniffing, utf-8 encoding detection, SIMD speedups, etc.) that enabled the added sniff features above. dathere#320
- Bump uuid from 1.0.0 to 1.1.0 by @dependabot in dathere#323
- Improved Performance Tuning section with more details about UTF-8 encoding, and Nightly builds
- Updated list of commands that use an index
- cargo update bump dependencies, notably jql 4.0.3 to 4.0.4, and cookie_store from 0.16.0 to 0.16.1
- pinned Rust Nightly to 2022-05-23. Later Rust Nightly releases "broke" packed-simd dependency which prevented us from building qsv's nightly build. (see apache/arrow-rs#1734)
- disable simd acceleration feature on our csv-sniffer fork so we can publish on crates.io
- input: added --auto-skip CSV preamble option in dathere#313
- sniff: support non-utf8 files; flexible detection now works; rename --len to --sample in dathere#315
- sniff: added is_utf8 property in dathere#316
- added RFC4180 section to README
- validate: improve RFC4180 validation messages in dathere#309
- stats: nullcount is a "streaming" statistic and is now on by default in dathere#311
- schema: refactored stdin processing
- Made logging more consistent in dathere#314
- bumped MSRV to Rust 1.61.0
- use a qsv-optimized fork of csv-sniffer (https://github.com/jqnatividad/csv-sniffer/tree/non-utf8-qsv), that fixes flexible detection, reads non-utf8 encoded files, reports if a file is utf8-encoded, and uses SIMD/CPU features to accelerate performance.
- applied select pedantic clippy recommendations
- bumped several dependencies, notably regex from 1.5.5 to 1.5.6
- py: enabled abi3 feature properly, so qsv now works with higher versions of python over v3.8
- validate: add --json & --pretty-json options for RFC4180 check in dathere#303
- qsvdp: add validate command in dathere#306
- added rust nightly version info to nightly builds
- apply select clippy::pedantic recommendations in dathere#305
- Bump actions/checkout from 2 to 3 by @dependabot in dathere#300
- sniff and validate json errors are now JSONAPI compliant
- cargo update bump several dependencies
- removed unused debian package publishing workflow
- sniff: preamble and rowcount fixes in dathere#301
- schema: fixed stdin bug in dathere#304
- Fixed conditional compilation directives that caused qsvdp build to fail.
- dedup: add --sorted option in dathere#286
- sniff: add --json and --pretty-json options in dathere#297
- added rust version info to nightly build zip files so users can see which Rust nightly version was used to build the nightly binaries
- stats: added more --infer-dates tests
- number of processors used now logged when logging is on
- python: nightly build optimization in dathere#296
- moved Performance Tuning to its own markdown file, and included it in the TOC
- bumped several dependencies, notably rayon, jsonschema and pyo3
- moved FAQ from Wiki to Discussions
- added clone count badge
- python: should now work with python 3.8, 3.9 or 3.10
- dedup and sort are now multithreaded with rayon in dathere#283
- add --jobs option to schema and validate in dathere#284
- --jobs and QSV_MAX_JOBS settings also now work with rayon
- cargo update bump several dependencies
- upgrade calamine fork patch that enables excel command
- removed target-cpu=native in nightly builds so they are more portable
- fixed publish-nightly workflow bugs so nightly builds are built properly
- corrected several build instructions errors in README
- fixed workdir:output_stderr() helper so it also returns std_err message
- fixed Rust Beta workflow so we can also manually test against Rust Beta
- extsort: increased performance. Use 10% of total memory or if total mem is not detectable, 100 mb for in-mem sorting. Increased R/W buffer size to 1mb e2f013f
- searchset: more idiomatic rust fa1f340
- added "Nightly Release Builds" section in README Performance Tuning
- cargo update bump several dependencies
- excel: fixed off by +1 row count (we were counting the header as well); added column count to final message and removed useless human-readable option. c99df2533b5c112d90c6e04068227b7f873459c2
- fixed various bugs in Publish Nightly GitHub Action that automatically built nightly binaries
- Added release nightly binaries, optimized for size and speed
- uses Rust nightly
- also compiles stdlib, so build-time optimizations also apply, instead of using pre-built stdlib
- set panic=abort - removing panic-handling, formatting and backtrace code from binaries
- set RUSTFLAGS= -C target-cpu=native to enable use of additional CPU-level features
- enables unstable/nightly features on regex and rand crates
- Added testing on nightly to CI
- dedup: reduced memory footprint by half by writing directly to disk, rather than storing in working mem, before writing
- excel: show sheet name in message along with row count; let docopt take care of validating mandatory arguments
- More whirlwind tour improvements - how timings were collected, footnotes, etc.
- Bump github/codeql-action from 1 to 2 by @dependabot in dathere#277
- Bump log from 0.4.16 to 0.4.17 by @dependabot in dathere#278
- Bump whatlang from 0.15 to 0.16
- Make file extension processing case-insensitive in dathere#280
- Added Caching section to Performance Tuning
- Added UTF-8 section to Performance Tuning
- removed unneeded header file for wcp.csv used in Whirlwind Tour, now that we have a well-formed wcp.csv
- added headers command to qsvdp binary
- cargo update bump semver from 1.0.7 to 1.0.8
- added rust-clippy GH action workflow
- added security policy
- extsort: use util::njobs to process --jobs option
- various improvements on Whirlwind tour to help users follow along
- extsort: add link to "External Sorting" wikipedia article
- extsort: made and mandatory docopt arguments
- sort: mention extsort in usage text
- added markdownlint.json config to suppress noisy markdown lints in VSC
- reformatted README to apply some markdown lints
- bump whatlang from 0.14 to 0.15
- bump qsv-stats from 0.3.6 to 0.3.7 for some minor perf improvements
- Added extsort command - sort arbitrarily large text files\CSVs using a multi-threaded external sort algorithm.
- Updated whirlwind tour with simple stats step
- py: Automatically create python3.dll import libraries on Windows targets
- Updated build instructions to include full feature
- index: mention QSV_AUTOINDEX env var in usage text
- Corrected minor typos
- Bump jql from 4.0.1 to 4.0.2 by @dependabot in dathere#276
- cargo update bump several dependencies - notably mimalloc
- Created new binary - qsvdp - binary optimized for DataPusher+ in dathere#273. qsvdp only has DataPusher+ relevant commands, with the self-update engine removed. This results in a binary that's 3x smaller than qsvlite, and 6x smaller than qsv with all features enabled.
- dedup: send dupe count to stderr in dathere#272
- dedup: improve usage text
- cargo update bump several crates
- count: corrected usage text typo
- input can now effectively transcode non-utf-8 encoded files to utf-8 in dathere#271
- table: made it flexible - i.e. each row can have varying number of columns
- excel: remove unneeded closure
- use our grex fork, as the upstream fork has an unpublished version number that prevents us from publishing on crates.io.
- use [patch.crates-io] to use crate forks, rather than using the git directive in the dependencies section. This has the added benefit of making the dependency tree smaller, as other crates that depend on the patched crates also use the patches. This should also result in smaller binaries.
- input refactor. Added trimming and epilog skiplines option. dathere#270
- sniff: added note about sniff limitations
- also publish x86_64-unknown-linux-musl binary
- Bump anyhow from 1.0.56 to 1.0.57 by @dependabot in dathere#268
- Bump jsonschema from 0.15.2 to 0.16.0
- use optimized fork of rust-csv, with non-allocating, in-place trimming and various perf tweaks
- use optimized fork of docopt.rs, with various perf & memory allocation tweaks
- use reqwest fork with unreleased changes that remove unneeded crates
- `validate`: use `from_utf8_unchecked` in creating json instances for performance
- `input`: Fixed line-skipping logic so CSV parsing is flexible - i.e. column count can change between records
- `input`: add `--skip-lines` option in dathere#266
- More verbose, matching START/END logging messages when `QSV_LOG_LEVEL` is enabled.
- Bump whatlang from 0.13.0 to 0.14.0 by @dependabot in dathere#264
- Bump filetime from 0.2.15 to 0.2.16 by @dependabot in dathere#263
- Bump uuid from 0.8 to 1 in dathere#267
- Minor documentation improvements
- `cargo update` bumped several other second-level dependencies
- Bump pyo3 from 0.16.3 to 0.16.4
- `stats`: renamed `--dates` option to `--infer-dates`
- `stats`: fixed panic caused by wrong type inference when `--infer-dates` option is on in dathere#256
- Datapusher tweaks, primarily to help with datapusher error-handling in dathere#255
- `excel`: exported count with `--human-readable` option
- use calamine fork to bump dependencies, and reduce binary size
- Bump rayon from 1.5.1 to 1.5.2 by @dependabot in dathere#254
- Bump jql from 4.0.0 to 4.0.1
- removed unnecessary *.d dependency files from published binaries zip
- use performance tweaked forks of csv crate
- Made `this_error` dependency optional with `fetch` feature
- Made `once_cell` dependency optional with `apply` and `fetch` features
- Fixed qsv binary publishing. qsv binary was not built properly, it was built using a qsvlite profile
- `excel` command in dathere#249 and dathere#252
- Bump jql from 3.3.0 to 4.0.0 by @dependabot in dathere#251
- Bump actions/setup-python from 3.1.1 to 3.1.2 by @dependabot in dathere#250
- added version to grex dependency as it's required by crates.io, though we're still using the grex fork without the CLI components.
- `QSV_AUTOINDEX` environment variable. When set, autoindexes csv files, autoupdates stale indices
- `replace`: <NULL> `--replacement` option (dathere#244)
- qsv now automatically screens files for utf-8 encoding. Set `QSV_SKIPUTF8_CHECK` env var to skip encoding check. (dathere#245 and dathere#248)
- `foreach`: refactored. (dathere#247)
- Bump jql from 3.2.3 to 3.3.0
- Bump actions/setup-python from 3.1.0 to 3.1.1 by @dependabot in dathere#246
- use grex fork to remove unneeded CLI dependencies
- qsv requires UTF-8/ASCII encoded files. Doing so allows us to squeeze more performance by removing UTF-8 validation in dathere#239 and dathere#240
- fixed `--jobs` parameter parsing for multithreaded commands in dathere#236 and dathere#237
- Handle/log self-update errors in dathere#233
- `fetch` and `apply`: use cheaper, faster lookup tables for dynamic formatting in dathere#231
- Cleanup - remove commented code; convert `match` to `if let`; more pedantic clippy recommendations, etc. in dathere#232
- `enumerate`: added `--constant` sentinel value in dathere#219
- `fetch`: added `--jqlfile` option in dathere#220
- `stats`: more perf tweaks in dathere#223
- `fetch`: argument parsing refactor, removing need for dummy argument in dathere#222
- applied select pedantic clippy recommendations in dathere#224
- simplified multithreading - removed jobs div by three heuristic in dathere#225
- use qsv-dateparser fork of dateparser for increased performance of `stats`, `schema` and `apply` in dathere#230
- Bump actions/checkout from 2.3.3 to 3 by @dependabot in dathere#228
- Bump actions/stale from 3 to 5 by @dependabot in dathere#227
- Bump actions/setup-python from 2 to 3.1.0 by @dependabot in dathere#226
- `validate`: use user agent & compression settings when fetching jsonschema from a URL in dathere#207
- Build and publish smaller qsvlite binary in dathere#208, dathere#210 & dathere#213
- `sniff`: now works with stdin in dathere#211 and dathere#212
- `stats`: remove smartstring in dathere#214
- various performance tweaks in `stats` and `select`
- README: Installation - git:// is no longer supported by GitHub by @harrybiddle in dathere#205
- README: Fixed wrong footnote for feature flags
- Silent error when an index file is not found is now logged (https://github.com/jqnatividad/qsv/commit/7f2fe7f3259fb74a8d76396dcc2aa71585967b9b)
- bumped self-update to 0.29. This partly addresses #167, as self-update had an indirect dependency on `time` 0.1.43.
- `sniff`: new command to quickly detect CSV metadata in dathere#202
- auto-delimiter setting with `QSV_SNIFF_DELIMITER` environment variable in dathere#203
- `apply`: new `dynfmt` multi-column, dynamic formatting subcommand in dathere#200
- `fetch`: new multi-column dynamic formatting with --url-template option in dathere#196
- `fetch`: --url-template safety tweaks in dathere#197
- `fetch`: automatically minify JSON responses. JSON can still be pretty-printed with --pretty option in dathere#198
- `fetch` is now an optional feature in dathere#201
- `sniff`: improved display in dathere#204
- slim down dev-dependencies
- `py`: now checks if first character of a column is a digit, and replaces it with an underscore
- README: Added datHere logo
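
The `py` column-name handling described in this changelog (spaces converted to underscores, a leading digit replaced with an underscore) can be sketched roughly as follows. This is a minimal illustration, not qsv's actual code: `to_python_identifier` is a hypothetical helper, and it assumes only those two rules apply.

```rust
/// Hypothetical helper: turn a CSV column name into something usable as a
/// Python variable name, following the two rules named in the changelog.
fn to_python_identifier(header: &str) -> String {
    // Rule 1: embedded spaces become underscores.
    let mut name: String = header
        .chars()
        .map(|c| if c == ' ' { '_' } else { c })
        .collect();
    // Rule 2: a leading digit is replaced with an underscore.
    // An ASCII digit is one byte, so the 0..1 byte range is safe here.
    if name.chars().next().map_or(false, |c| c.is_ascii_digit()) {
        name.replace_range(0..1, "_");
    }
    name
}

fn main() {
    for header in ["unit price", "2021 total", "city"] {
        println!("{header:?} -> {:?}", to_python_identifier(header));
    }
}
```

Other characters that are invalid in Python identifiers are left untouched in this sketch.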
- `py`: ensure valid python variable names dathere#192
- `fetch`: dev-dependency actix upgrade (actix-governor from 0.2->0.3; actix-web from 3.3->4.0) dathere#193
- `lua`: replace hlua with mlua dathere#194
- `stats`: refactor for performance - skip from_utf8 check as input is utf8 transcoded as necessary; smartstring dathere#195
- Whirlwind Tour: show country-continent.csv file with comment handling
- cargo bump update several dependencies
- `stats`: only compute quartiles/median for int/float fields - dathere#195
- README: note about `--output` option changing delimiter automatically based on file extension and UTF-8 encoding the file
- README: Windows usage note about UTF16-LE encoding and `--output` workaround
- upgraded regex to 1.5.5 which resolves the GHSA-m5pq-gvj9-9vr8 security advisory
- `count`: `--human-readable` option in dathere#184
- Automatic utf8 transcoding in dathere#187
- Added NYC School of Data 2022 presentation
- Added ahash 0.7 and encoding_rs_io 0.1 dependencies
- Use ahash::AHashMap instead of std::collections::HashMap for performance in dathere#186
- Revamped Whirlwind Tour
- bumped several dependencies
  - anyhow 1.0.55 to 1.0.56
  - ipnet 2.3.1 to 2.4.0
  - pyo3 0.16.0 to 0.16.1
- `py`: convert spaces to underscores for valid python variable names when column names have embedded spaces in dathere#183
- docs: CSV Kit got a 10x improvement by @jpmckinney in dathere#180
- `fetch`: added jql selector to cache key
- Corrected README mixup re `join` hashmap indices and qsv indices
- @jpmckinney made their first contribution in dathere#180
- `stats`: added `--dates` option. This option turns on date/datetime data type inferencing, which is a very expensive operation. Only use this option when you have date/datetime fields and you want to compile the proper statistics for them (otherwise, they will be treated as "String" fields.)
- added intentionally kitschy qsv logo 😁
- `stats`: added `datetime` data type inferencing
- `fetch`: added optional Redis response caching
- `schema`: added `--strict-dates` option by @mhuang74 in dathere#177
- `validate`: added more robust RFC 4180-compliance checking when no jsonschema is provided
- added Redis to CI
- bumped reverse-geocoder crate from 2.0.1 to 3.0.0 to modernize geonames reverse geocoder
- bumped cached crate from 0.30.0 to 0.33.0 to enable Redis response caching
- bumped various other dependencies to latest release
- removed invalid `--path` cargo install option in README
- `workdir.rs` was not properly cleaning up test files
- `fetch`: add `--url-template` and `--redis` options in dathere#175
- `stats`: add `DateTime` data type (RFC3339 format) in dathere#176
- added Rust Beta to Github Actions CI
- `validate`: improve performance and simplify error report format by @mhuang74 in dathere#172
- Additional `validate` performance tweaks in dathere#173
- changed MSRV to latest Rust stable - 1.59.0
- removed `num_cpus` crate and use new `std::thread::available_parallelism` stabilized in Rust 1.59.0
- use new cargo.toml `strip` option to strip binaries
- refactored GitHub Actions CI to make it faster
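
For readers unfamiliar with the standard-library call that replaced `num_cpus`, here is a minimal sketch of querying the logical CPU count with `std::thread::available_parallelism`. The `logical_cpus` wrapper and its fallback of 1 are assumptions for illustration, not qsv's code.

```rust
use std::thread;

/// Hypothetical wrapper: number of logical CPUs reported by the standard
/// library, falling back to 1 when the query is unsupported or fails.
fn logical_cpus() -> usize {
    // available_parallelism returns io::Result<NonZeroUsize>.
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    println!("logical CPUs detected: {}", logical_cpus());
}
```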
- `schema` (#60): pattern constraint for string types by @mhuang74 in dathere#168
- `validate`: improve performance by @mhuang74 in dathere#170
- `fetch`: Spell out k:v -> key:value in docopt usage text
- cargo update bump several dependencies
- `validate`: bug fix and refactor by @mhuang74 in dathere#171
- `fetch`: upgrade to jql 3.1.0 by @mhuang74 in dathere#160
- `schema`: refactor tests by @mhuang74 in dathere#161
- `schema`: support Enum constraint by @mhuang74 in dathere#162
- `schema`: default to include value constraints by @mhuang74 in dathere#166
- bumped `qsv-stats` to 0.3.6 for `stats` & `frequency` performance tweaks
- specify that `apply geocode` expects WGS84 coordinate system
- cargo update bump several dependencies
- changed CI to run clippy and rustfmt automatically
- `schema`: Fix bug with enum by @mhuang74 in dathere#163
- `schema` POC by @mhuang74 in dathere#155
- `schema`: add value constraints via stats by @mhuang74 in dathere#158
- `schema`: update command description by @mhuang74 in dathere#159
- `stats` data type inference changed to more straightforward "String" from "Unicode"
- changed CI/CD to use rust-cache GitHub Actions making it ~3x faster.
- always build and test with `--locked` flag. This allows us to use rust-cache and guarantee that builds are using the exact dependency versions qsv requires.
- bumped `qsv-stats` to 0.3.5 for `stats` performance tweaks
- Validate: bug fixes by @mhuang74 in dathere#154
- Validate: bug fixes by @mhuang74 in dathere#151
- Python 3.8 (current stable version) is now required for the `py` command. Changed from Python 3.7.
- bumped jsonschema dependency to 0.15.
- always build/publish with `--locked` flag in CI/CD.
- enclose environment variable values with double quotes when using `--envlist` option
- use more captured identifiers in format strings.
- added `--helper` option to `py` command. This allows users to load a python user helper script as a module named `qsv_uh`. Example
- added support for last N records in `slice` command by allowing negative values for the `slice --start` option.
- added progress bar to `py` command.
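
The negative `slice --start` behavior above amounts to counting back from the end of the data. The sketch below shows one way to resolve such an index. `resolve_start` is a hypothetical helper, and clamping out-of-range values is an assumption rather than documented qsv behavior.

```rust
/// Hypothetical helper: resolve a possibly negative start index against the
/// total number of records. A negative value counts back from the end.
fn resolve_start(start: i64, record_count: usize) -> usize {
    if start >= 0 {
        // Assumption: a start past the end is clamped to the end.
        (start as usize).min(record_count)
    } else {
        // Assumption: asking for more records than exist starts at 0.
        record_count.saturating_sub(start.unsigned_abs() as usize)
    }
}

fn main() {
    let records = ["a", "b", "c", "d", "e"];
    // -2 selects the last two records.
    let start = resolve_start(-2, records.len());
    println!("{:?}", &records[start..]);
}
```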
- convert more format strings to use captured identifiers
- bump jsonschema to 0.14.0. This will allow cross-compilation to work again as we can explicitly use rustls for reqwest. This is required as cross no longer bundles openssl.
- fixed broken self-update (#150)
- `validate` command by @mhuang74 in dathere#145
- README: additional information on xsv fork differences
- bumped MSRV to 1.58.1
- `validate` tweaks in dathere#148
- `validate` buffered jsonl error report in dathere#149
- fix `fetch` bugs by @mhuang74 in dathere#146
- README: added missing `--path` option in `cargo install`
- refactored `--update` to give update progress messages; run on `--help` as well
- updated README
  - remove bold formatting of commands
  - expanded descriptions of
    - fixlengths
    - foreach
    - jsonl
    - py
    - searchset
  - added reason why pre-built binaries on some platforms do not have the python feature installed.
  - drop use of "parallelism", just say "multithreading"
  - expanded Feature Flag section
- bump cached from 0.26 to 0.29
- added `update_cache_info!` macro to util.rs, replacing redundant code for progress indicators with cache info
- bump MSRV to Rust 1.58
- use new Rust 1.58 captured identifiers for format strings
- added `output_stderr` test helper to test for expected errors in CI
- added tests for invalid delimiter length; truncated comment char and unknown apply operators
- pointed documentation to Github README instead of docs.rs
- added `rustup update` to Github Actions publish workflow as Github's runners are still on Rust 1.57
- added Debian package build to publish workflow for `x86_64-unknown-linux-musl`
- corrected help text: the job divisor is 3, not 4, for multithreaded commands (`frequency`, `split` and `stats`)
- corrected `stats` help text to state that multithreading requires an index
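
The "captured identifiers" mentioned above are the Rust 1.58 feature that lets a format string name a variable in scope directly. A minimal before/after sketch follows. The `describe` function and its message are illustrative only.

```rust
/// Illustrative function: builds a message using Rust 1.58+ captured
/// identifiers instead of positional arguments.
fn describe(command: &str, jobs: usize) -> String {
    // Before Rust 1.58 this had to be written as:
    //     format!("{} uses {} jobs", command, jobs)
    format!("{command} uses {jobs} jobs")
}

fn main() {
    println!("{}", describe("stats", 4));
}
```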
- `fetch`: enable cookies and storing error messages by @mhuang74 in dathere#141
- `fetch`: improve jql integration by @mhuang74 in dathere#139
- `--envlist` option now returns all qsv-relevant environment variables in dathere#140
- Move logging and update utility functions to util.rs in dathere#142
- `fetch`: support custom http headers by @mhuang74 in dathere#143
- bumped whatlang to 0.13.0 which supports Tagalog detection
- improved documentation of feature flags, environment variables & `stats` command
- added JSONL/NDJSON to Recognized File Formats (thru `jsonl` command)
- added CODE_OF_CONDUCT.md
- removed WIP indicator from `fetch` in README
- Fetch: support rate limiting by @mhuang74 in dathere#133
- Runtime minimum version check for Python 3.7 if `python` feature is enabled by @jqnatividad in dathere#138
- Fine-tuned GitHub Actions publish workflow for pre-built binaries
  - removed upx compression, as it was creating invalid binaries on certain platforms
  - enabled `python` feature on x86_64 platforms as we have access to the Python interpreter on GitHub's Action runners
  - include both `qsv` and `qsvlite` in the distribution zip file
- Formatted Cargo.toml with Even Better TOML VS code extension
- changed Cargo.toml categories and keywords
- removed patch version number from Cargo.toml dependencies. Let cargo do its semver dependency magic, and we include the Cargo.lock file anyway.
- added example of Python f-string formatting to `py` help text
- added Python f-string formatting test
- Added note in README about enabled features in pre-built binaries
- Removed NEW and EXTENDED indicators in README
- changed publish workflow for apple targets to use Xcode 12.5.1 from 12.4
- `jsonl` command now recognizes and processes JSON arrays
- `--version` option now shows binary name and enabled features
- Use upgraded `qsv_currency` fork to power `apply currencytonum` operation. Now supports currency strings (e.g. USD, EUR, JPY, etc) in addition to currency symbols (e.g. $, €, ¥, etc)
- renamed `QSV_COMMENTS` environment variable to `QSV_COMMENT_CHAR` to make it clear that we're expecting a single character, not a boolean as the old name implies.
- added `create_from_string` helper function in workdir.rs
- compress select pre-built binaries with UPX
- `qsvlite` binary target, with all features disabled.
- `py` command. Evaluates a Python expression over CSV lines to transform, aggregate or filter them.
- removed Debian package publishing workflow, as the GH action for it does not support Rust 2021 edition
- automatic self-update version check when the `--list` option is invoked.
- `QSV_NO_UPDATE` environment variable to prohibit self-update checks.
- explicitly include `deflate` compression method for self_update. Otherwise, `--update` unzipping doesn't work.
- explicitly include `deflate` compression method for self_update. Otherwise, `--update` unzipping doesn't work.
- `fetch` refinements. Still WIP, but usable (See #77)
  - add default user agent
  - `fetch` progress bar
  - `--jobs`, `--throttle`, `--header`, `--store-error` and `cookies` options still not functional.
- cargo update bump several crates to their latest releases. Of note are `test-data-generation`, `self_update` and `jql` where we worked with the crate maintainers directly with the update.
- `--update` bug fixed. It was not finding the binary to self update properly.
- `fetch` command by @mhuang74. Note that the command is functional but still WIP, that's why this is a beta release.
- Download badge for GitHub pre-built binaries
- Compute hashes for pre-built binaries for verification
- Additional helptext for `apply` NLP functions
- standardized on canonical way to suppress progress bars with `--quiet` option
- README: Mentioned `--frozen` option when installing/building qsv; wordsmithing
- rustfmt; clippy
- remove obsolete Makefile and .gitsubmodules
- changed selfupdate dependency to use pure Rust TLS implementation as cross no longer bundles OpenSSL, causing some binary builds using cross to fail.
- Add logging by @mhuang74 in dathere#116
- Environment variables for logging - `QSV_LOG_LEVEL` and `QSV_LOG_DIR` - see Logging for more details.
- `sentiment` analysis `apply` operation by @jqnatividad in dathere#121
- `whatlang` language detection `apply` operation by @jqnatividad in dathere#122
- aarch64-apple-darwin prebuilt binary (Apple Silicon AKA M1)
- `--envlist` convenience option to list all environment variables with the `QSV_` prefix
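
As an illustration of what an `--envlist`-style listing involves, the sketch below filters environment variables by the `QSV_` prefix. `qsv_vars` is a hypothetical helper and the output format is an assumption. This is not qsv's implementation.

```rust
use std::env;

/// Hypothetical helper: keep only variables whose name starts with "QSV_",
/// sorted by name so the listing is stable.
fn qsv_vars<I>(vars: I) -> Vec<(String, String)>
where
    I: IntoIterator<Item = (String, String)>,
{
    let mut selected: Vec<(String, String)> = vars
        .into_iter()
        .filter(|(name, _)| name.starts_with("QSV_"))
        .collect();
    selected.sort();
    selected
}

fn main() {
    // Note: env::vars() panics on non-Unicode entries; fine for a sketch.
    for (name, value) in qsv_vars(env::vars()) {
        println!("{name}: \"{value}\"");
    }
}
```

Taking the variables as an iterator keeps the filter testable without touching the real process environment.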
- changed `MAX_JOBS` heuristic logical processor divisor from 4 to 3
- `selfupdate` is no longer an optional feature
- @mhuang74 made their first contribution in dathere#116
- added `--update` option. This allows qsv to check and update itself if there are new release binaries published on GitHub.
- added `--envlist` option to show all environment variables with the `QSV_` prefix.
- `apply`, `generate`, `lua`, `foreach` and `selfupdate` are now optional features. `apply` and `generate` are marked optional since they have large dependency trees; `lua` and `foreach` are very powerful commands that can be abused to issue system commands. Users now have the option to exclude these features from their local builds. Published binaries on GitHub still have `--all-features` enabled.
- added `QSV_COMMENTS` environment variable (contributed by @jbertovic). This allows qsv to ignore lines in the CSV (including headers) that start with the set character. EXAMPLES
- catch the empty-input condition when using `select` (e.g. `cat /dev/null | qsv select 1` will now show the error "Input is empty." instead of "Selector index 1 is out of bounds. Index must be >= 1 and <= 0.")
- added `--pad <arg>` option to `split` command to zero-pad the generated filename by the number of `<arg>` places. EXAMPLES
- tests for `QSV_COMMENTS`, `split --pad`, `select` input empty condition
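
The zero-padding that `split --pad <arg>` describes can be sketched with Rust's width formatting. `padded_filename` is a hypothetical helper, and the `<index>.csv` naming pattern is assumed for illustration rather than taken from qsv's source.

```rust
/// Hypothetical helper: build a chunk filename whose numeric part is
/// zero-padded to `pad` places (assumed pattern: "<index>.csv").
fn padded_filename(start_index: usize, pad: usize) -> String {
    // `{:0width$}` pads with leading zeros up to `width` characters.
    format!("{:0width$}.csv", start_index, width = pad)
}

fn main() {
    for idx in [0, 100, 200] {
        println!("{}", padded_filename(idx, 5));
    }
}
```

Zero-padded names sort lexicographically in the same order as their numeric indices, which is the usual reason for padding generated filenames.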
- set Cargo.toml to Rust 2021 edition
- added "command-line-utilities" category to crates.io metadata
- cargo update bumped `mimalloc`, `serde_json`, `syn`, `anyhow` and `ryu`.
- GitHub Actions CI tests run with `--all-features` enabled.
- published binaries on GitHub have `--all-features` enabled by default.
- made geocode caching a tad faster by making the transitional cache unbounded, and simplifying the key.
- `--version` now also shows the number of logical CPUs detected.
- project-wide rustfmt
- documentation for features, `QSV_COMMENTS` and `apply`
- removed greetings.yml workflow from GitHub Actions.
- added `lua` and `foreach` feature flags. These commands are very powerful and can be easily abused or get into "foot-shooting" scenarios. They are now only enabled when these features are enabled during install/build.
- `censor` and `censor_check` now support the addition of custom profanities to screen for with the --comparand option.
- removed `lazy_static` and used `once_cell` instead
- smaller stripped binaries for `x86_64-unknown-linux-gnu`, `i686-unknown-linux-gnu`, `x86_64-apple-darwin` targets
- expanded `apply` help text
- added more tests (currencytonum, censor, censor_check)
- `generate` command. Generate test data by profiling a CSV using a Markov decision process.
- add `--no-headers` option to `rename` command (see discussion #81)
- Auto-publish binaries for more platforms on release
- added combo-test for sort-dedup-sort (see discussion #80)
- New environment variables galore
  - `QSV_DEFAULT_DELIMITER` - single ascii character to use as delimiter. Overrides `--delimiter` option. Defaults to "," (comma) for CSV files and "\t" (tab) for TSV files, when not set. Note that this will also set the delimiter for qsv's output. Adapted from xsv PR by @camerondavison.
  - `QSV_NO_HEADERS` - when set, the first row will NOT be interpreted as headers. Supersedes `QSV_TOGGLE_HEADERS`.
  - `QSV_MAX_JOBS` - number of jobs to use for parallelized commands (currently `frequency`, `split` and `stats`). If not set, max_jobs is set to number of logical processors divided by four. See Parallelization for more info.
  - `QSV_REGEX_UNICODE` - if set, makes `search`, `searchset` and `replace` commands unicode-aware. For increased performance, these commands are not unicode-aware by default: they will ignore unicode values when matching and will panic when unicode characters are used in the regex.
- Added parallelization heuristic (num_cpus/4), in connection with `QSV_MAX_JOBS`.
- Added more tests
  - `apply` (test for regex_replace, eudex, and lat/long parsing)
  - combo-test (see above) - for testing qsv command combinations
  - tests for `QSV_NO_HEADERS` environment variable
  - tests for `QSV_REGEX_UNICODE` environment variable in `search`, `searchset` and `replace` commands
  - tests for `QSV_DEFAULT_DELIMITER` environment variable
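
The `QSV_MAX_JOBS` rule described above (use the variable when set, otherwise logical processors divided by four) can be sketched as below. `max_jobs` is a hypothetical function. Ignoring unparsable or zero values and never returning fewer than one job are assumptions added for the sketch.

```rust
/// Hypothetical helper: pick the number of jobs from an optional
/// QSV_MAX_JOBS value, else fall back to the num_cpus/4 heuristic.
fn max_jobs(env_value: Option<&str>, logical_cpus: usize) -> usize {
    env_value
        .and_then(|v| v.parse::<usize>().ok())
        // Assumption: a value of 0 is treated as "not set".
        .filter(|&jobs| jobs > 0)
        // Assumption: the heuristic is clamped to at least one job.
        .unwrap_or_else(|| (logical_cpus / 4).max(1))
}

fn main() {
    let env_value = std::env::var("QSV_MAX_JOBS").ok();
    let cpus = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("max_jobs = {}", max_jobs(env_value.as_deref(), cpus));
}
```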
- MSRV of Rust 1.56
- expanded `apply` help-text examples
- progress bar now only updates every 1% progress by default
- replaced English-specific soundex with multi-lingual eudex algorithm (see https://docs.rs/crate/eudex/0.1.1)
- refactored `apply geocode` subcommand to improve cache performance
- improved lat/long parsing - can now recognize embedded coordinates in text
- changed `apply operations regex_replace` behavior to do all matches in a field, instead of just the left-most one, to be consistent with the behavior of `apply operations replace`
- added `apply geocode` caching, more than doubling performance in the geocode benchmark.
- added `--random` and `--seed` options to `sort` command from @pjsier.
- added qsv tab completion section to README.
- additional `apply operations` subcommands:
  - Match Trim operations - enables trimming of more than just whitespace, but also of multiple trim characters in one pass (Example):
    - mtrim: Trims `--comparand` matches left & right of the string (trim_matches wrapper)
    - mltrim: Left trim `--comparand` matches (trim_start_matches wrapper)
    - mrtrim: Right trim `--comparand` matches (trim_end_matches wrapper)
  - replace: Replace all matches of a pattern (using `--comparand`) with a string (using `--replacement`) (Std::String replace wrapper).
  - regex_replace: Replace the leftmost-first regex match with `--replacement` (regex replace wrapper).
  - titlecase - capitalizes English text using Daring Fireball titlecase style https://daringfireball.net/2008/05/title_case
  - censor_check: check if profanity is detected (boolean) Examples
  - censor: profanity filter
- added parameter validation to `apply operations` subcommands
- added more robust parameter validation to `apply` command by leveraging docopt
- added more tests
- added `rust-version` in Cargo.toml to specify MSRV of rust 1.56
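
The Match Trim operations listed above are described as wrappers around the standard `trim_matches`, `trim_start_matches` and `trim_end_matches` string methods. The sketch below shows how those methods behave when the comparand is treated as a set of trim characters. The three function names mirror the operation names but are hypothetical, and the set-of-characters interpretation is an assumption based on the "multiple trim characters in one pass" wording.

```rust
/// Hypothetical: trim any character found in `comparand` from both ends.
fn mtrim<'a>(s: &'a str, comparand: &str) -> &'a str {
    s.trim_matches(|c: char| comparand.contains(c))
}

/// Hypothetical: trim any character found in `comparand` from the left.
fn mltrim<'a>(s: &'a str, comparand: &str) -> &'a str {
    s.trim_start_matches(|c: char| comparand.contains(c))
}

/// Hypothetical: trim any character found in `comparand` from the right.
fn mrtrim<'a>(s: &'a str, comparand: &str) -> &'a str {
    s.trim_end_matches(|c: char| comparand.contains(c))
}

fn main() {
    let field = "xx--value--xx";
    println!("{}", mtrim(field, "x-"));
    println!("{}", mltrim(field, "x-"));
    println!("{}", mrtrim(field, "x-"));
}
```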
- revamped benchmark script:
  - allow binary to be changed, so users can benchmark xsv and other xsv forks by simply replacing the $bin shell variable
  - now uses a much larger data file - a 1M row, 512 mb, 41 column sampling of NYC's 311 data
  - simplified and cleaned-up script now that it's just using 1 data file
- Upgrade rand and quickcheck crates to latest releases (0.8.4 and 1.0.3 respectively), and modified code accordingly.
- `cargo update` bumped addr2line (0.16.0->0.17.0), backtrace (0.3.62->0.3.63), gimli (0.25.0->0.26.1) and anyhow (1.0.44->1.0.45)
- removed `scramble` command as its function is now subsumed by the `sort` command with the `--random` and `--seed` options
- removed `num-format` crate which has a large dependency tree with several old crates; replaced with much smaller `thousands` crate.
- removed 1M row, 48mb, 7 column world_cities_pop_mil.csv as it's no longer used by the revamped benchmark script.
- removed `build.rs` build dependency that was checking for MSRV of Rust >= "1.50". Instead, took advantage of new `rust-version` Cargo.toml option introduced in Rust 1.56.
- added string similarity operations to `apply` command:
  - simdl: Damerau-Levenshtein similarity
  - simdln: Normalized Damerau-Levenshtein similarity (between 0.0 & 1.0)
  - simjw: Jaro-Winkler similarity (between 0.0 & 1.0)
  - simsd: Sørensen-Dice similarity (between 0.0 & 1.0)
  - simhm: Hamming distance. Number of positions where characters differ.
  - simod: OSA Distance.
  - soundex: sounds like (boolean)
- added progress bars to commands that may spawn long-running jobs - for this release, `apply`, `foreach`, and `lua`. Progress bars can be suppressed with `--quiet` option.
- added progress bar helper functions to utils.rs.
- added `apply` to benchmarks.
- added sample NYC 311 data to benchmarks.
- added records per second (RECS_PER_SEC) to benchmarks
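
Of the similarity operations listed above, Hamming distance is simple enough to show in full. The sketch below counts the positions where two strings differ. `hamming` is a hypothetical standalone function, and returning `None` for strings of unequal length is an assumption, since the changelog does not say how `simhm` handles that case.

```rust
/// Hypothetical helper: number of character positions at which `a` and `b`
/// differ. Returns None when the lengths differ (assumed behavior).
fn hamming(a: &str, b: &str) -> Option<usize> {
    if a.chars().count() != b.chars().count() {
        return None;
    }
    Some(a.chars().zip(b.chars()).filter(|(x, y)| x != y).count())
}

fn main() {
    println!("{:?}", hamming("karolin", "kathrin"));
    println!("{:?}", hamming("abc", "ab"));
}
```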
- major refactoring of `apply` command:
  - to take advantage of docopt parsing/validation.
  - instead of one big command, broke down apply to several subcommands:
    - operations
    - emptyreplace
    - datefmt
    - geocode
- simplified lat/long regex validator to no longer validate range, as the underlying geocoder function validates it already - 18% geocode speedup.
- bumped docopt back up to 1.1.1.
- improved error message when specifying an invalid apply operation.
- new `scramble` command. Randomly scrambles a CSV's records.
- read/write buffer capacity can now be set using environment variables `QSV_RDR_BUFFER_CAPACITY` and `QSV_WTR_BUFFER_CAPACITY` (in bytes).
- added additional test for `apply datefmt`.
- default read buffer doubled from 8k to 16k.
- default write buffer doubled from 32k to 64k.
- benchmark script revamped. Now produces aligned output onscreen, while also creating a benchmark TSV file; downloads the sample file from GitHub; benchmark more commands.
- version info now also returns memory allocator being used, and number of cpus detected.
- minor refactor of `enumerate`, `explode`, `fill` and `foreach` commands.
- removed benchmark data from repository. Moved to GitHub wiki instead.
- use docopt v1.1.0 instead of docopt v1.1.1 for docopt to support all regex features
- added `mimalloc` feature flag. mimalloc is Microsoft's performance-oriented memory allocator. Earlier versions of qsv used mimalloc by default. Now it is only used when the feature is set.
- README: Added Performance section.
- README: Document how to enable `mimalloc` feature.
- README: Explicitly show how to set environment variables on different platforms.
- `stats` `mode` is now also multi-modal - i.e. returns multiple modes when detected. e.g. mode[1,1,2,2,3,4,6,6] will return [1,2,6]. It will continue to return one mode if there is only one detected.
- `stats` `quartile` now also computes IQR, lower/upper fences and skew (using Pearson's median skewness). For code simplicity, calculated skew with quartile.
- `join` now also supports `left-semi` and `left-anti` joins, the same way Spark does.
- `search` `--flag` option now returns row number, not just '1'.
- `searchset` `--flag` option now returns row number, followed by a semi-colon, and a list of matching regexes.
- README: Added badges for Security Audit, Discussion & Docs
- README: Added FAQ link in fork note.
- point to https://docs.rs/crate/qsv for documentation.
- README: `stats` and `join` section updated with new features.
- README: wordsmithing - replaced "CSV data" and "CSV file/s" with just "CSV".
- in `stats` changed `q2` column name to `q2_median`.
- removed debug symbols in release build for smaller binaries.
- minor refactoring of `search`, `searchset` & `stats`.
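
The multi-modal `mode` behavior described above (every value tied for the highest frequency is returned) can be sketched as follows. `modes` is a hypothetical function over integers and is not the qsv-stats implementation. How qsv treats inputs where every value is unique is not stated in this changelog, so that case is left as this sketch happens to handle it.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: return every value that occurs with the highest
/// frequency, in ascending order.
fn modes(values: &[i64]) -> Vec<i64> {
    // BTreeMap keeps keys sorted, so the result comes out in order.
    let mut counts: BTreeMap<i64, usize> = BTreeMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    let max = counts.values().copied().max().unwrap_or(0);
    counts
        .into_iter()
        .filter(|&(_, count)| count == max)
        .map(|(value, _)| value)
        .collect()
}

fn main() {
    // The example from the changelog entry: three values tie with two hits.
    println!("{:?}", modes(&[1, 1, 2, 2, 3, 4, 6, 6]));
}
```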
- README: fixed `flatten` example.
- removed Rust badge.
- added sample regexset file for PII-screening.
- `apply geocode --formatstr` now accepts less US-centric format selectors.
- `searchset --flag` now shows which regexes match as a list (e.g. "[1, 3, 5]"), not just "1" or "0".
- `foreach` command now returns error message on Windows. `foreach` still doesn't work on Windows (yet), but at least it returns "foreach command does not work on Windows.".
- `apply geocode` was not accepting valid lat/longs below the equator. Fixed regex validator.
- more robust `searchset` error handling when attempting to load regexset files.
- `apply` link on README was off by one.
- bumped `dateparser` to 0.1.6. This now allows `apply datefmt` to properly reformat dates without a time component. Before, when reformatting a date like "July 4, 2020", qsv returned "2020-07-04T00:00:00+00:00". It now returns "2020-07-04".
- minor clippy refactoring
- removed rust-stats submodule introduced in 0.17.1. It turns out crates.io does not allow publishing of crates with local dependencies on submodules. Published the modified rust-stats fork as qsv-stats instead. This allows us to publish qsv on crates.io
- removed unused `textwrap` dependency
- explicitly specified embedded modified rust-stats version in Cargo.toml.
- added `searchset` command. Run multiple regexes over CSV data in a single pass.
- added `--unicode` flag to `search`, `searchset` and `replace` commands. Previously, regex unicode support was on by default, which comes at the cost of performance. And since `qsv` optimizes for performance ("q is for quick"), it is now off by default.
- added quartiles calculation to `stats`. Pulled in upstream pending PRs from @m15a to implement.
- changed variance algorithm. For some reason, the previous variance algorithm was causing intermittent test failures on macOS. Pulled in pending upstream PR from @ruppertmillard.
- embedded rust-stats fork submodule which implements quartile and new variance algorithm.
- changed GitHub Actions to pull in submodules.
- the project was not following semver properly, as several new features were released in the 0.16.x series that should have been MINOR version bumps, not PATCH bumps.
- added `geocode` operation to `apply` command. It geocodes to the closest city given a column with coordinates in Location format ('latitude, longitude') using a static geonames lookup file. (see https://docs.rs/reverse_geocoder)
- added `currencytonum` operation to `apply` command.
- added `getquarter.lua` helper script to support `lua` example in Cookbook.
- added `turnaroundtime.lua` helper script to compute turnaround time.
- added `nyc311samp.csv` to provide sample data for recipes.
- added several Date Enrichment and Geocoding recipes to Cookbook.
- fixed `publish.yml` Github Action workflow to properly create platform specific binaries.
- fixed variance test to eliminate false positives in macOS.
- added `docs` directory. For README reorg, and to add detailed examples per command in the future.
- added `emptyreplace` operation to `apply` command.
- added `datefmt` operation to `apply` command.
- added support for reading from stdin to `join` command.
- setup GitHub wiki to host Cookbook and sundry docs to encourage collaborative editing.
- added footnotes to commands table in README.
- changed GitHub Actions publish workflow so it adds the version to binary zip filename.
- changed GitHub Actions publish workflow so binary is no longer in `target/release` directory.
- reorganized README.
- moved whirlwind tour and benchmarks to `docs` directory.
- use zipped repo copy of worldcitiespop_mil.csv for benchmarks.
- fixed links to help text in README for `fixlengths` and `slice` cmds
- `exclude` not listed in commands table. Added to README.
- Removed `empty0` and `emptyNA` operations in `apply` command. Replaced with `emptyreplace`.
- changed Makefile to remove github recipe as we are now using GitHub Actions.
- Applied rustfmt to entire project #56
- Changed stats variance test as it was causing false positive test failures on macOS (details)
- removed `-amd64` suffix from binaries built by GitHub Actions.
- fixed publish Github Actions workflow to zip binaries before uploading.
- removed `.travis.yml` as we are now using GitHub Actions.
- removed scripts `build-release`, `github-release` and `github-upload` as we are now using GitHub Actions.
- removed `ci` folder as we are now using GitHub Actions.
- removed `py` command. #58
- Bumped qsv version to 0.16.1. Inadvertently released 0.16.0 with qsv version still at 0.15.0.
- Added a CHANGELOG.
- Added additional commands/options from @Yomguithereal xsv fork.
  - `apply` - Apply series of string transformations to a CSV column.
  - `behead` - Drop headers from CSV file.
  - `enum` - Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value.
  - `explode` - Explode rows into multiple ones by splitting a column value based on the given separator.
  - `foreach` - Loop over a CSV file to execute bash commands.
  - `jsonl` - Convert newline-delimited JSON to CSV.
  - `lua` - Execute a Lua script over CSV lines to transform, aggregate or filter them.
  - `pseudo` - Pseudonymise the value of the given column by replacing them by an incremental identifier.
  - `py` - Evaluate a Python expression over CSV lines to transform, aggregate or filter them.
  - `replace` - Replace CSV data using a regex.
  - `sort` `--uniq` option - When set, identical consecutive lines will be dropped to keep only one line per sorted value.
  - `search` `--flag column` option - If given, the command will not filter rows but will instead flag the found rows in a new column named `column`.
- Added conditional compilation logic for `foreach` command to only compile on `target_family=unix` as it has a dependency on `std::os::unix::ffi::OsStrExt` which only works in unix-like OSes.
- Added `empty0` and `emptyNA` operations to `apply` command with corresponding test cases.
- Added GitHub Actions to check builds on `ubuntu-latest`, `windows-latest` and `macos-latest`.
- Added GitHub Action to publish binaries on release.
- Added `build.rs` build-dependency to check that Rust is at version 1.50.0 or above.
- reformatted README listing of commands to use a table, and to link to corresponding help text.
- Removed appveyor.yml as qsv now uses GitHub Actions.
- `dedup` cmd from @ronohm.
- `table` cmd `--align` option from @alex-ozdemir.
- `fmt` cmd `--quote-never` option from @niladic.
- `exclude` cmd from @lalaithion.
- Added `--dupes-output` option to `dedup` cmd.
- Added datetime type detection to `stats` cmd.
- Added datetime `min/max` calculation to `stats` cmd.
- es-ES translation from @ZeliosAriex.
- Updated benchmarks script.
- Updated whirlwind tour to include additional commands.
- Made whirlwind tour reproducible by using `sample` `--seed` option.
- Fixed `sample` percentage sampling to be always reproducible even if sample size < 10% when using `--seed` option.
- Fixed BOM issue with tests, leveraging unreleased xsv fix.
- Fixed count help text typo.
- Removed `session.vim` file.
- Performance: enabled link-time optimization (`LTO="fat"`).
- Performance: used code generation units.
- Performance: used mimalloc allocator.
- Changed benchmark to compare xsv 0.13.0 and qsv.
- Changed chart from png to svg.
- Performance: Added note in README on how to optimize local compile by setting `target-cpu=native`.
- Renamed fork to qsv.
- Revised highlight note in README explaining the reason the fork was renamed to qsv.
- Added (NEW) and (EXPANDED) notations to command listing.
- Adapted to Rust 2018 edition.
- used serde derive feature.
Initial fork from xsv.
- `rename` cmd from @Kerollmops.
- `fill` cmd from @alexrudy.
- `transpose` cmd from @mintyplanet.
- `select` cmd regex support from @sd2k.
- `stats` cmd `--nullcount` option from @scpike.
- added percentage sampling to `sample` cmd.
- Updated README with additional commands.