A fast, flexible, and memory-efficient command-line tool written in Rust for processing large CSV files. Inspired by xsv and built on Polars, it's designed for handling tens or hundreds of gigabytes of CSV data efficiently in workflows like log analysis and digital forensics.
Note: The original version of this project was implemented in Python and can be found at sumeshi/quilter-csv. This Rust version is a complete rewrite.
- Pipeline-style command chaining: Chain multiple commands in a single line for fast and efficient data processing
- Flexible filtering and transformation: Perform operations like select, filter, sort, deduplicate, and timezone conversion
- YAML-based batch processing (Quilt): Automate complex workflows using YAML configuration files
To see available commands and options, run `qsv` without any arguments or with `-h`:
$ qsv -h
Here's an example of reading a CSV file, extracting rows that contain 4624 in the 'Event ID' column, and displaying the top 3 rows sorted by the 'Date and Time' column:
$ qsv load Security.csv - isin 'Event ID' 4624 - sort 'Date and Time' - head 3 - showtable
This command:
- Loads `Security.csv`
- Filters rows where `Event ID` is 4624
- Sorts by `Date and Time`
- Shows the first 3 rows as a table
Quilter-CSV commands are composed of three types of steps:
- Initializer: Loads data (e.g., `load`)
- Chainable: Transforms or filters data (e.g., `select`, `grep`, `sort`, etc.)
- Finalizer: Outputs or summarizes data (e.g., `show`, `showtable`, `headers`, etc.)

Each step is separated by a hyphen (`-`):
$ qsv <INITIALIZER> <args> - <CHAINABLE> <args> - <FINALIZER> <args>
Note: If no finalizer is explicitly specified, `showtable` is automatically used as the default finalizer, making it easy to quickly view results:
$ qsv load data.csv - select col1,col2 - head 5
# Equivalent to:
$ qsv load data.csv - select col1,col2 - head 5 - showtable
Load one or more CSV files.
Parameter | Type | Default | Description |
---|---|---|---|
path | list[str] | | One or more paths to CSV files. Glob patterns are supported. Gzip files (.gz) are automatically detected and decompressed. |
-s, --separator | str | `,` | Field separator character. |
--low-memory | flag | false | Enable low-memory mode for very large files. |
--no-headers | flag | false | Treat the first row as data, not headers. When enabled, columns will be named automatically (column_0, column_1, etc.). |
--chunk-size | int | (auto) | Number of rows to read per chunk. Controls memory usage during file processing. Smaller values use less memory but increase I/O operations. |
Example:
$ qsv load data.csv
$ qsv load data.csv.gz
$ qsv load data1.csv data2.csv data3.csv
$ qsv load "logs/*.tsv" -s \t
$ qsv load logs/*.tsv --separator=\t
$ qsv load data.csv --low-memory
$ qsv load data.csv --no-headers
$ qsv load data.csv --chunk-size 50000
Select columns by name, numeric index, or range notation.
Parameter | Type | Default | Description |
---|---|---|---|
colnames | str/list/range | | Column name(s) or indices. Supports multiple formats (see examples below). Required. |
Column Selection Formats:
- Individual columns: `col1,col3` - select specific columns by name
- Numeric indices: `1,3` - select columns by position (1-based indexing)
- Range notation (hyphen): `col1-col3` - select a range using a hyphen
- Range notation (colon): `col1:col3` - select a range using a colon
- Numeric range: `2:4` - select the 2nd through 4th columns (e.g., col1, col2, col3)
- Quoted colon notation: `"col:1":"col:3"` - for column names containing colons
- Mixed formats: `1,col2,4:6` - combine different selection methods
$ qsv load data.csv - select datetime # Select single column by name
$ qsv load data.csv - select col1,col3 # Select specific columns by name
$ qsv load data.csv - select col1-col3 # Select range using hyphen
$ qsv load data.csv - select col1:col3 # Select range using colon
$ qsv load data.csv - select 1 # Select 1st column (datetime)
$ qsv load data.csv - select 2:4 # Select 2nd-4th columns (col1, col2, col3)
$ qsv load data.csv - select 2,4 # Select 2nd and 4th columns (col1, col3)
$ qsv load data.csv - select "col:1":"col:3" # For columns with colons in names
$ qsv load data.csv - select 1,datetime,3:5 # Mixed selection methods
Filter rows where a column matches any of the given values.
Parameter | Type | Default | Description |
---|---|---|---|
colname | str | | Column name to filter. Required. |
values | list | | Comma-separated values. Filters rows where the column matches any of these values (OR condition). Required. |
$ qsv load data.csv - isin col1 1
$ qsv load data.csv - isin col1 1,4
Filter rows where a column contains a specific literal substring.
Parameter | Type | Default | Description |
---|---|---|---|
colname | str | | Column name to search. Required. |
substring | str | | The literal substring to search for. Required. |
-i, --ignorecase | flag | false | Perform case-insensitive matching. |
$ qsv load data.csv - contains str ba
$ qsv load data.csv - contains str BA -i
$ qsv load data.csv - contains str BA --ignorecase
Replace values in column(s) using a Regex pattern.
Parameter | Type | Default | Description |
---|---|---|---|
pattern | str | | Regex pattern to search for. Required. |
replacement | str | | Replacement string. Required. |
--column | str | (all) | Apply replacement to a specific column only. If not specified, applies to all columns. |
-i, --ignorecase | flag | false | Perform case-insensitive matching. |
$ qsv load data.csv - sed foo foooooo # Replace 'foo' with 'foooooo' in all columns
$ qsv load data.csv - sed foo foooooo --column str # Replace 'foo' with 'foooooo' in 'str' column only
$ qsv load data.csv - sed FOO foooooo -i # Case-insensitive replacement in all columns
$ qsv load data.csv - sed ".*o.*" foooooo --column str # Regex replacement in specific column
Filter rows where any column matches a regex pattern.
Parameter | Type | Default | Description |
---|---|---|---|
pattern | str | | Regex pattern to search for in any column. Required. |
-i, --ignorecase | flag | false | Perform case-insensitive matching. |
-v, --invert-match | flag | false | Invert the sense of matching, to select non-matching rows. |
Example:
$ qsv load data.csv - grep foo
$ qsv load data.csv - grep "^FOO" -i # Case-insensitive search
$ qsv load data.csv - grep "^FOO" --ignorecase # Long form case-insensitive
$ qsv load data.csv - grep "^FOO" -i -v # Case-insensitive inverted match
$ qsv load data.csv - grep "^FOO" --ignorecase --invert-match # Long form inverted match
Displays the first N rows of the dataset.
Parameter | Type | Default | Description |
---|---|---|---|
number | int | | Number of rows to display. This is a required positional argument. |
$ qsv load data.csv - head 3
$ qsv load data.csv - head 10
Displays the last N rows of the dataset.
Parameter | Type | Default | Description |
---|---|---|---|
number | int | | Number of rows to display. This is a required positional argument. |
$ qsv load data.csv - tail 3
$ qsv load data.csv - tail 10
Sorts the dataset based on the specified column(s).
Parameter | Type | Default | Description |
---|---|---|---|
colnames | str/list | | Column name(s) to sort by. Comma-separated for multiple columns (e.g., `col1,col3`) or a single column name. Required. |
-d, --desc | flag | false | Sort in descending order. Applies to all specified columns. |
$ qsv load data.csv - sort str
$ qsv load data.csv - sort str -d
$ qsv load data.csv - sort str --desc
$ qsv load data.csv - sort col1,col2,col3 --desc
Count duplicate rows, grouping by all columns. Results are automatically sorted by count in descending order.
Parameter | Type | Default | Description |
---|---|---|---|
(None) | | | Takes no arguments. Automatically sorts output by the count column in descending order. |
$ qsv load data.csv - count
$ qsv load data.csv - count - sort col1 # Count and then sort by col1 instead
Filters unique rows, removing duplicates based on all columns.
Parameter | Type | Default | Description |
---|---|---|---|
(None) | | | Takes no arguments. Removes duplicate rows based on all columns. |
$ qsv load data.csv - uniq
Changes the timezone of a datetime column.
Parameter | Type | Default | Description |
---|---|---|---|
colname | str | | Name of the datetime column. Required. |
--from-tz | str | | Source timezone (e.g., `UTC`, `America/New_York`, `local`). Required. |
--to-tz | str | | Target timezone (e.g., `Asia/Tokyo`). Required. |
--input-format | str | auto | Input datetime format string (e.g., `%Y-%m-%d %H:%M:%S%.f`). `auto` uses intelligent parsing similar to Python's dateutil.parser, supporting fuzzy parsing and automatic format detection. |
--output-format | str | auto | Output datetime format string (e.g., `%Y/%m/%d %H:%M:%S`). `auto` uses the ISO8601 format `%Y-%m-%dT%H:%M:%S%.7f%:z` (100-nanosecond precision for Windows forensics). |
--ambiguous | str | earliest | Strategy for ambiguous times during DST transitions: `earliest` (first occurrence) or `latest` (second occurrence). |
Understanding the `--ambiguous` option:

During Daylight Saving Time (DST) transitions in autumn, clocks "fall back", creating duplicate hours. For example, 2:30 AM occurs twice:
- First time: 2:30 AM DST (before the transition)
- Second time: 2:30 AM Standard Time (after the transition)

When encountering such ambiguous times:
- `earliest`: uses the first occurrence (DST time)
- `latest`: uses the second occurrence (Standard time)
Example:
$ qsv load data.csv - changetz datetime --from-tz UTC --to-tz Asia/Tokyo
# Output: 2023-01-01T09:00:00.123456+09:00 (ISO8601 with microsecond precision)
$ qsv load data.csv - changetz datetime --from-tz UTC --to-tz America/New_York --input-format "%Y/%m/%d %H:%M" --output-format "%Y-%m-%d %H:%M:%S"
# Custom output format
$ qsv load data.csv - changetz datetime --from-tz America/New_York --to-tz UTC --ambiguous latest
# Handle ambiguous DST times
# Automatic format detection (similar to Python dateutil.parser):
$ qsv load logs.csv - changetz timestamp --from-tz local --to-tz UTC
# Handles: "Jan 15, 2023 2:30 PM", "2023/01/15 14:30", "15-Jan-2023 14:30:00", etc.
# Fuzzy parsing with embedded text:
$ qsv load events.csv - changetz event_time --from-tz EST --to-tz UTC
# Handles: "Meeting on January 15th, 2023 at 2:30 PM", "Call scheduled for Jan 15 2023"
TODO: Upgrade to 7-digit sub-second precision (100-nanosecond precision for Windows FILETIME compatibility) when chrono-tz library supports it.
Renames a specific column.
Parameter | Type | Default | Description |
---|---|---|---|
old_name | str | | The current column name. Required. |
new_name | str | | The new column name. Required. |
$ qsv load data.csv - renamecol current_name new_name
Converts data formats between JSON, YAML, and XML. Also supports formatting/prettifying data in the same format.
Parameter | Type | Default | Description |
---|---|---|---|
colname | str | | Column name containing the data to convert. Required. |
--from | str | | Source format: `json`, `yaml`, or `xml`. Required. |
--to | str | | Target format: `json`, `yaml`, or `xml`. Required. |
Supported conversions:
- Cross-format: `json ↔ yaml`, `json ↔ xml`, `yaml ↔ xml`
- Same-format (formatting): `json → json`, `yaml → yaml`, `xml → xml`
Features:
- Automatically handles malformed JSON with extra quotes
- Prettifies and formats data for better readability
- Preserves data structure during conversion
Example:
$ qsv load data.csv - convert json_col --from json --to yaml
$ qsv load data.csv - convert config --from yaml --to json
$ qsv load data.csv - convert data --from json --to xml
$ qsv load data.csv - convert messy_json --from json --to json # Format/prettify JSON
$ qsv load data.csv - convert compact_yaml --from yaml --to yaml # Format YAML
Aggregates data by time intervals, creating time-based summaries.
Parameter | Type | Default | Description |
---|---|---|---|
time_column | str | | Name of the datetime column to use for time bucketing. Required. |
--interval | str | | Time interval for aggregation (e.g., `1h`, `30m`, `5s`, `1d`). Required. |
--sum | str | | Column name to sum within each time bucket. Optional. |
--avg | str | | Column name to average within each time bucket. Optional. |
--min | str | | Column name to find the minimum of within each time bucket. Optional. |
--max | str | | Column name to find the maximum of within each time bucket. Optional. |
--std | str | | Column name to calculate the standard deviation of within each time bucket. Optional. |
Features:
- Creates a time bucket column named `timeline_{interval}` (e.g., `timeline_1h`, `timeline_30m`)
- If no aggregation column is specified, only row counts are provided for each time bucket
- Supports various time interval formats: hours (`1h`), minutes (`30m`), seconds (`5s`), days (`1d`)
Example:
$ qsv load access.log - timeline timestamp --interval 1h
# Creates column: timeline_1h
$ qsv load metrics.csv - timeline time --interval 5m --avg cpu_usage
# Creates columns: timeline_5m, count, avg_cpu_usage
$ qsv load sales.csv - timeline date --interval 1d --sum amount
# Creates columns: timeline_1d, count, sum_amount
$ qsv load server.log - timeline timestamp --interval 30s --max response_time
# Creates columns: timeline_30s, count, max_response_time
Filters data based on time ranges, extracting records within specified time boundaries.
Parameter | Type | Default | Description |
---|---|---|---|
time_column | str | | Name of the datetime column to filter on. Required. |
--start | str | | Start time (inclusive). Optional. |
--end | str | | End time (inclusive). Optional. |
At least one of `--start` or `--end` must be specified. Supports various datetime formats, including ISO8601, timestamps, and common log formats.
Example:
$ qsv load data.csv - timeslice timestamp --start "2023-01-01 00:00:00"
$ qsv load data.csv - timeslice timestamp --end "2023-12-31 23:59:59"
$ qsv load data.csv - timeslice timestamp --start "2023-06-01" --end "2023-06-30"
$ qsv load access.log - timeslice timestamp --start "2023-01-01T10:00:00"
Creates pivot tables with cross-tabulation functionality.
Parameter | Type | Default | Description |
---|---|---|---|
--rows | str | | Comma-separated list of columns for rows. Optional. |
--cols | str | | Comma-separated list of columns for columns. Optional. |
--values | str | | Column to aggregate values from. Required. |
--agg | str | | Aggregation function: `sum`, `mean`, `count`, `min`, `max`, `median`, `std`. Optional (default behavior depends on implementation). |
At least one of `--rows` or `--cols` must be specified. Creates a cross-tabulation table with the specified row and column groupings, aggregating values using the chosen function.
Example:
$ qsv load sales.csv - pivot --rows region --cols product --values sales_amount --agg sum
$ qsv load data.csv - pivot --rows category --cols year --values revenue --agg mean
$ qsv load logs.csv - pivot --rows date --cols error_type --values count --agg count
$ qsv load metrics.csv - pivot --rows department --values performance --agg median
Rounds datetime values down to a specified time unit. When `--output` is given, the rounded values are written to a new column and the original is preserved; otherwise the original column is replaced.
Parameter | Type | Default | Description |
---|---|---|---|
colname | str | | Name of the datetime column to round. Required. |
--unit | str | | Time unit for rounding: `y`/`year`, `M`/`month`, `d`/`day`, `h`/`hour`, `m`/`minute`, `s`/`second`. Required. |
--output | str | (replaces original) | Name for the output column. If not specified, replaces the original column. |
Features:
- Rounds datetime values down to the nearest specified time unit boundary
- Useful for time-based grouping and analysis
- Supports both short (`h`, `d`) and long (`hour`, `day`) unit names
- Output format automatically adjusts to the specified unit (clean, minimal format)
Output formats by unit:
- year (y): `2023`
- month (M): `2023-01`
- day (d): `2023-01-01`
- hour (h): `2023-01-01 12`
- minute (m): `2023-01-01 12:34`
- second (s): `2023-01-01 12:34:56`
Example:
$ qsv load data.csv - timeround timestamp --unit d --output date_only
# Input: 2023-01-01 12:34:56
# Output: 2023-01-01
$ qsv load data.csv - timeround timestamp --unit h --output hour_rounded
# Input: 2023-01-01 12:34:56
# Output: 2023-01-01 12
$ qsv load logs.csv - timeround timestamp --unit m
# Rounds to minute boundary, replaces original column
$ qsv load metrics.csv - timeround created_at --unit year --output created_year
# Input: 2023-01-01 12:34:56
# Output: 2023
Finalizers are used to output or summarize the processed data. They are typically the last command in a chain.
Splits data into separate CSV files based on unique values in a specified column. Each unique value creates its own file.
Parameter | Type | Default | Description |
---|---|---|---|
colname | str | | Column name to partition by. Required. |
output_directory | str | | Directory to save partitioned files. Required. |
The output directory will be created if it doesn't exist. Each file is named after the unique value in the partition column (with invalid filename characters replaced by underscores).
Example:
$ qsv load data.csv - partition category ./partitions/
$ qsv load sales.csv - partition region ./by_region/
$ qsv load logs.csv - partition date ./daily_logs/
$ qsv load data.csv - select col1,col2 - partition col1 ./numeric_partitions/
Displays the column headers of the current dataset.
Parameter | Type | Default | Description |
---|---|---|---|
-p, --plain | flag | false | Display headers as plain text, one per line, instead of a formatted table. |
Example:
$ qsv load data.csv - headers
$ qsv load data.csv - headers -p
$ qsv load data.csv - headers --plain
Displays summary statistics for each column in the dataset (e.g., count, null_count, mean, std, min, max).
This command does not take any arguments or options.
Example:
$ qsv load data.csv - stats
Displays the Polars LazyFrame query plan. This is useful for debugging and understanding the operations being performed.
This command does not take any arguments or options.
Example:
$ qsv load data.csv - select col1 - showquery
Displays the resulting data as CSV to standard output. Header is included by default.
This command does not take any arguments or options.
Example:
$ qsv load data.csv - head 5 - show
Displays the resulting data in a formatted table to standard output. Shows table dimensions and intelligently truncates large datasets.
Features:
- Displays table size information (rows × columns) like Python Polars
- For datasets with 8+ rows: shows the first 3 rows, a truncation indicator (`…`), and the last 3 rows
- For datasets with 7 or fewer rows: shows all rows without truncation
- Automatically used as the default finalizer when no explicit finalizer is specified
This command does not take any arguments or options.
Example:
$ qsv load data.csv - select col1,col2 - head 3 - showtable
# Output includes: shape: (3, 2) followed by formatted table
$ qsv load large_data.csv - select col1,col2
# Automatically calls showtable if no finalizer specified
Outputs the processing results to a CSV file.
Parameter | Type | Default | Description |
---|---|---|---|
-o, --output | str | | File path to save the CSV data. Required. |
-s, --separator | char | `,` | Field separator character for the output CSV file. |
Example:
$ qsv load data.csv - head 100 - dump -o results.csv
$ qsv load data.csv - head 100 - dump --output results.csv
$ qsv load data.csv - head 100 - dump -o results.csv -s ';'
Quilt allows you to define complex data processing workflows in YAML configuration files. This is useful for automating repetitive tasks or creating reusable data processing pipelines.
The `quilt` command takes the path to a YAML configuration file. Input data sources and other parameters are typically defined within the YAML file.
$ qsv quilt <config_file_path.yaml> [options]
Parameter | Type | Description |
---|---|---|
config_file_path.yaml | str | Path to the YAML configuration file defining the pipeline stages. Required. |
-o, --output | str | Overrides the output path defined in the YAML config for the final dump operation (if any). |
$ qsv quilt rules/my_workflow.yaml
$ qsv quilt rules/my_analysis.yaml -o custom_output.csv
The YAML configuration file (e.g., `rules/my_workflow.yaml`) defines the stages and steps. For example, the sample YAML (`rules/test.yaml`) sketched below defines a pipeline that:
- Loads data (implicitly or explicitly via a `load` step in a `process` stage)
- Performs selections and a join operation across different stages
- Displays the final result as a table
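A minimal sketch of what such a configuration might look like (the stage names, file paths, and column names below are illustrative assumptions, not the exact contents of the shipped `rules/test.yaml`):

```yaml
# Hypothetical Quilt configuration; the exact top-level layout may differ.
stages:
  left:
    type: process
    steps:
      load:
        path: data/left.csv        # assumed sample file
      select:
        colnames: id,name
  right:
    type: process
    steps:
      load:
        path: data/right.csv       # assumed sample file
      select:
        colnames: id,score
  joined:
    type: join
    sources: [left, right]         # outputs of the two process stages above
    params:
      on: id                       # join key
      how: inner
  result:
    type: process
    source: joined                 # take the joined stage's output as input
    steps:
      showtable:
```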
Within a Quilt YAML file, stages can be of different types to orchestrate the flow.
Operation Type | Description | Key Parameters |
---|---|---|
process | Executes a series of qsv operations on a dataset. | `steps`: Dictionary of operations (e.g., `load`, `select`, `head`, `showtable`). Each key is a qsv command, and its value contains arguments/options. `source` (optional): Specifies the output of a previous stage as input. |
concat | Concatenates multiple datasets (stages). | `sources`: List of stage names whose outputs to concatenate. `params.how` (optional): Method for concatenation, e.g., `vertical` (default), `horizontal`. |
join | Joins datasets from multiple stages based on keys. | `sources`: List of two stage names whose outputs to join. `params.left_on`/`params.right_on` or `params.on`: Column(s) for joining. `params.how` (optional): Join type (`inner`, `left`, `outer`, `cross`). |
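As a small sketch of a `concat` stage (the stage names here are assumptions and would refer to other stages defined in the same file):

```yaml
# Hypothetical fragment: vertically stack the outputs of two process stages.
merged:
  type: concat
  sources: [january, february]   # assumed names of earlier process stages
  params:
    how: vertical                # default; 'horizontal' is also supported
```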
Download the latest release from GitHub Releases.
git clone https://github.com/sumeshi/qsv-rs.git
cd qsv-rs
cargo build --release
This project includes a comprehensive test suite to ensure all functionality works correctly.
To run the complete test suite (132 tests covering all features):
$ python3 tests/run_tests.py
This will execute all tests for:
- Initializers (1 test module): `load`
- Chainables (17 test modules): `select`, `isin`, `contains`, `sed`, `grep`, `head`, `tail`, `sort`, `count`, `uniq`, `changetz`, `renamecol`, `convert`, `timeline`, `timeslice`, `pivot`, `timeround`
- Finalizers (7 test modules): `show`, `showtable`, `headers`, `stats`, `showquery`, `dump`, `partition`
- Quilters (1 test module): `quilt`
You can also run individual test modules:
# Test a specific feature
$ python3 tests/test_chainables_select.py
$ python3 tests/test_finalizers_show.py
$ python3 tests/test_quilters_quilt.py
# Test using unittest module
$ python3 -m unittest tests.test_chainables_select
The test suite provides 100% feature coverage:
- All commands and options are tested
- Various data formats and edge cases are covered
- Error handling and validation are verified
- Integration between different operations is tested
When adding new features or test cases:
- Create test files following the naming convention: `test_{category}_{feature}.py`
- Inherit from the `QsvTestBase` class for consistent test infrastructure
- Manually add the new test class to `tests/run_tests.py` in the appropriate section:
  - Add an import statement at the top
  - Add the test class to the corresponding list (`initializers`, `chainables`, `finalizers`, or `quilters`)
- Use existing fixture files in `tests/fixtures/` or create new ones as needed
- Ensure tests are self-contained and clean up any temporary files
Example of adding a new test to `run_tests.py`:
# Add import
from test_chainables_newfeature import TestNewFeature
# Add to chainables list
chainables = [
TestSelect,
TestHead,
# ... existing tests ...
TestNewFeature, # Add your new test here
]
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
Inspired by xsv.