Quilter-CSV


qsv-rs

A fast, flexible, and memory-efficient command-line tool written in Rust for processing large CSV files. Inspired by xsv and built on Polars, it's designed for handling tens or hundreds of gigabytes of CSV data efficiently in workflows like log analysis and digital forensics.

Note

The original version of this project was implemented in Python and can be found at sumeshi/quilter-csv. This Rust version is a complete rewrite.

Features

  • Pipeline-style command chaining: Chain multiple commands in a single line for fast and efficient data processing
  • Flexible filtering and transformation: Perform operations like select, filter, sort, deduplicate, and timezone conversion
  • YAML-based batch processing (Quilt): Automate complex workflows using YAML configuration files

Usage

Getting Help

To see available commands and options, run qsv without any arguments or pass the -h flag:

$ qsv -h

Example

Here's an example of reading a CSV file, extracting rows that contain 4624 in the 'Event ID' column, and displaying the top 3 rows sorted by the 'Date and Time' column:

$ qsv load Security.csv - isin 'Event ID' 4624 - sort 'Date and Time' - head 3 - showtable

This command:

  1. Loads Security.csv
  2. Filters rows where Event ID is 4624
  3. Sorts by Date and Time
  4. Shows the first 3 rows as a table

Command Structure

Quilter-CSV commands are composed of three types of steps:

  • Initializer: Loads data (e.g., load)
  • Chainable: Transforms or filters data (e.g., select, grep, sort, etc.)
  • Finalizer: Outputs or summarizes data (e.g., show, showtable, headers, etc.)

Each step is separated by a hyphen (-):

$ qsv <INITIALIZER> <args> - <CHAINABLE> <args> - <FINALIZER> <args>

Note: If no finalizer is explicitly specified, showtable is automatically used as the default finalizer, making it easy to quickly view results:

$ qsv load data.csv - select col1,col2 - head 5
# Equivalent to:
$ qsv load data.csv - select col1,col2 - head 5 - showtable

Command Reference

Initializers

load

Loads one or more CSV files.

  • path (list[str], required): One or more paths to CSV files. Glob patterns are supported. Gzip files (.gz) are automatically detected and decompressed.
  • -s, --separator (str, default: ,): Field separator character.
  • --low-memory (flag, default: false): Enable low-memory mode for very large files.
  • --no-headers (flag, default: false): Treat the first row as data, not headers. When enabled, columns are named automatically (column_0, column_1, etc.).
  • --chunk-size (int, default: auto): Number of rows to read per chunk. Controls memory usage during file processing. Smaller values use less memory but increase I/O operations.

Example:

$ qsv load data.csv
$ qsv load data.csv.gz
$ qsv load data1.csv data2.csv data3.csv
$ qsv load "logs/*.tsv" -s \t
$ qsv load logs/*.tsv --separator=\t
$ qsv load data.csv --low-memory
$ qsv load data.csv --no-headers
$ qsv load data.csv --chunk-size 50000

Chainable Functions

select

Selects columns by name, numeric index, or range notation.

  • colnames (str/list/range, required): Column name(s) or indices. Supports multiple formats (see examples below).

Column Selection Formats:

  • Individual columns: col1,col3 - Select specific columns by name
  • Numeric indices: 1,3 - Select columns by position (1-based indexing)
  • Range notation (hyphen): col1-col3 - Select range using hyphen
  • Range notation (colon): col1:col3 - Select range using colon
  • Numeric range: 2:4 - Select 2nd through 4th columns (e.g., col1, col2, col3)
  • Quoted colon notation: "col:1":"col:3" - For column names containing colons
  • Mixed formats: 1,col2,4:6 - Combine different selection methods
$ qsv load data.csv - select datetime                       # Select single column by name
$ qsv load data.csv - select col1,col3                      # Select specific columns by name
$ qsv load data.csv - select col1-col3                      # Select range using hyphen
$ qsv load data.csv - select col1:col3                      # Select range using colon
$ qsv load data.csv - select 1                              # Select 1st column (datetime)
$ qsv load data.csv - select 2:4                            # Select 2nd-4th columns (col1, col2, col3)
$ qsv load data.csv - select 2,4                            # Select 2nd and 4th columns (col1, col3)
$ qsv load data.csv - select "col:1":"col:3"                # For columns with colons in names
$ qsv load data.csv - select 1,datetime,3:5                 # Mixed selection methods

isin

Filters rows where a column matches any of the given values.

  • colname (str, required): Column name to filter.
  • values (list, required): Comma-separated values. Filters rows where the column matches any of these values (OR condition).
$ qsv load data.csv - isin col1 1
$ qsv load data.csv - isin col1 1,4

contains

Filters rows where a column contains a specific literal substring.

  • colname (str, required): Column name to search.
  • substring (str, required): The literal substring to search for.
  • -i, --ignorecase (flag, default: false): Perform case-insensitive matching.
$ qsv load data.csv - contains str ba
$ qsv load data.csv - contains str BA -i
$ qsv load data.csv - contains str BA --ignorecase

sed

Replaces values in column(s) using a regex pattern.

  • pattern (str, required): Regex pattern to search for.
  • replacement (str, required): Replacement string.
  • --column (str, default: all columns): Apply the replacement to a specific column only.
  • -i, --ignorecase (flag, default: false): Perform case-insensitive matching.
$ qsv load data.csv - sed foo foooooo                       # Replace 'foo' with 'foooooo' in all columns
$ qsv load data.csv - sed foo foooooo --column str          # Replace 'foo' with 'foooooo' in 'str' column only
$ qsv load data.csv - sed FOO foooooo -i                    # Case-insensitive replacement in all columns
$ qsv load data.csv - sed ".*o.*" foooooo --column str      # Regex replacement in specific column

grep

Filters rows where any column matches a regex pattern.

  • pattern (str, required): Regex pattern to search for in any column.
  • -i, --ignorecase (flag, default: false): Perform case-insensitive matching.
  • -v, --invert-match (flag, default: false): Invert the sense of matching, selecting non-matching rows.

Example:

$ qsv load data.csv - grep foo 
$ qsv load data.csv - grep "^FOO" -i                        # Case-insensitive search
$ qsv load data.csv - grep "^FOO" --ignorecase              # Long form case-insensitive
$ qsv load data.csv - grep "^FOO" -i -v                     # Case-insensitive inverted match
$ qsv load data.csv - grep "^FOO" --ignorecase --invert-match  # Long form inverted match

head

Keeps only the first N rows of the dataset.

  • number (int, required): Number of rows to keep. Positional argument.
$ qsv load data.csv - head 3
$ qsv load data.csv - head 10

tail

Keeps only the last N rows of the dataset.

  • number (int, required): Number of rows to keep. Positional argument.
$ qsv load data.csv - tail 3
$ qsv load data.csv - tail 10

sort

Sorts the dataset based on the specified column(s).

  • colnames (str/list, required): Column name(s) to sort by. Comma-separated for multiple columns (e.g., col1,col3) or a single column name.
  • -d, --desc (flag, default: false): Sort in descending order. Applies to all specified columns.
$ qsv load data.csv - sort str
$ qsv load data.csv - sort str -d
$ qsv load data.csv - sort str --desc
$ qsv load data.csv - sort col1,col2,col3 --desc

count

Counts duplicate rows, grouping by all columns. Results are automatically sorted by count in descending order.

This command does not take any arguments or options.
$ qsv load data.csv - count
$ qsv load data.csv - count - sort col1  # Count and then sort by col1 instead

uniq

Filters unique rows, removing duplicates based on all columns.

This command does not take any arguments or options.
$ qsv load data.csv - uniq

changetz

Changes the timezone of a datetime column.

  • colname (str, required): Name of the datetime column.
  • --from-tz (str, required): Source timezone (e.g., UTC, America/New_York, local).
  • --to-tz (str, required): Target timezone (e.g., Asia/Tokyo).
  • --input-format (str, default: auto): Input datetime format string (e.g., %Y-%m-%d %H:%M:%S%.f). auto uses intelligent parsing similar to Python's dateutil.parser, supporting fuzzy parsing and automatic format detection.
  • --output-format (str, default: auto): Output datetime format string (e.g., %Y/%m/%d %H:%M:%S). auto uses the ISO8601 format %Y-%m-%dT%H:%M:%S%.7f%:z (100-nanosecond precision for Windows forensics).
  • --ambiguous (str, default: earliest): Strategy for ambiguous times during DST transitions: earliest (first occurrence) or latest (second occurrence).

Understanding --ambiguous option:

During Daylight Saving Time (DST) transitions in autumn, clocks "fall back" creating duplicate hours. For example, 2:30 AM occurs twice:

  • First time: 2:30 AM DST (before transition)
  • Second time: 2:30 AM Standard Time (after transition)

When encountering such ambiguous times:

  • earliest: Uses the first occurrence (DST time)
  • latest: Uses the second occurrence (Standard time)

Example:

$ qsv load data.csv - changetz datetime --from-tz UTC --to-tz Asia/Tokyo
# Output: 2023-01-01T09:00:00.123456+09:00 (ISO8601 with microsecond precision)

$ qsv load data.csv - changetz datetime --from-tz UTC --to-tz America/New_York --input-format "%Y/%m/%d %H:%M" --output-format "%Y-%m-%d %H:%M:%S"
# Custom output format

$ qsv load data.csv - changetz datetime --from-tz America/New_York --to-tz UTC --ambiguous latest
# Handle ambiguous DST times

# Automatic format detection (similar to Python dateutil.parser):
$ qsv load logs.csv - changetz timestamp --from-tz local --to-tz UTC
# Handles: "Jan 15, 2023 2:30 PM", "2023/01/15 14:30", "15-Jan-2023 14:30:00", etc.

# Fuzzy parsing with embedded text:
$ qsv load events.csv - changetz event_time --from-tz EST --to-tz UTC  
# Handles: "Meeting on January 15th, 2023 at 2:30 PM", "Call scheduled for Jan 15 2023"

TODO: Upgrade to 7-digit sub-second precision (100-nanosecond precision for Windows FILETIME compatibility) when chrono-tz library supports it.

renamecol

Renames a specific column.

  • old_name (str, required): The current column name.
  • new_name (str, required): The new column name.
$ qsv load data.csv - renamecol current_name new_name

convert

Converts data formats between JSON, YAML, and XML. Also supports formatting/prettifying data in the same format.

  • colname (str, required): Column name containing the data to convert.
  • --from (str, required): Source format: json, yaml, or xml.
  • --to (str, required): Target format: json, yaml, or xml.

Supported conversions:

  • Cross-format: json ↔ yaml, json ↔ xml, yaml ↔ xml
  • Same-format (formatting): json → json, yaml → yaml, xml → xml

Features:

  • Automatically handles malformed JSON with extra quotes
  • Prettifies and formats data for better readability
  • Preserves data structure during conversion

Example:

$ qsv load data.csv - convert json_col --from json --to yaml
$ qsv load data.csv - convert config --from yaml --to json
$ qsv load data.csv - convert data --from json --to xml
$ qsv load data.csv - convert messy_json --from json --to json  # Format/prettify JSON
$ qsv load data.csv - convert compact_yaml --from yaml --to yaml  # Format YAML
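
As an illustration of the first command above, a cell in json_col holding {"name": "alice", "id": 1} (hypothetical data) would be rewritten as the equivalent YAML:

name: alice
id: 1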

timeline

Aggregates data by time intervals, creating time-based summaries.

  • time_column (str, required): Name of the datetime column to use for time bucketing.
  • --interval (str, required): Time interval for aggregation (e.g., 1h, 30m, 5s, 1d).
  • --sum (str, optional): Column name to sum within each time bucket.
  • --avg (str, optional): Column name to average within each time bucket.
  • --min (str, optional): Column name to find the minimum of within each time bucket.
  • --max (str, optional): Column name to find the maximum of within each time bucket.
  • --std (str, optional): Column name to calculate the standard deviation of within each time bucket.

Features:

  • Creates a time bucket column named timeline_{interval} (e.g., timeline_1h, timeline_30m)
  • If no aggregation column is specified, only row counts are provided for each time bucket
  • Supports various time interval formats: hours (1h), minutes (30m), seconds (5s), days (1d)

Example:

$ qsv load access.log - timeline timestamp --interval 1h
# Creates column: timeline_1h

$ qsv load metrics.csv - timeline time --interval 5m --avg cpu_usage
# Creates columns: timeline_5m, count, avg_cpu_usage

$ qsv load sales.csv - timeline date --interval 1d --sum amount
# Creates columns: timeline_1d, count, sum_amount

$ qsv load server.log - timeline timestamp --interval 30s --max response_time
# Creates columns: timeline_30s, count, max_response_time

timeslice

Filters data based on time ranges, extracting records within specified time boundaries.

  • time_column (str, required): Name of the datetime column to filter on.
  • --start (str, optional): Start time (inclusive).
  • --end (str, optional): End time (inclusive).

At least one of --start or --end must be specified. Supports various datetime formats including ISO8601, timestamps, and common log formats.

Example:

$ qsv load data.csv - timeslice timestamp --start "2023-01-01 00:00:00"
$ qsv load data.csv - timeslice timestamp --end "2023-12-31 23:59:59"
$ qsv load data.csv - timeslice timestamp --start "2023-06-01" --end "2023-06-30"
$ qsv load access.log - timeslice timestamp --start "2023-01-01T10:00:00"

pivot

Creates pivot tables with cross-tabulation functionality.

  • --rows (str, optional): Comma-separated list of columns to use for rows.
  • --cols (str, optional): Comma-separated list of columns to use for columns.
  • --values (str, required): Column to aggregate values from.
  • --agg (str, optional): Aggregation function: sum, mean, count, min, max, median, std. If omitted, the default behavior depends on the implementation.

At least one of --rows or --cols must be specified. Creates a cross-tabulation table with specified row and column groupings, aggregating values using the chosen function.

Example:

$ qsv load sales.csv - pivot --rows region --cols product --values sales_amount --agg sum
$ qsv load data.csv - pivot --rows category --cols year --values revenue --agg mean
$ qsv load logs.csv - pivot --rows date --cols error_type --values count --agg count
$ qsv load metrics.csv - pivot --rows department --values performance --agg median

timeround

Rounds datetime values down to a specified time unit. By default the rounded values replace the original column; with --output, they are written to a new column and the original is preserved.

  • colname (str, required): Name of the datetime column to round.
  • --unit (str, required): Time unit for rounding: y/year, M/month, d/day, h/hour, m/minute, s/second.
  • --output (str, optional): Name for the output column. If not specified, the rounded values replace the original column.

Features:

  • Rounds datetime values down to the nearest specified time unit boundary
  • Useful for time-based grouping and analysis
  • Supports both short (h, d) and long (hour, day) unit names
  • Output format automatically adjusts to the specified unit (clean, minimal format)

Output formats by unit:

  • year (y): 2023
  • month (M): 2023-01
  • day (d): 2023-01-01
  • hour (h): 2023-01-01 12
  • minute (m): 2023-01-01 12:34
  • second (s): 2023-01-01 12:34:56

Example:

$ qsv load data.csv - timeround timestamp --unit d --output date_only
# Input:  2023-01-01 12:34:56
# Output: 2023-01-01

$ qsv load data.csv - timeround timestamp --unit h --output hour_rounded
# Input:  2023-01-01 12:34:56
# Output: 2023-01-01 12

$ qsv load logs.csv - timeround timestamp --unit m
# Rounds to minute boundary, replaces original column

$ qsv load metrics.csv - timeround created_at --unit year --output created_year
# Input:  2023-01-01 12:34:56
# Output: 2023

Finalizers

Finalizers are used to output or summarize the processed data. They are typically the last command in a chain.

partition

Splits data into separate CSV files based on unique values in a specified column. Each unique value creates its own file.

  • colname (str, required): Column name to partition by.
  • output_directory (str, required): Directory to save partitioned files.

The output directory will be created if it doesn't exist. Each file is named after the unique value in the partition column (with invalid filename characters replaced by underscores).

Example:

$ qsv load data.csv - partition category ./partitions/
$ qsv load sales.csv - partition region ./by_region/
$ qsv load logs.csv - partition date ./daily_logs/
$ qsv load data.csv - select col1,col2 - partition col1 ./numeric_partitions/
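
As an illustration, if the category column in the first command contains the values web, db, and mail (hypothetical data), the run produces:

./partitions/web.csv
./partitions/db.csv
./partitions/mail.csv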

headers

Displays the column headers of the current dataset.

  • -p, --plain (flag, default: false): Display headers as plain text, one per line, instead of a formatted table.

Example:

$ qsv load data.csv - headers
$ qsv load data.csv - headers -p
$ qsv load data.csv - headers --plain

stats

Displays summary statistics for each column in the dataset (e.g., count, null_count, mean, std, min, max).

This command does not take any arguments or options.

Example:

$ qsv load data.csv - stats

showquery

Displays the Polars LazyFrame query plan. This is useful for debugging and understanding the operations being performed.

This command does not take any arguments or options.

Example:

$ qsv load data.csv - select col1 - showquery

show

Displays the resulting data as CSV to standard output. Header is included by default.

This command does not take any arguments or options.

Example:

$ qsv load data.csv - head 5 - show

showtable

Displays the resulting data in a formatted table to standard output. Shows table dimensions and intelligently truncates large datasets.

Features:

  • Displays table size information (rows × columns) like Python Polars
  • For datasets with 8+ rows: shows the first 3 rows, a truncation indicator (…), and the last 3 rows
  • For datasets with 7 or fewer rows: shows all rows without truncation
  • Automatically used as default finalizer when no explicit finalizer is specified

This command does not take any arguments or options.

Example:

$ qsv load data.csv - select col1,col2 - head 3 - showtable
# Output includes: shape: (3, 2) followed by formatted table

$ qsv load large_data.csv - select col1,col2
# Automatically calls showtable if no finalizer specified

dump

Outputs the processing results to a CSV file.

  • -o, --output (str, required): File path to save the CSV data.
  • -s, --separator (char, default: ,): Field separator character for the output CSV file.

Example:

$ qsv load data.csv - head 100 - dump -o results.csv
$ qsv load data.csv - head 100 - dump --output results.csv
$ qsv load data.csv - head 100 - dump -o results.csv -s ';'

Quilt (YAML Workflows)

Quilt allows you to define complex data processing workflows in YAML configuration files. This is useful for automating repetitive tasks or creating reusable data processing pipelines.

Usage

The quilt command itself takes the path to a YAML configuration file. Input data sources and other parameters are typically defined within the YAML file.

$ qsv quilt <config_file_path.yaml> [options]

  • config_file_path.yaml (str, required): Path to the YAML configuration file defining the pipeline stages.
  • -o, --output (str, optional): Overrides the output path defined in the YAML config for the final dump operation (if any).

Example: Running a Quilt File

$ qsv quilt rules/my_workflow.yaml
$ qsv quilt rules/my_analysis.yaml -o custom_output.csv

The YAML configuration file (e.g., rules/my_workflow.yaml) defines the stages and steps. For example, the Sample YAML (rules/test.yaml) below defines a pipeline that:

  1. Loads data (implicitly or explicitly via a load step in a process stage).
  2. Performs selections and a join operation across different stages.
  3. Displays the final result as a table.

Pipeline Operations in YAML

Within a Quilt YAML file, stages can be of different types to orchestrate the flow.

  • process: Executes a series of qsv operations on a dataset.
      ◦ steps: Dictionary of operations (e.g., load, select, head, showtable). Each key is a qsv command, and its value contains arguments/options.
      ◦ source (optional): Specifies the output of a previous stage as input.
  • concat: Concatenates multiple datasets (stages).
      ◦ sources: List of stage names whose outputs to concatenate.
      ◦ params.how (optional): Method for concatenation, e.g., vertical (default) or horizontal.
  • join: Joins datasets from multiple stages based on keys.
      ◦ sources: List of two stage names whose outputs to join.
      ◦ params.left_on/params.right_on or params.on: Column(s) for joining.
      ◦ params.how (optional): Join type (inner, left, outer, cross).
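
As a point of reference, here is a minimal sketch of what such a configuration might look like, mirroring the pipeline described above (load, select, join, showtable). The top-level layout, stage names, and the nesting of step arguments are assumptions based on the parameter names listed here, not the verbatim contents of rules/test.yaml:

stages:
  left:
    type: process
    steps:
      load:
        path: data1.csv
      select:
        colnames: id,col1
  right:
    type: process
    steps:
      load:
        path: data2.csv
  joined:
    type: join
    sources:
      - left
      - right
    params:
      on: id
      how: inner
  result:
    type: process
    source: joined
    steps:
      showtable: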

Installation

Pre-built Binaries

Download the latest release from GitHub Releases.

Build from Source

git clone https://github.com/sumeshi/qsv-rs.git
cd qsv-rs
cargo build --release
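# The compiled binary is written to target/release/ (as qsv, assuming the
# binary target keeps the command name used throughout this README)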

Testing

This project includes a comprehensive test suite to ensure all functionality works correctly.

Running All Tests

To run the complete test suite (132 tests covering all features):

$ python3 tests/run_tests.py

This will execute all tests for:

  • Initializers (1 test module): load
  • Chainables (17 test modules): select, isin, contains, sed, grep, head, tail, sort, count, uniq, changetz, renamecol, convert, timeline, timeslice, pivot, timeround
  • Finalizers (7 test modules): show, showtable, headers, stats, showquery, dump, partition
  • Quilters (1 test module): quilt

Running Individual Tests

You can also run individual test modules:

# Test a specific feature
$ python3 tests/test_chainables_select.py
$ python3 tests/test_finalizers_show.py
$ python3 tests/test_quilters_quilt.py

# Test using unittest module
$ python3 -m unittest tests.test_chainables_select

Test Coverage

The test suite provides 100% feature coverage:

  • All commands and options are tested
  • Various data formats and edge cases are covered
  • Error handling and validation are verified
  • Integration between different operations is tested

Adding New Tests

When adding new features or test cases:

  1. Create test files following the naming convention: test_{category}_{feature}.py
  2. Inherit from the QsvTestBase class for consistent test infrastructure (see the sketch after this list)
  3. Manually add the new test class to tests/run_tests.py in the appropriate section:
    • Add import statement at the top
    • Add test class to the corresponding list (initializers, chainables, finalizers, or quilters)
  4. Use existing fixture files in tests/fixtures/ or create new ones as needed
  5. Ensure tests are self-contained and clean up any temporary files
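
As a sketch, a new chainable test module might look like the following. The helper method names (run_qsv, fixture_path) and the import path are assumptions for illustration; use the actual helpers that QsvTestBase provides in this repository:

# tests/test_chainables_newfeature.py
import unittest
from qsv_test_base import QsvTestBase  # import path is an assumption


class TestNewFeature(QsvTestBase):
    def test_basic_usage(self):
        # Run a qsv pipeline against a shared fixture and assert on its output.
        # run_qsv() and fixture_path() are hypothetical helper names.
        output = self.run_qsv("load", self.fixture_path("sample.csv"),
                              "-", "newfeature", "arg")
        self.assertIn("expected_value", output)


if __name__ == "__main__":
    unittest.main()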

Example of adding a new test to run_tests.py:

# Add import
from test_chainables_newfeature import TestNewFeature

# Add to chainables list
chainables = [
    TestSelect,
    TestHead,
    # ... existing tests ...
    TestNewFeature,  # Add your new test here
]

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Inspired by xsv.
