RustyCSV

Ultra-fast CSV parsing and encoding for Elixir. A purpose-built Rust NIF with SIMD acceleration, parallel parsing, and bounded-memory streaming. Drop-in replacement for NimbleCSV.

Why RustyCSV?

The Problem: CSV parsing in Elixir leaves three things on the table:

  1. Speed: Pure Elixir parsing, while well-optimized, can't match native code with SIMD acceleration on large files.

  2. Flexibility: Different workloads benefit from different strategies: parallel processing for huge files, streaming for unbounded data.

  3. Binary chunk streaming: Pure Elixir streaming is line-delimited; arbitrary binary chunks (network streams, compressed data, etc.) must first be converted to lines.

Why not wrap an existing Rust CSV library? The excellent csv crate is designed for Rust workflows, not BEAM integration. Wrapping it would require serializing data between Rust and Erlang formats—adding overhead and losing the benefits of direct term construction.

RustyCSV's approach: a Rust NIF purpose-built for BEAM integration. No wrapped CSV libraries, no unnecessary abstractions, and resource-efficient, modular features you opt into. It focuses on:

  1. Bounded memory streaming - Process multi-GB files with ~64KB memory footprint
  2. Sub-binary field references - Near-zero BEAM allocation; fields reference the input binary directly
  3. Multiple strategies - Choose SIMD, parallel, or streaming based on your workload
  4. Reduced scheduler load - Parallel strategy runs on dirty CPU schedulers
  5. Full NimbleCSV compatibility - Same API, drop-in replacement

Feature Comparison

| Feature | RustyCSV | Pure Elixir (NimbleCSV) |
| --- | --- | --- |
| Parsing strategies | 3 (SIMD, parallel, streaming) | 1 |
| SIMD acceleration | ✅ via std::simd portable SIMD | ❌ |
| Parallel parsing | ✅ via rayon | ❌ |
| Binary chunk streaming | ✅ arbitrary chunks | ❌ line-delimited only |
| Multi-separator support | ✅ [",", ";"], "::" | ✅ |
| Encoding support | ✅ UTF-8, UTF-16, Latin-1, UTF-32 | ✅ |
| Memory model | Sub-binary references | Sub-binary references |
| NIF encoding | ✅ Returns flat binary (same bytes, ready to use, no flattening needed) | Returns iodata list (typically flattened by caller) |
| High-performance allocator | ✅ mimalloc | System |
| Drop-in replacement | ✅ Same API | - |
| Headers-to-maps | ✅ headers: true or explicit keys | ❌ |
| RFC 4180 compliant | ✅ 464 tests | ✅ |
| Benchmark (7MB CSV) | ~20ms | ~215ms |

Purpose-Built for Elixir

RustyCSV isn't a wrapper around an existing Rust CSV library. It's custom-built from the ground up for optimal Elixir/BEAM integration:

  • Boundary-based sub-binary fields - SIMD scanner finds field boundaries, then creates BEAM sub-binary references directly (zero-copy for clean fields, copy only when unescaping """)
  • Dirty scheduler aware - long-running parallel parses run on dirty CPU schedulers, never blocking your BEAM schedulers
  • ResourceArc-based streaming - stateful parser properly integrated with BEAM's garbage collector
  • Direct term building - parsing results go straight to BEAM terms; encoding writes directly to a flat binary
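
You can observe the sub-binary model from the Elixir side. A minimal sketch, assuming the default :simd strategy returns sub-binary fields as described above; :binary.referenced_byte_size/1 reports the size of the underlying binary a sub-binary points into:

alias RustyCSV.RFC4180, as: CSV

input = "name,age\n" <> String.duplicate("row,1\n", 10_000)
[[field | _] | _] = CSV.parse_string(input)

byte_size(field)                     #=> 3 ("row")
:binary.referenced_byte_size(field)  #=> 60_009 (the field shares the whole input binary)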

Parsing Strategies

Choose the right tool for the job:

| Strategy | Use Case | How It Works |
| --- | --- | --- |
| :simd | Default. Fastest for most files | Single-pass SIMD structural scanner via std::simd |
| :parallel | Files 500MB+ with complex quoting | Multi-threaded row parsing via rayon |
| :streaming | Unbounded/huge files | Bounded-memory chunk processing |

Memory Model:

All batch strategies use boundary-based sub-binaries — the SIMD scanner finds field boundaries, then creates BEAM sub-binary references that point into the original input binary. Only fields requiring quote unescaping (""") are copied.

| Strategy | Memory Model | Input Binary | Best When |
| --- | --- | --- | --- |
| :simd | Sub-binary | Kept alive until fields GC'd | Default: fast, low memory |
| :parallel | Sub-binary | Kept alive until fields GC'd | Large files, many cores |
| :streaming | Copy (chunked) | Freed per chunk | Unbounded files |

# Automatic strategy selection
CSV.parse_string(data)                           # Uses :simd (default)
CSV.parse_string(huge_data, strategy: :parallel) # 500MB+ files with complex quoting
File.stream!("huge.csv") |> CSV.parse_stream()   # Bounded memory

Installation

def deps do
  [{:rusty_csv, "~> 0.3.6"}]
end

Requires Rust nightly (for std::simd portable SIMD; see the SIMD and Rust Nightly section below). Automatically compiled via Rustler.

Quick Start

alias RustyCSV.RFC4180, as: CSV

# Parse CSV (skips headers by default, like NimbleCSV)
CSV.parse_string("name,age\njohn,27\njane,30\n")
#=> [["john", "27"], ["jane", "30"]]

# Include headers
CSV.parse_string(csv, skip_headers: false)
#=> [["name", "age"], ["john", "27"], ["jane", "30"]]

# Stream large files with bounded memory
"huge.csv"
|> File.stream!()
|> CSV.parse_stream()
|> Stream.each(&process_row/1)
|> Stream.run()

# Parse to maps with headers
CSV.parse_string("name,age\njohn,27\njane,30\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}, %{"name" => "jane", "age" => "30"}]

# With atom keys
CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
#=> [%{name: "john", age: "27"}]

# Dump back to CSV
CSV.dump_to_iodata([["name", "age"], ["john", "27"]])
#=> "name,age\r\njohn,27\r\n"

Drop-in NimbleCSV Replacement

# Before
alias NimbleCSV.RFC4180, as: CSV

# After
alias RustyCSV.RFC4180, as: CSV

# That's it. Same API, 3.5-9x faster on typical workloads.

All NimbleCSV functions are supported:

| Function | Description |
| --- | --- |
| parse_string/2 | Parse CSV string to list of rows (or maps with headers:) |
| parse_stream/2 | Lazily parse a stream (or maps with headers:) |
| parse_enumerable/2 | Parse any enumerable |
| dump_to_iodata/2 * | Convert rows to a flat binary (strategy: :parallel for quoting-heavy data) |
| dump_to_stream/1 * | Lazily convert rows to a stream of binaries (one per row) |
| to_line_stream/1 | Convert arbitrary chunks to lines |
| options/0 | Return module configuration |

* NimbleCSV returns iodata lists; RustyCSV returns flat binaries (same bytes, no flattening needed).
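
For example, to_line_stream/1 (kept for NimbleCSV compatibility) turns a stream of arbitrary binary chunks into a stream of lines. A minimal sketch; the 2 KB chunk size is purely illustrative:

# Raw byte chunks from disk, not line-delimited
"data.csv"
|> File.stream!([], 2048)
|> CSV.to_line_stream()
|> CSV.parse_stream()
|> Enum.take(5)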

Benchmarks

3.5x-9x faster than pure Elixir on synthetic benchmarks for typical data. Up to 18x faster on heavily quoted CSVs.

13-28% faster than pure Elixir on real-world TSV files (10K+ rows). Speedup varies by data complexity—quoted fields with escapes show the largest gains.

mix run bench/csv_bench.exs

See docs/BENCHMARK.md for detailed methodology and results.

When to Use RustyCSV

| Scenario | Recommendation |
| --- | --- |
| Large files (1-500MB) | ✅ Use :simd (default) - biggest wins |
| Very large files (500MB+) | ✅ Use :parallel with complex quoted data |
| Huge/unbounded files | ✅ Use parse_stream/2 - bounded memory |
| Maximum speed | ✅ Use :simd (default) - sub-binary refs, 5-14x less memory |
| High-throughput APIs | ✅ Reduced scheduler load |
| Small files (<100KB) | Either works - NIF overhead negligible |
| Need pure Elixir | Use NimbleCSV |

Custom Parsers

Define parsers with custom separators and options:

# TSV parser
RustyCSV.define(MyApp.TSV,
  separator: "\t",
  escape: "\"",
  line_separator: "\n"
)

# Pipe-separated
RustyCSV.define(MyApp.PSV,
  separator: "|",
  escape: "\"",
  line_separator: "\n"
)

MyApp.TSV.parse_string("a\tb\tc\n1\t2\t3\n")
#=> [["1", "2", "3"]]

Define Options

| Option | Description | Default |
| --- | --- | --- |
| :separator | Field separator(s): string or list of strings (multi-byte OK) | "," |
| :escape | Quote/escape sequence (multi-byte OK) | "\"" |
| :line_separator | Line ending for dumps | "\r\n" |
| :newlines | Accepted line endings | ["\r\n", "\n"] |
| :encoding | Character encoding (see below) | :utf8 |
| :trim_bom | Remove BOM when parsing | false |
| :dump_bom | Add BOM when dumping | false |
| :escape_formula | Escape formula injection | nil |
| :strategy | Default parsing strategy | :simd |
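
For instance, :newlines widens the set of line endings accepted when parsing. A sketch (MyApp.AnyNewline is a hypothetical module name):

# Also accept bare \r line endings when parsing
RustyCSV.define(MyApp.AnyNewline,
  separator: ",",
  escape: "\"",
  newlines: ["\r\n", "\n", "\r"]
)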

Multi-Separator Support

For files with inconsistent delimiters (common in European locales), specify multiple separators:

# Accept both comma and semicolon as delimiters
RustyCSV.define(MyApp.FlexibleCSV,
  separator: [",", ";"],
  escape: "\""
)

# Parse files with mixed separators
MyApp.FlexibleCSV.parse_string("a,b;c\n1;2,3\n", skip_headers: false)
#=> [["a", "b", "c"], ["1", "2", "3"]]

# Dumping uses only the FIRST separator
MyApp.FlexibleCSV.dump_to_iodata([["x", "y", "z"]])
#=> "x,y,z\r\n"

Separators and escape sequences can be multi-byte:

# Double-colon separator
RustyCSV.define(MyApp.DoubleColon,
  separator: "::",
  escape: "\""
)

# Multi-byte escape
RustyCSV.define(MyApp.DollarEscape,
  separator: ",",
  escape: "$$"
)

# Mix single-byte and multi-byte separators
RustyCSV.define(MyApp.Mixed,
  separator: [",", "::"],
  escape: "\""
)

Headers-to-Maps

Return rows as maps instead of lists using the :headers option:

# First row becomes string keys
CSV.parse_string("name,age\njohn,27\njane,30\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}, %{"name" => "jane", "age" => "30"}]

# Explicit atom keys (first row skipped by default)
CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
#=> [%{name: "john", age: "27"}]

# Explicit string keys
CSV.parse_string("name,age\njohn,27\n", headers: ["n", "a"])
#=> [%{"n" => "john", "a" => "27"}]

# Works with streaming too
"huge.csv"
|> File.stream!()
|> CSV.parse_stream(headers: true)
|> Stream.each(&process_map/1)
|> Stream.run()

Edge cases: rows with fewer columns than headers are filled with nil, extra columns are ignored, duplicate headers keep the last value, and empty headers become "".
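
A quick illustration of those rules with ragged rows (derived from the documented behavior above):

CSV.parse_string("a,b,c\n1,2\n1,2,3,4\n", headers: true)
#=> [
#=>   %{"a" => "1", "b" => "2", "c" => nil},  # short row filled with nil
#=>   %{"a" => "1", "b" => "2", "c" => "3"}   # extra column "4" ignored
#=> ]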

Key interning is done Rust-side for parse_string — header terms are allocated once and reused across all rows. Streaming uses Elixir-side Stream.transform for map conversion.

Encoding Support

RustyCSV supports character encoding conversion:

# UTF-16 Little Endian (Excel/Windows exports)
RustyCSV.define(MyApp.Spreadsheet,
  separator: "\t",
  encoding: {:utf16, :little},
  trim_bom: true,
  dump_bom: true
)

# Or use the pre-defined spreadsheet parser
alias RustyCSV.Spreadsheet
Spreadsheet.parse_string(utf16_data)

| Encoding | Description |
| --- | --- |
| :utf8 | UTF-8 (default, no conversion overhead) |
| :latin1 | ISO-8859-1 / Latin-1 |
| {:utf16, :little} | UTF-16 Little Endian |
| {:utf16, :big} | UTF-16 Big Endian |
| {:utf32, :little} | UTF-32 Little Endian |
| {:utf32, :big} | UTF-32 Big Endian |

RFC 4180 Compliance

RustyCSV is fully RFC 4180 compliant and validated against industry-standard test suites:

| Test Suite | Tests | Status |
| --- | --- | --- |
| csv-spectrum | 17 | ✅ All pass |
| csv-test-data | 23 | ✅ All pass |
| Edge cases (PapaParse-inspired) | 53 | ✅ All pass |
| Core + NimbleCSV compat | 36 | ✅ All pass |
| Encoding (UTF-16, Latin-1, etc.) | 20 | ✅ All pass |
| Multi-separator support | 19 | ✅ All pass |
| Multi-byte separator | 13 | ✅ All pass |
| Multi-byte escape | 12 | ✅ All pass |
| Native API separator/escape | 40 | ✅ All pass |
| Headers-to-maps | 97 | ✅ All pass |
| Custom newlines | 18 | ✅ All pass |
| Streaming safety | 12 | ✅ All pass |
| Concurrent access | 7 | ✅ All pass |
| Total | 464 | |

See docs/COMPLIANCE.md for full compliance details.

How It Works

Why Not Wrap the Rust csv Crate?

The Rust ecosystem has excellent CSV libraries like csv and polars. But wrapping them for Elixir has overhead:

  1. Parse CSV → Rust data structures (allocation)
  2. Convert Rust structs → Erlang terms (allocation + serialization)
  3. Return to BEAM

RustyCSV eliminates the middle step by parsing directly into BEAM terms:

  1. Parse CSV → Erlang terms directly (single pass)
  2. Return to BEAM

Strategy Implementations

All batch strategies share a single-pass SIMD structural scanner that finds field boundaries; each then builds BEAM sub-binary references directly.

| Strategy | Scanning | Term Building | Memory | Best For |
| --- | --- | --- | --- | --- |
| :simd | SIMD structural scanner via std::simd | Boundary → sub-binary | O(n) | Default, fastest for most files |
| :parallel | SIMD structural scanner | Boundary → sub-binary | O(n) | Large files with many cores |
| :streaming | Byte-by-byte | Copy (chunked) | O(chunk) | Unbounded/huge files |

Shared across batch strategies (:simd, :parallel):

  • Single-pass SIMD structural scanner (finds all unquoted separators and row endings in one sweep)
  • Boundary-based sub-binary field references (near-zero BEAM allocation)
  • Hybrid unescaping: sub-binaries for clean fields, copy only when """ unescaping needed
  • Direct Erlang term construction via Rustler (no serde)
  • mimalloc high-performance allocator

:parallel specifics:

  • Runs on dirty CPU schedulers to avoid blocking BEAM
  • Rayon workers compute boundary pairs (pure index arithmetic on the shared structural index) — no data copying
  • Main thread builds BEAM sub-binary terms from boundaries (Env is not thread-safe, so term construction is serial)

:streaming specifics:

  • ResourceArc integrates parser state with BEAM GC
  • Tracks quote state across chunk boundaries
  • Copies field data (since input chunks are temporary)

NIF-Accelerated Encoding

RustyCSV's dump_to_iodata returns a single flat binary rather than an iodata list. The output bytes are identical to NimbleCSV — the flat binary is ready for use directly with IO.binwrite/2, File.write/2, or Conn.send_resp/3 without any flattening step.

Note: NimbleCSV returns an iodata list (nested small binaries) that callers typically flatten back into a binary. RustyCSV skips that roundtrip. Code that pattern-matches on dump_to_iodata expecting a list will need adjustment — the return value is a binary, which is valid t:iodata/0.
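
A short sketch of what that means in practice:

csv = CSV.dump_to_iodata([["name", "age"], ["john", "27"]])

is_binary(csv)               #=> true (a flat binary, not a nested iodata list)
File.write!("out.csv", csv)  # no IO.iodata_to_binary/1 round-trip needed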

See docs/BENCHMARK.md for encoding throughput and memory numbers.

Encoding Strategies

dump_to_iodata/2 accepts a :strategy option:

# Default: single-threaded flat binary encoder.
# SIMD scan for quoting, writes directly to output buffer.
# Best for most workloads.
CSV.dump_to_iodata(rows)

# Parallel: multi-threaded encoding via rayon.
# Copies field data into Rust-owned memory, encodes chunks on separate threads.
# Faster when fields frequently need quoting (commas, quotes, newlines in values).
CSV.dump_to_iodata(rows, strategy: :parallel)

| Encoding Strategy | Best For | Output |
| --- | --- | --- |
| default | Most data: clean fields, moderate quoting | Single flat binary |
| :parallel | Quoting-heavy data (user-generated content, free-text with embedded commas/quotes/newlines) | Short list of large binaries |

High-Throughput Concurrent Exports

RustyCSV's encoding NIF runs on BEAM dirty CPU schedulers with per-thread mimalloc arenas, making it well-suited for concurrent export workloads (e.g., thousands of users downloading CSV reports simultaneously in a Phoenix application):

# Phoenix controller — concurrent CSV download
def export(conn, %{"id" => id}) do
  rows = MyApp.Reports.fetch_rows(id)
  csv = MyCSV.dump_to_iodata(rows)

  conn
  |> put_resp_content_type("text/csv")
  |> put_resp_header("content-disposition", ~s(attachment; filename="report.csv"))
  |> send_resp(200, csv)
end

For very large exports where you want bounded memory, use chunked NIF encoding:

# Chunked encoding — bounded memory with NIF speed
def stream_export(conn, %{"id" => id}) do
  conn =
    conn
    |> put_resp_content_type("text/csv")
    |> put_resp_header("content-disposition", ~s(attachment; filename="report.csv"))
    |> send_chunked(200)

  MyApp.Reports.stream_rows(id)
  |> Stream.chunk_every(5_000)
  |> Enum.reduce_while(conn, fn batch, conn ->
    case Conn.chunk(conn, MyCSV.dump_to_iodata(batch)) do
      {:ok, conn} -> {:cont, conn}
      # Stop streaming if the client disconnects
      {:error, _reason} -> {:halt, conn}
    end
  end)
end

Key characteristics for concurrent workloads:

  • Each NIF call is independent — no shared mutable state between requests
  • Dirty CPU schedulers prevent encoding from blocking normal BEAM schedulers
  • mimalloc's per-thread arenas avoid allocator contention under concurrency
  • The real bottleneck is typically DB queries and connection pool sizing, not CSV encoding
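
Because each call is stateless, plain Task.async_stream fan-out works for bulk exports. A sketch reusing the MyCSV module and MyApp.Reports.fetch_rows/1 from the controller example above:

# Encode many reports concurrently; each dump_to_iodata call is independent
report_ids
|> Task.async_stream(
  fn id -> id |> MyApp.Reports.fetch_rows() |> MyCSV.dump_to_iodata() end,
  max_concurrency: System.schedulers_online(),
  timeout: :infinity
)
|> Enum.map(fn {:ok, csv} -> csv end)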

Architecture

RustyCSV is built with a modular Rust architecture:

native/rustycsv/src/
├── lib.rs                 # NIF entry points, separator/escape decoding, dispatch
├── core/
│   ├── simd_scanner.rs    # Single-pass SIMD structural scanner (prefix-XOR quote detection)
│   ├── simd_index.rs      # StructuralIndex, RowIter, RowFieldIter, FieldIter
│   ├── scanner.rs         # Byte-level helpers (separator matching)
│   ├── field.rs           # Field extraction, quote handling
│   └── newlines.rs        # Custom newline support
├── strategy/
│   ├── direct.rs          # Basic + SIMD strategies (single-byte)
│   ├── two_phase.rs       # Indexed strategy (single-byte)
│   ├── streaming.rs       # Stateful streaming parser (single-byte)
│   ├── parallel.rs        # Rayon-based parallel parsing (single-byte)
│   ├── zero_copy.rs       # Sub-binary reference parsing (single-byte)
│   ├── general.rs         # Multi-byte separator/escape (all strategies)
│   ├── encode.rs          # SIMD field scanning, quoting helpers
│   └── encoding.rs        # UTF-8 → target encoding converters (UTF-16, Latin-1, etc.)
├── term.rs                # BEAM term building (sub-binary + copy fallback)
└── resource.rs            # ResourceArc for streaming state

See docs/ARCHITECTURE.md for detailed implementation notes.

Memory Efficiency

The streaming parser uses bounded memory regardless of file size:

# Process a 10GB file with ~64KB memory
File.stream!("huge.csv", [], 65_536)
|> CSV.parse_stream()
|> Stream.each(&process/1)
|> Stream.run()

Streaming Buffer Limit

The streaming parser enforces a maximum internal buffer size (default 256 MB) to prevent unbounded memory growth when parsing data without newlines or with very long rows. If a feed exceeds this limit, a :buffer_overflow exception is raised.

To adjust the limit, pass :max_buffer_size (in bytes):

# Increase for files with very long rows
CSV.parse_stream(stream, max_buffer_size: 512 * 1024 * 1024)

# Decrease to fail fast on malformed input
CSV.parse_stream(stream, max_buffer_size: 10 * 1024 * 1024)

# Also works on direct streaming APIs
RustyCSV.Streaming.stream_file("data.csv", max_buffer_size: 1_073_741_824)

High-Performance Allocator

RustyCSV uses mimalloc as the default allocator, providing:

  • 10-20% faster allocation for many small objects
  • Reduced memory fragmentation
  • Zero tracking overhead in default configuration

To disable mimalloc (for exotic build targets):

# In mix.exs
def project do
  [
    # Force local build without mimalloc
    compilers: [:rustler] ++ Mix.compilers(),
    rustler_crates: [rustycsv: [features: []]]
  ]
end

Optional Memory Tracking (Benchmarking Only)

For profiling Rust-side memory during development and benchmarking. Not intended for production — it wraps every allocation with atomic counter updates, adding overhead. This is also the only source of unsafe in the codebase (required by the GlobalAlloc trait). Enable the memory_tracking feature:

# In native/rustycsv/Cargo.toml
[features]
default = ["mimalloc", "memory_tracking"]

Then use:

RustyCSV.Native.reset_rust_memory_stats()
result = CSV.parse_string(large_csv)
peak = RustyCSV.Native.get_rust_memory_peak()
IO.puts("Peak Rust memory: #{peak / 1_000_000} MB")

When disabled (default), these functions return 0 with zero overhead.

SIMD and Rust Nightly

RustyCSV uses Rust's std::simd portable SIMD, which currently requires nightly via #![feature(portable_simd)]. However, RustyCSV only uses the stabilization-safe subset of the API:

  • Simd::from_slice, splat, simd_eq, bitwise ops (&, |, !)
  • Mask::to_bitmask() for extracting bit positions

We deliberately avoid the APIs that are blocking stabilization: swizzle, scatter/gather, and lane-count generics (LaneCount<N>: SupportedLaneCount). The items blocking the portable_simd tracking issue — mask semantics, supported vector size limits, and swizzle design — are unrelated to the operations we use. When std::simd stabilizes, RustyCSV will work on stable Rust with no code changes.

The prefix-XOR quote detection uses a portable shift-and-xor cascade rather than architecture-specific intrinsics, keeping the entire scanner free of unsafe code. Benchmarks show no measurable difference for the 16/32-bit masks used in CSV scanning.
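
To make the prefix-XOR idea concrete, here is the same cascade sketched in Elixir over a 64-bit quote mask (illustrative only; the real implementation lives in the Rust scanner, core/simd_scanner.rs):

import Bitwise

# Carry-less prefix XOR: output bit i = XOR of input bits 0..i.
# Fed a mask of quote positions, it yields an "inside quotes" mask.
mask64 = 0xFFFFFFFFFFFFFFFF

inside_quotes = fn quote_mask ->
  Enum.reduce([1, 2, 4, 8, 16, 32], quote_mask, fn shift, m ->
    bxor(m, m <<< shift) &&& mask64
  end)
end

# "a\"b,c\"d" has quotes at bit positions 1 and 5 (LSB = byte 0)
inside_quotes.(0b100010)  #=> 0b011110 (bytes 1..4 lie inside the quoted region)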

Development

# Install dependencies
mix deps.get

# Compile (includes Rust NIF)
mix compile

# Run tests (464 tests)
mix test

# Run benchmarks
mix run bench/csv_bench.exs

# Code quality
mix credo --strict
mix dialyzer

License

MIT License - see LICENSE file for details.


RustyCSV - Purpose-built Rust NIF for ultra-fast CSV parsing in Elixir.
