Ultra-fast CSV parsing and encoding for Elixir. A purpose-built Rust NIF with SIMD acceleration, parallel parsing, and bounded-memory streaming. Drop-in replacement for NimbleCSV.
The Problem: CSV parsing in Elixir can be optimized further:

- Speed: Pure Elixir parsing, while well-optimized, can't match native code with SIMD acceleration for large files.
- Flexibility: Different workloads benefit from different strategies—parallel processing for huge files, streaming for unbounded data.
- Binary chunk streaming: Pure Elixir parsers are line-delimited; RustyCSV can process arbitrary binary chunks (useful for network streams, compressed data, etc.).
Why not wrap an existing Rust CSV library? The excellent csv crate is designed for Rust workflows, not BEAM integration. Wrapping it would require serializing data between Rust and Erlang formats—adding overhead and losing the benefits of direct term construction.
RustyCSV's approach: the Rust NIF is purpose-built for BEAM integration—no wrapped CSV libraries, no unnecessary abstractions—and resource-efficient at runtime, with modular features you opt into. It focuses on:
- Bounded memory streaming - Process multi-GB files with ~64KB memory footprint
- Sub-binary field references - Near-zero BEAM allocation; fields reference the input binary directly
- Multiple strategies - Choose SIMD, parallel, or streaming based on your workload
- Reduced scheduler load - Parallel strategy runs on dirty CPU schedulers
- Full NimbleCSV compatibility - Same API, drop-in replacement
| Feature | RustyCSV | Pure Elixir (NimbleCSV) |
|---|---|---|
| Parsing strategies | 3 (SIMD, parallel, streaming) | 1 |
| SIMD acceleration | ✅ via `std::simd` portable SIMD | ❌ |
| Parallel parsing | ✅ via rayon | ❌ |
| Binary chunk streaming | ✅ arbitrary chunks | ❌ line-delimited only |
| Multi-separator support | ✅ `[",", ";"]`, `"::"` | ✅ |
| Encoding support | ✅ UTF-8, UTF-16, Latin-1, UTF-32 | ✅ |
| Memory model | Sub-binary references | Sub-binary references |
| NIF encoding | ✅ Returns flat binary (same bytes, ready to use — no flattening needed) | Returns iodata list (typically flattened by caller) |
| High-performance allocator | ✅ mimalloc | System |
| Drop-in replacement | ✅ Same API | — |
| Headers-to-maps | ✅ `headers: true` or explicit keys | ❌ |
| RFC 4180 compliant | ✅ 464 tests | ✅ |
| Benchmark (7MB CSV) | ~20ms | ~215ms |
RustyCSV isn't a wrapper around an existing Rust CSV library. It's custom-built from the ground up for optimal Elixir/BEAM integration:

- Boundary-based sub-binary fields - SIMD scanner finds field boundaries, then creates BEAM sub-binary references directly (zero-copy for clean fields, copy only when unescaping `""` → `"`)
- Dirty scheduler aware - long-running parallel parses run on dirty CPU schedulers, never blocking your BEAM schedulers
- ResourceArc-based streaming - stateful parser properly integrated with BEAM's garbage collector
- Direct term building - parsing results go straight to BEAM terms; encoding writes directly to a flat binary
Choose the right tool for the job:
| Strategy | Use Case | How It Works |
|---|---|---|
| `:simd` | Default. Fastest for most files | Single-pass SIMD structural scanner via `std::simd` |
| `:parallel` | Files 500MB+ with complex quoting | Multi-threaded row parsing via rayon |
| `:streaming` | Unbounded/huge files | Bounded-memory chunk processing |
Memory Model:
All batch strategies use boundary-based sub-binaries — the SIMD scanner finds field boundaries, then creates BEAM sub-binary references that point into the original input binary. Only fields requiring quote unescaping ("" → ") are copied.
| Strategy | Memory Model | Input Binary | Best When |
|---|---|---|---|
| `:simd` | Sub-binary | Kept alive until fields GC'd | Default — fast, low memory |
| `:parallel` | Sub-binary | Kept alive until fields GC'd | Large files, many cores |
| `:streaming` | Copy (chunked) | Freed per chunk | Unbounded files |
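The sub-binary behavior is observable from the shell. A minimal sketch, assuming fields come back as sub-binary references as described above (`:binary.referenced_byte_size/1` is standard Erlang; very small inputs may behave differently):

```elixir
input = "name,age\njohn,27\n"   # 17 bytes
[[field | _] | _] = CSV.parse_string(input)

byte_size(field)                     #=> 4 ("john")
:binary.referenced_byte_size(field)  #=> 17 — the field points into the whole input binary
```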
```elixir
# Automatic strategy selection
CSV.parse_string(data)                            # Uses :simd (default)
CSV.parse_string(huge_data, strategy: :parallel)  # 500MB+ files with complex quoting
File.stream!("huge.csv") |> CSV.parse_stream()    # Bounded memory
```

```elixir
def deps do
  [{:rusty_csv, "~> 0.3.6"}]
end
```

Requires Rust nightly (for `std::simd` portable SIMD — see the note on stabilization below). Automatically compiled via Rustler.
```elixir
alias RustyCSV.RFC4180, as: CSV

# Parse CSV (skips headers by default, like NimbleCSV)
CSV.parse_string("name,age\njohn,27\njane,30\n")
#=> [["john", "27"], ["jane", "30"]]

# Include headers
CSV.parse_string(csv, skip_headers: false)
#=> [["name", "age"], ["john", "27"], ["jane", "30"]]

# Stream large files with bounded memory
"huge.csv"
|> File.stream!()
|> CSV.parse_stream()
|> Stream.each(&process_row/1)
|> Stream.run()

# Parse to maps with headers
CSV.parse_string("name,age\njohn,27\njane,30\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}, %{"name" => "jane", "age" => "30"}]

# With atom keys
CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
#=> [%{name: "john", age: "27"}]

# Dump back to CSV
CSV.dump_to_iodata([["name", "age"], ["john", "27"]])
#=> "name,age\r\njohn,27\r\n"
```

```elixir
# Before
alias NimbleCSV.RFC4180, as: CSV

# After
alias RustyCSV.RFC4180, as: CSV

# That's it. Same API, 3-9x faster on typical workloads.
```

All NimbleCSV functions are supported:
| Function | Description |
|---|---|
| `parse_string/2` | Parse CSV string to list of rows (or maps with `headers:`) |
| `parse_stream/2` | Lazily parse a stream (or maps with `headers:`) |
| `parse_enumerable/2` | Parse any enumerable |
| `dump_to_iodata/2`* | Convert rows to a flat binary (`strategy: :parallel` for quoting-heavy data) |
| `dump_to_stream/1`* | Lazily convert rows to a stream of binaries (one per row) |
| `to_line_stream/1` | Convert arbitrary chunks to lines |
| `options/0` | Return module configuration |
* NimbleCSV returns iodata lists; RustyCSV returns flat binaries (same bytes, no flattening needed).
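For example, `to_line_stream/1` reassembles arbitrarily split binary chunks (sockets, decompressors) into lines that `parse_stream/2` can consume. A sketch — the chunk boundaries and exact output below are illustrative:

```elixir
# Chunks split mid-field, e.g. from a network stream
["name,ag", "e\njohn,", "27\n"]
|> CSV.to_line_stream()
|> CSV.parse_stream(skip_headers: false)
|> Enum.to_list()
#=> [["name", "age"], ["john", "27"]]
```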
3.5x-9x faster than pure Elixir on synthetic benchmarks for typical data. Up to 18x faster on heavily quoted CSVs.
13-28% faster than pure Elixir on real-world TSV files (10K+ rows). Speedup varies by data complexity—quoted fields with escapes show the largest gains.
Run the benchmarks yourself:

```bash
mix run bench/csv_bench.exs
```

See docs/BENCHMARK.md for detailed methodology and results.
| Scenario | Recommendation |
|---|---|
| Large files (1-500MB) | ✅ Use `:simd` (default) - biggest wins |
| Very large files (500MB+) | ✅ Use `:parallel` with complex quoted data |
| Huge/unbounded files | ✅ Use `parse_stream/2` - bounded memory |
| Maximum speed | ✅ Use `:simd` (default) - sub-binary refs, 5-14x less memory |
| High-throughput APIs | ✅ Reduced scheduler load |
| Small files (<100KB) | Either works - NIF overhead negligible |
| Need pure Elixir | Use NimbleCSV |
Define parsers with custom separators and options:
```elixir
# TSV parser
RustyCSV.define(MyApp.TSV,
  separator: "\t",
  escape: "\"",
  line_separator: "\n"
)

# Pipe-separated
RustyCSV.define(MyApp.PSV,
  separator: "|",
  escape: "\"",
  line_separator: "\n"
)

MyApp.TSV.parse_string("a\tb\tc\n1\t2\t3\n")
#=> [["1", "2", "3"]]
```

| Option | Description | Default |
|---|---|---|
| `:separator` | Field separator(s) — string or list of strings (multi-byte OK) | `","` |
| `:escape` | Quote/escape sequence (multi-byte OK) | `"\""` |
| `:line_separator` | Line ending for dumps | `"\r\n"` |
| `:newlines` | Accepted line endings | `["\r\n", "\n"]` |
| `:encoding` | Character encoding (see below) | `:utf8` |
| `:trim_bom` | Remove BOM when parsing | `false` |
| `:dump_bom` | Add BOM when dumping | `false` |
| `:escape_formula` | Escape formula injection | `nil` |
| `:strategy` | Default parsing strategy | `:simd` |
For files with inconsistent delimiters (common in European locales), specify multiple separators:
```elixir
# Accept both comma and semicolon as delimiters
RustyCSV.define(MyApp.FlexibleCSV,
  separator: [",", ";"],
  escape: "\""
)

# Parse files with mixed separators
MyApp.FlexibleCSV.parse_string("a,b;c\n1;2,3\n", skip_headers: false)
#=> [["a", "b", "c"], ["1", "2", "3"]]

# Dumping uses only the FIRST separator (line_separator defaults to "\r\n")
MyApp.FlexibleCSV.dump_to_iodata([["x", "y", "z"]]) |> IO.iodata_to_binary()
#=> "x,y,z\r\n"
```

Separators and escape sequences can be multi-byte:
```elixir
# Double-colon separator
RustyCSV.define(MyApp.DoubleColon,
  separator: "::",
  escape: "\""
)

# Multi-byte escape
RustyCSV.define(MyApp.DollarEscape,
  separator: ",",
  escape: "$$"
)

# Mix single-byte and multi-byte separators
RustyCSV.define(MyApp.Mixed,
  separator: [",", "::"],
  escape: "\""
)
```

Return rows as maps instead of lists using the `:headers` option:
```elixir
# First row becomes string keys
CSV.parse_string("name,age\njohn,27\njane,30\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}, %{"name" => "jane", "age" => "30"}]

# Explicit atom keys (first row skipped by default)
CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
#=> [%{name: "john", age: "27"}]

# Explicit string keys
CSV.parse_string("name,age\njohn,27\n", headers: ["n", "a"])
#=> [%{"n" => "john", "a" => "27"}]

# Works with streaming too
"huge.csv"
|> File.stream!()
|> CSV.parse_stream(headers: true)
|> Stream.each(&process_map/1)
|> Stream.run()
```

Edge cases: fewer columns than headers fills with `nil`, extra columns are ignored, duplicate headers use the last value, empty headers become `""`.
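A quick sketch of the fill-with-`nil` rule (expected output follows from the edge cases above):

```elixir
# Row has fewer columns than the header row — missing "city" becomes nil
CSV.parse_string("name,age,city\njohn,27\n", headers: true)
#=> [%{"name" => "john", "age" => "27", "city" => nil}]
```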
Key interning is done Rust-side for `parse_string` — header terms are allocated once and reused across all rows. Streaming uses Elixir-side `Stream.transform` for map conversion.
RustyCSV supports character encoding conversion:
```elixir
# UTF-16 Little Endian (Excel/Windows exports)
RustyCSV.define(MyApp.Spreadsheet,
  separator: "\t",
  encoding: {:utf16, :little},
  trim_bom: true,
  dump_bom: true
)

# Or use the pre-defined spreadsheet parser
alias RustyCSV.Spreadsheet
Spreadsheet.parse_string(utf16_data)
```

| Encoding | Description |
|---|---|
| `:utf8` | UTF-8 (default, no conversion overhead) |
| `:latin1` | ISO-8859-1 / Latin-1 |
| `{:utf16, :little}` | UTF-16 Little Endian |
| `{:utf16, :big}` | UTF-16 Big Endian |
| `{:utf32, :little}` | UTF-32 Little Endian |
| `{:utf32, :big}` | UTF-32 Big Endian |
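As another example, a parser for semicolon-delimited Latin-1 exports — a sketch; the module name and separator choice are illustrative:

```elixir
RustyCSV.define(MyApp.LegacyCSV,
  separator: ";",
  encoding: :latin1
)

# Assumption: input is converted from Latin-1 while parsing
MyApp.LegacyCSV.parse_string(latin1_data)
```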
RustyCSV is fully RFC 4180 compliant and validated against industry-standard test suites:
| Test Suite | Tests | Status |
|---|---|---|
| csv-spectrum | 17 | ✅ All pass |
| csv-test-data | 23 | ✅ All pass |
| Edge cases (PapaParse-inspired) | 53 | ✅ All pass |
| Core + NimbleCSV compat | 36 | ✅ All pass |
| Encoding (UTF-16, Latin-1, etc.) | 20 | ✅ All pass |
| Multi-separator support | 19 | ✅ All pass |
| Multi-byte separator | 13 | ✅ All pass |
| Multi-byte escape | 12 | ✅ All pass |
| Native API separator/escape | 40 | ✅ All pass |
| Headers-to-maps | 97 | ✅ All pass |
| Custom newlines | 18 | ✅ All pass |
| Streaming safety | 12 | ✅ All pass |
| Concurrent access | 7 | ✅ All pass |
| Total | 464 | ✅ |
See docs/COMPLIANCE.md for full compliance details.
The Rust ecosystem has excellent CSV libraries like csv and polars. But wrapping them for Elixir has overhead:
- Parse CSV → Rust data structures (allocation)
- Convert Rust structs → Erlang terms (allocation + serialization)
- Return to BEAM
RustyCSV eliminates the middle step by parsing directly into BEAM terms:
- Parse CSV → Erlang terms directly (single pass)
- Return to BEAM
All batch strategies share a single-pass SIMD structural scanner that finds field boundaries; BEAM sub-binary references are then created directly from those boundaries.
| Strategy | Scanning | Term Building | Memory | Best For |
|---|---|---|---|---|
| `:simd` | SIMD structural scanner via `std::simd` | Boundary → sub-binary | O(n) | Default, fastest for most files |
| `:parallel` | SIMD structural scanner | Boundary → sub-binary | O(n) | Large files with many cores |
| `:streaming` | Byte-by-byte | Copy (chunked) | O(chunk) | Unbounded/huge files |
Shared across batch strategies (`:simd`, `:parallel`):

- Single-pass SIMD structural scanner (finds all unquoted separators and row endings in one sweep)
- Boundary-based sub-binary field references (near-zero BEAM allocation)
- Hybrid unescaping: sub-binaries for clean fields, copy only when `""` → `"` unescaping is needed
- Direct Erlang term construction via Rustler (no serde)
- mimalloc high-performance allocator

`:parallel` specifics:

- Runs on dirty CPU schedulers to avoid blocking BEAM
- Rayon workers compute boundary pairs (pure index arithmetic on the shared structural index) — no data copying
- Main thread builds BEAM sub-binary terms from boundaries (`Env` is not thread-safe, so term construction is serial)

`:streaming` specifics:

- `ResourceArc` integrates parser state with BEAM GC
- Tracks quote state across chunk boundaries
- Copies field data (since input chunks are temporary)
RustyCSV's `dump_to_iodata` returns a single flat binary rather than an iodata list. The output bytes are identical to NimbleCSV's — the flat binary is ready for use directly with `IO.binwrite/2`, `File.write/2`, or `Conn.send_resp/3` without any flattening step.

> Note: NimbleCSV returns an iodata list (nested small binaries) that callers typically flatten back into a binary. RustyCSV skips that roundtrip. Code that pattern-matches on `dump_to_iodata` results expecting a list will need adjustment — the return value is a binary, which is valid `t:iodata/0`.
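In practice (a sketch; return shapes as described above):

```elixir
csv = CSV.dump_to_iodata([["name", "age"], ["john", "27"]])

is_binary(csv)                    #=> true — already flat
IO.iodata_to_binary(csv) == csv   #=> true — flattening is a no-op
File.write!("out.csv", csv)       # binaries are valid iodata
```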
See docs/BENCHMARK.md for encoding throughput and memory numbers.
`dump_to_iodata/2` accepts a `:strategy` option:

```elixir
# Default: single-threaded flat binary encoder.
# SIMD scan for quoting, writes directly to output buffer.
# Best for most workloads.
CSV.dump_to_iodata(rows)

# Parallel: multi-threaded encoding via rayon.
# Copies field data into Rust-owned memory, encodes chunks on separate threads.
# Faster when fields frequently need quoting (commas, quotes, newlines in values).
CSV.dump_to_iodata(rows, strategy: :parallel)
```

| Encoding Strategy | Best For | Output |
|---|---|---|
| default | Most data — clean fields, moderate quoting | Single flat binary |
| `:parallel` | Quoting-heavy data (user-generated content, free-text with embedded commas/quotes/newlines) | Short list of large binaries |
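Both shapes are valid iodata; if a downstream consumer needs one flat binary regardless of strategy, normalize with `IO.iodata_to_binary/1` (a no-op for the already-flat default output):

```elixir
csv =
  rows
  |> CSV.dump_to_iodata(strategy: :parallel)
  |> IO.iodata_to_binary()  # flattens the short list of binaries
```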
RustyCSV's encoding NIF runs on BEAM dirty CPU schedulers with per-thread mimalloc arenas, making it well-suited for concurrent export workloads (e.g., thousands of users downloading CSV reports simultaneously in a Phoenix application):
```elixir
# Phoenix controller — concurrent CSV download
def export(conn, %{"id" => id}) do
  rows = MyApp.Reports.fetch_rows(id)
  csv = MyCSV.dump_to_iodata(rows)

  conn
  |> put_resp_content_type("text/csv")
  |> put_resp_header("content-disposition", ~s(attachment; filename="report.csv"))
  |> send_resp(200, csv)
end
```

For very large exports where you want bounded memory, use chunked NIF encoding:
```elixir
# Chunked encoding — bounded memory with NIF speed
def stream_export(conn, %{"id" => id}) do
  conn =
    conn
    |> put_resp_content_type("text/csv")
    |> put_resp_header("content-disposition", ~s(attachment; filename="report.csv"))
    |> send_chunked(200)

  MyApp.Reports.stream_rows(id)
  |> Stream.chunk_every(5_000)
  |> Enum.reduce_while(conn, fn rows, conn ->
    csv = MyCSV.dump_to_iodata(rows)

    # Thread the conn through each chunk; stop if the client disconnects
    case Conn.chunk(conn, csv) do
      {:ok, conn} -> {:cont, conn}
      {:error, _reason} -> {:halt, conn}
    end
  end)
end
```

Key characteristics for concurrent workloads:
- Each NIF call is independent — no shared mutable state between requests
- Dirty CPU schedulers prevent encoding from blocking normal BEAM schedulers
- mimalloc's per-thread arenas avoid allocator contention under concurrency
- The real bottleneck is typically DB queries and connection pool sizing, not CSV encoding
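Since each call is independent, exports also parallelize naturally within a single node. A sketch — `build_rows/1` is a hypothetical helper:

```elixir
# Encode many reports concurrently; each dump_to_iodata call is an
# independent NIF invocation running on a dirty CPU scheduler
report_ids
|> Task.async_stream(fn id -> MyCSV.dump_to_iodata(build_rows(id)) end,
  max_concurrency: System.schedulers_online()
)
|> Enum.map(fn {:ok, csv} -> csv end)
```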
RustyCSV is built with a modular Rust architecture:
```
native/rustycsv/src/
├── lib.rs               # NIF entry points, separator/escape decoding, dispatch
├── core/
│   ├── simd_scanner.rs  # Single-pass SIMD structural scanner (prefix-XOR quote detection)
│   ├── simd_index.rs    # StructuralIndex, RowIter, RowFieldIter, FieldIter
│   ├── scanner.rs       # Byte-level helpers (separator matching)
│   ├── field.rs         # Field extraction, quote handling
│   └── newlines.rs      # Custom newline support
├── strategy/
│   ├── direct.rs        # Basic + SIMD strategies (single-byte)
│   ├── two_phase.rs     # Indexed strategy (single-byte)
│   ├── streaming.rs     # Stateful streaming parser (single-byte)
│   ├── parallel.rs      # Rayon-based parallel parsing (single-byte)
│   ├── zero_copy.rs     # Sub-binary reference parsing (single-byte)
│   ├── general.rs       # Multi-byte separator/escape (all strategies)
│   ├── encode.rs        # SIMD field scanning, quoting helpers
│   └── encoding.rs      # UTF-8 → target encoding converters (UTF-16, Latin-1, etc.)
├── term.rs              # BEAM term building (sub-binary + copy fallback)
└── resource.rs          # ResourceArc for streaming state
```
See docs/ARCHITECTURE.md for detailed implementation notes.
The streaming parser uses bounded memory regardless of file size:
```elixir
# Process a 10GB file with ~64KB memory
File.stream!("huge.csv", [], 65_536)
|> CSV.parse_stream()
|> Stream.each(&process/1)
|> Stream.run()
```

The streaming parser enforces a maximum internal buffer size (default 256 MB) to prevent unbounded memory growth when parsing data without newlines or with very long rows. If a feed exceeds this limit, a `:buffer_overflow` exception is raised. To adjust the limit, pass `:max_buffer_size` (in bytes):

```elixir
# Increase for files with very long rows
CSV.parse_stream(stream, max_buffer_size: 512 * 1024 * 1024)

# Decrease to fail fast on malformed input
CSV.parse_stream(stream, max_buffer_size: 10 * 1024 * 1024)

# Also works on direct streaming APIs
RustyCSV.Streaming.stream_file("data.csv", max_buffer_size: 1_073_741_824)
```

RustyCSV uses mimalloc as the default allocator, providing:
- 10-20% faster allocation for many small objects
- Reduced memory fragmentation
- Zero tracking overhead in default configuration
To disable mimalloc (for exotic build targets):

```elixir
# In mix.exs
def project do
  [
    # Force local build without mimalloc
    compilers: [:rustler] ++ Mix.compilers(),
    rustler_crates: [rustycsv: [features: []]]
  ]
end
```

For profiling Rust-side memory during development and benchmarking. Not intended for production — it wraps every allocation with atomic counter updates, adding overhead. This is also the only source of `unsafe` in the codebase (required by the `GlobalAlloc` trait). Enable the `memory_tracking` feature:
```toml
# In native/rustycsv/Cargo.toml
[features]
default = ["mimalloc", "memory_tracking"]
```

Then use:

```elixir
RustyCSV.Native.reset_rust_memory_stats()
result = CSV.parse_string(large_csv)
peak = RustyCSV.Native.get_rust_memory_peak()
IO.puts("Peak Rust memory: #{peak / 1_000_000} MB")
```

When disabled (default), these functions return 0 with zero overhead.
RustyCSV uses Rust's `std::simd` portable SIMD, which currently requires nightly via `#![feature(portable_simd)]`. However, RustyCSV only uses the stabilization-safe subset of the API:

- `Simd::from_slice`, `splat`, `simd_eq`, bitwise ops (`&`, `|`, `!`)
- `Mask::to_bitmask()` for extracting bit positions

We deliberately avoid the APIs that are blocking stabilization: swizzle, scatter/gather, and lane-count generics (`LaneCount<N>: SupportedLaneCount`). The items blocking the `portable_simd` tracking issue — mask semantics, supported vector size limits, and swizzle design — are unrelated to the operations we use. When `std::simd` stabilizes, RustyCSV will work on stable Rust with no code changes.

The prefix-XOR quote detection uses a portable shift-and-xor cascade rather than architecture-specific intrinsics, keeping the entire scanner free of unsafe code. Benchmarks show no measurable difference versus intrinsics for the 16/32-bit masks used in CSV scanning.
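To illustrate the technique itself (a minimal Elixir sketch over a 16-bit mask, not RustyCSV's Rust code): the scanner builds a bitmask of quote positions per block, and the shift-and-xor cascade turns it into an inside-quotes mask, where bit i is set iff an odd number of quotes appear at positions 0..i.

```elixir
defmodule QuoteMaskDemo do
  import Bitwise

  @width 16
  @all (1 <<< @width) - 1

  # Prefix-XOR via a shift-and-xor cascade: after shifts 1, 2, 4, 8,
  # bit i of the result is the XOR of quote-mask bits 0..i.
  def inside_quotes(quote_mask) do
    Enum.reduce([1, 2, 4, 8], quote_mask, fn shift, m ->
      bxor(m, (m <<< shift) &&& @all)
    end)
  end
end

# ~s(a,"x,y",b) has quotes at byte offsets 2 and 6 → quote mask 0b01000100.
# The comma at offset 4 falls inside the resulting mask, so a structural
# scanner knows to ignore it as a separator.
QuoteMaskDemo.inside_quotes(0b01000100)
#=> 0b00111100 — offsets 2..5 are "inside quotes"
```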
```bash
# Install dependencies
mix deps.get

# Compile (includes Rust NIF)
mix compile

# Run tests (464 tests)
mix test

# Run benchmarks
mix run bench/csv_bench.exs

# Code quality
mix credo --strict
mix dialyzer
```

MIT License - see LICENSE file for details.
RustyCSV - Purpose-built Rust NIF for ultra-fast CSV parsing in Elixir.