Description
Bug Report: CSVReader(std::istream&, CSVFormat) corrupts row["ets"] around the 10MB chunk boundary (field spans across newline)
Summary
When parsing a large CSV via the stream constructor csv::CSVReader(std::istream&, CSVFormat), the parser returns a corrupted value for the first column (ets) at a deterministic position near the 10MB chunk boundary (around row 61k in our file).
Instead of returning the numeric timestamp from the first column, row["ets"].get_sv() becomes a multi-line string that starts in the middle of a row (e.g. "4864,528,689,...") and even contains a newline plus the next record.
This breaks downstream numeric conversion and manifests as “ets not a number” in application code.
Environment
- csv-parser version: 2.3.0 (system-installed libvincentlaucsb-csv-parser-csv)
- OS: Rocky Linux 9.6 (Linux)
- Compiler: GCC 15, C++23
- Input characteristics:
  - 32 columns, header row present (header_row(0))
  - ~79k lines, ~13MB
  - First column ets is numeric in the raw file
Expected Behavior
For every row (including those around 61640), row["ets"].get_sv() should be a numeric timestamp string like:
20250821213806000
No newline should ever appear inside a CSV field value.
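The expectation above can be turned into a strict check in application code. The following is a minimal sketch, not part of csv-parser: parse_ets is a hypothetical helper name, and it assumes the timestamp fits in an unsigned 64-bit integer. It accepts a field only if std::from_chars consumes it completely, so a value containing commas or newlines is rejected instead of being silently truncated.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not part of csv-parser): parse a field as an unsigned
// decimal timestamp, rejecting anything that is not digits from start to end.
std::optional<std::uint64_t> parse_ets(std::string_view field) {
    std::uint64_t value = 0;
    const char* first = field.data();
    const char* last = first + field.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    // Require success and that the whole field was consumed, so embedded
    // commas or newlines count as errors rather than being ignored.
    if (field.empty() || ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}
```

Calling parse_ets(row["ets"].get_sv()) inside the reader loop would flag the corrupted row at the point where it first appears, which is how the “ets not a number” symptom surfaces.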
Actual Behavior
At a deterministic position around the chunk boundary (data-row index n=61640, 1-based excluding header), the returned ets string becomes corrupted and spans multiple lines, e.g.:
expected_cols=32
n=61639 row.size=32 ets_sv='20250821213805500'
n=61640 row.size=32 ets_sv='4864,528,689,924,1015,938,113,87,73,104,106,4854,4852,4850,4848,4846,351,1548,1832,834,653,94,169,271,156,105
20250821213806250,1755783486604443,4856,4858,4860,4862,4864,...'
Notes:
- row.size() is still reported as 32, but the first field is clearly incorrect (contains commas/newlines).
- The raw CSV lines around that region are valid and have numeric ets values.
Minimal Reproducer (no mmap; seekable stream)
This reproducer reads the file into memory and feeds it to CSVReader(std::istringstream, fmt), so the stream is seekable and this is not related to mmap.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vincentlaucsb-csv-parser/csv.hpp>

int main() {
    const std::string path = "large.csv"; // ~13MB, 32 columns, header row
    std::ifstream in(path, std::ios::binary);
    if (!in) return 1;

    std::string csv_data;
    in.seekg(0, std::ios::end);
    csv_data.resize((size_t)in.tellg());
    in.seekg(0, std::ios::beg);
    in.read(csv_data.data(), (std::streamsize)csv_data.size());

    csv::CSVFormat fmt;
    fmt.delimiter(',')
       .trim({' ', '\t', '\r'})
       .header_row(0)
       .variable_columns(csv::VariableColumnPolicy::IGNORE_ROW);
    fmt.quote('\0'); // disable quoting

    std::istringstream iss(csv_data);
    csv::CSVReader reader(iss, fmt);
    const size_t expected_cols = reader.get_col_names().size();
    std::cerr << "expected_cols=" << expected_cols << "\n";

    size_t n = 0;
    for (csv::CSVRow &row : reader) {
        ++n;
        if (n >= 61638 && n <= 61642) {
            auto sv = row["ets"].get_sv();
            std::cerr << "n=" << n << " row.size=" << row.size()
                      << " ets_sv='" << std::string(sv) << "'\n";
        }
        if (n > 61650) break;
    }
}
Build (example):
g++ -std=c++23 -O2 repro.cpp -o repro -lvincentlaucsb-csv-parser-csv
./repro
Why this is a bug
ets is the first column; the raw file line at that position starts with a numeric timestamp.
The library returning a multi-line value for a field indicates row/field boundary corruption.
Suspected Root Cause (likely in CSVReader chunk/thread orchestration)
This appears around the chunk boundary and looks like the stream reader starts parsing a chunk mid-row or corrupts parser state, causing a field to include newline(s).
In our local analysis, this resembles a race in CSVReader::read_row() / worker-thread scheduling: the code may mishandle the “worker active” signal, or check EOF before joining the worker thread. Either could lead to unexpected extra next() calls or concurrent use of the parser.
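To check whether the corrupted row really coincides with a chunk boundary, the byte offset of the boundary can be mapped back to a line number in the raw data. The sketch below uses a hypothetical helper, line_at_offset, and the boundary offset passed to it is an assumption: this report only knows the chunk size as "10MB", and the exact constant used by the library should be taken from its source.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical helper: 1-based line number (header = line 1) of the line that
// contains byte `offset` in `data`. Subtract 1 to get the data-row index used
// in the log output of this report.
std::size_t line_at_offset(std::string_view data, std::size_t offset) {
    // substr clamps the length, so an offset past the end scans the whole buffer.
    const std::string_view head = data.substr(0, offset);
    // Each '\n' strictly before `offset` terminates one earlier line.
    return 1 + static_cast<std::size_t>(
        std::count(head.begin(), head.end(), '\n'));
}
```

With the reproducer's csv_data buffer, comparing line_at_offset(csv_data, 10'000'000) - 1 against n=61640 would show whether the first bad row is the one straddling the assumed boundary; a match would support the chunk-boundary hypothesis, while a mismatch would suggest the offset assumption is wrong.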
Additional context
We also observed, and reported separately in #280, that the mmap path can throw std::error_code from a worker thread and crash via std::terminate. This stream corruption seems to be a separate user-visible symptom, but it may share the same CSVReader scheduling root cause.