
Bug Report: CSVReader(std::istream&, CSVFormat) corrupts row["ets"] around the 10MB chunk boundary (field spans across newline) #281

@xing-cg


Summary

When parsing a large CSV via the stream constructor csv::CSVReader(std::istream&, CSVFormat), the parser returns a corrupted value for the first column (ets) at a deterministic position near the 10MB chunk boundary (around data row 61,000 in our file).

Instead of returning the numeric timestamp from the first column, row["ets"].get_sv() becomes a multi-line string that starts in the middle of a row (e.g. "4864,528,689,...") and even contains a newline plus the next record.

This breaks downstream numeric conversion and manifests as “ets not a number” in application code.

Environment

  • csv-parser version: 2.3.0 (system-installed libvincentlaucsb-csv-parser-csv)
  • OS: Rocky Linux 9.6 (Linux)
  • Compiler: GCC 15, C++23
  • Input characteristics:
    • 32 columns, header row present (header_row(0))
    • ~79k lines, ~13MB
    • First column ets is numeric in the raw file

Expected Behavior

At all rows (including around ~61640), row["ets"].get_sv() should be a numeric timestamp string like:

20250821213806000

No newline should ever appear inside a CSV field value.

Actual Behavior

At a deterministic position around the chunk boundary (data-row index n=61640, 1-based excluding header), the returned ets string becomes corrupted and spans multiple lines, e.g.:

expected_cols=32
n=61639 row.size=32 ets_sv='20250821213805500'
n=61640 row.size=32 ets_sv='4864,528,689,924,1015,938,113,87,73,104,106,4854,4852,4850,4848,4846,351,1548,1832,834,653,94,169,271,156,105
20250821213806250,1755783486604443,4856,4858,4860,4862,4864,...'

Notes:

  • row.size() is still reported as 32, but the first field is clearly incorrect (contains commas/newlines).
  • The raw CSV lines around that region are valid and have numeric ets values.
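Because row.size() stays at 32, a column-count check does not catch this corruption; the field content itself has to be validated. A minimal sketch of such a check follows. looks_like_ets is a hypothetical helper name, not part of csv-parser, and it assumes ets is always a plain run of ASCII digits as in the example value above.

```cpp
#include <algorithm>
#include <string_view>

// Hypothetical helper (not part of csv-parser): returns true only if
// `field` is a non-empty run of ASCII digits. Any comma, newline, or
// other stray byte inside the value makes it return false.
inline bool looks_like_ets(std::string_view field) {
  return !field.empty() &&
         std::all_of(field.begin(), field.end(),
                     [](unsigned char c) { return c >= '0' && c <= '9'; });
}
```

In the reproducer loop, calling looks_like_ets(row["ets"].get_sv()) on every row and printing n when it returns false would locate each corrupted row without hard-coding the 61638..61642 window.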

Minimal Reproducer (no mmap; seekable stream)

This reproducer reads the file into memory and feeds it to CSVReader(std::istringstream, fmt), so the stream is seekable and this is not related to mmap.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

#include <vincentlaucsb-csv-parser/csv.hpp>

int main() {
  const std::string path = "large.csv"; // ~13MB, 32 columns, header row

  std::ifstream in(path, std::ios::binary);
  if (!in) return 1;

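  // Read the whole file into memory so the stream given to CSVReader is seekable.
  std::string csv_data;
  in.seekg(0, std::ios::end);
  const std::streamoff file_size = in.tellg();
  if (file_size < 0) return 1; // tellg() failed
  csv_data.resize(static_cast<size_t>(file_size));
  in.seekg(0, std::ios::beg);
  in.read(csv_data.data(), static_cast<std::streamsize>(csv_data.size()));
  if (!in) return 1; // short read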

  csv::CSVFormat fmt;
  fmt.delimiter(',')
     .trim({' ', '\t', '\r'})
     .header_row(0)
     .variable_columns(csv::VariableColumnPolicy::IGNORE_ROW);
  fmt.quote('\0'); // disable quoting

  std::istringstream iss(csv_data);
  csv::CSVReader reader(iss, fmt);

  const size_t expected_cols = reader.get_col_names().size();
  std::cerr << "expected_cols=" << expected_cols << "\n";

  size_t n = 0;
  for (csv::CSVRow &row : reader) {
    ++n;
    if (n >= 61638 && n <= 61642) {
      auto sv = row["ets"].get_sv();
      std::cerr << "n=" << n << " row.size=" << row.size()
                << " ets_sv='" << std::string(sv) << "'\n";
    }
    if (n > 61650) break;
  }
}

Build (example):

g++ -std=c++23 -O2 repro.cpp -o repro -lvincentlaucsb-csv-parser-csv
./repro

Why this is a bug

ets is the first column; the raw file line at that position starts with a numeric timestamp.
The library returning a multi-line value for a field indicates row/field boundary corruption.

Suspected Root Cause (likely in CSVReader chunk/thread orchestration)

This appears around the chunk boundary and looks like the stream reader starts parsing a chunk mid-row or corrupts parser state, causing a field to include newline(s).

In our local analysis, this resembles a race in CSVReader::read_row() and its worker-thread scheduling. The code appears able to mis-handle the “worker active” signal and/or check EOF before joining the worker thread, which could lead to unexpected extra next() calls or concurrent use of the parser.
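To check whether the corrupted row really coincides with the chunk boundary, the byte offset of the boundary can be mapped back to a data-row index in the raw file. The sketch below is a hypothetical diagnostic helper, not part of csv-parser. It assumes the header is the first line, rows end in '\n', and quoting is disabled (as in the reproducer), so every '\n' is a row terminator.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical diagnostic helper (not part of csv-parser).
// Returns the 1-based data-row index (header excluded) of the line that
// contains byte `offset`. Returns 0 if the offset falls inside the header
// line or lies at/after the end of `data`.
inline std::size_t data_row_at_offset(std::string_view data, std::size_t offset) {
  if (offset >= data.size()) return 0;
  // The number of '\n' bytes before `offset` is the 0-based line index of
  // the line containing it; line 0 is the header, so line k is data row k.
  const std::string_view prefix = data.substr(0, offset);
  return static_cast<std::size_t>(
      std::count(prefix.begin(), prefix.end(), '\n'));
}
```

For example, data_row_at_offset(csv_data, 10'000'000) in the reproducer would report which data row straddles a 10,000,000-byte boundary. That exact chunk size is an assumption here; the report only establishes "around 10MB". If the result matches n=61640, that supports the chunk-boundary hypothesis.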

Additional context

We also observed, and reported separately in #280, that the mmap path can throw std::error_code from a worker thread and crash via std::terminate. This stream corruption seems to be a separate user-visible symptom, but it may share the same CSVReader scheduling root cause.

TA2601.csv
