duckdb_loader.py silently corrupts the last column on CRLF input files #541

@turbomam

Context

Surfaced in Copilot's 2026-04-16 review of #531.

kg_microbe/query_utils/duckdb_loader.py sets lineterminator="\n" on the pandas reader and strips \r from the header line — but not from the rest of the file. On a file with CRLF (\r\n) line endings this leaves a trailing \r in the last field of every data row, because \r is no longer treated as part of the newline.

Effects:

  • string values are silently altered ("foo" becomes "foo\r")
  • equality filtering and indexing break in ways that are hard to spot — most viewers render the \r invisibly
  • downstream joins on those columns silently drop rows
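The effects listed above can be reproduced without the loader itself. The following is a minimal sketch, assuming a stand-in parser that treats only "\n" as the terminator and strips \r from the header alone; `parse_lf_only`, the sample data, and the `labels` lookup are hypothetical and are not taken from duckdb_loader.py.

```python
# Emulates a reader that treats only "\n" as the line terminator.
# This is a stand-in for the behavior described in the issue, not the loader's code.
def parse_lf_only(text: str, sep: str = ",") -> list[dict[str, str]]:
    lines = [ln for ln in text.split("\n") if ln]
    # Mirror the reported behavior: "\r" is stripped from the header only.
    header = lines[0].rstrip("\r").split(sep)
    return [dict(zip(header, ln.split(sep))) for ln in lines[1:]]


# Hypothetical CRLF input with two data rows.
crlf_nodes = "id,category\r\nN1,taxon\r\nN2,chemical\r\n"
rows = parse_lf_only(crlf_nodes)

# The last field of every data row carries a trailing "\r".
last_values = [row["category"] for row in rows]

# Equality filtering matches nothing, because "taxon\r" != "taxon".
taxa = [row for row in rows if row["category"] == "taxon"]

# A join against clean keys silently drops every row.
labels = {"taxon": "Organism", "chemical": "Chemical entity"}
joined = [
    (row["id"], labels[row["category"]])
    for row in rows
    if row["category"] in labels
]
```

Printing `last_values` with `repr()` makes the stray \r visible, which a plain `print()` of the values usually does not.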

Suggested fix

Either:

  • drop the custom lineterminator and let pandas handle CRLF normally, or
  • normalize the file contents on load (strip \r from every line, not only the header).
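The second option might look roughly like the sketch below, which uses only the standard library so that it stays independent of the loader's pandas call. The helper name `read_rows_normalized`, the tab separator, and the UTF-8 encoding are assumptions for illustration. One caveat of this approach: it also rewrites any CRLF that appears inside a quoted field.

```python
import csv
import io


def read_rows_normalized(
    raw: bytes, sep: str = "\t", encoding: str = "utf-8"
) -> list[dict[str, str]]:
    """Parse delimited bytes after normalizing CRLF to LF across the whole file."""
    text = raw.decode(encoding)
    # Normalize every line ending, not only the header line.
    text = text.replace("\r\n", "\n")
    # newline="" keeps StringIO from doing its own newline translation.
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=sep)
    return list(reader)


# Hypothetical inputs: the same table with CRLF and with LF line endings.
crlf_bytes = b"id\tname\r\n1\tfoo\r\n2\tbar\r\n"
lf_bytes = b"id\tname\n1\tfoo\n2\tbar\n"

crlf_rows = read_rows_normalized(crlf_bytes)
lf_rows = read_rows_normalized(lf_bytes)
```

With the normalization in place, both inputs parse to the same rows, so the last column no longer depends on how the file was saved.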

The first option (dropping the custom lineterminator) is the simpler fix unless there's a concrete reason the custom terminator was introduced.

File involved

  • kg_microbe/query_utils/duckdb_loader.py

References

  • PR #531
  • Copilot review at commit 1de973d, 2026-04-16T23:15Z
