@pinin4fjords
Member

Description

Fixes nf-core/rnaseq#1445

This PR fixes an issue where R's data.frame() function was automatically modifying sample names, causing downstream errors when trying to match sample IDs between count matrices and samplesheet metadata.

The Problem

R's data.frame() passes column names through make.names() when check.names=TRUE (the default), rewriting any name that is not a syntactically valid R name:

  • Sample names starting with numbers get an "X" prepended: 1A2 -> X1A2
  • Hyphens get converted to dots: D10-D -> D10.D
  • Other characters that are not valid in R names (spaces, for example) are also replaced with dots
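
For illustration, the renaming can be reproduced in a few lines of base R. This is a minimal sketch, not code from tximport.r; the gene names and count values are made up, and the sample names are the ones quoted above.

```r
# Sample names taken from the examples in this description
samples <- c("1A2", "D10-D")

# A made-up count matrix with one column per sample
counts <- matrix(1:4, nrow = 2,
                 dimnames = list(c("gene1", "gene2"), samples))

# Default behaviour: column names are passed through make.names()
mangled <- data.frame(counts)
print(colnames(mangled))    # "X1A2" "D10.D"

# With check.names = FALSE the names are kept verbatim
preserved <- data.frame(counts, check.names = FALSE)
print(colnames(preserved))  # "1A2" "D10-D"
```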

This caused the SUMMARIZEDEXPERIMENT process to fail with:

Error in findColumnWithAllEntries(ids, metadata) : 
  No column contains all vector entries
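
To show why renamed IDs lead to this message, here is a rough sketch of a lookup that searches the metadata for a column containing every ID. The function find_id_column() is a hypothetical stand-in written for this example; it is not the pipeline's findColumnWithAllEntries(), whose implementation is not shown in this PR, and the metadata values are invented.

```r
# Hypothetical stand-in for the ID lookup (an assumption, not the real code):
# return the first metadata column that contains every requested ID
find_id_column <- function(ids, metadata) {
  contains_all <- vapply(metadata, function(col) all(ids %in% col), logical(1))
  hits <- names(metadata)[contains_all]
  if (length(hits) == 0) stop("No column contains all vector entries")
  hits[1]
}

# Invented samplesheet metadata using the original sample names
metadata <- data.frame(sample    = c("1A2", "D10-D"),
                       condition = c("ctrl", "treated"))

# Original names are found in the "sample" column
matched <- find_id_column(c("1A2", "D10-D"), metadata)

# Names rewritten by make.names() no longer match any column
failure <- tryCatch(find_id_column(c("X1A2", "D10.D"), metadata),
                    error = function(e) conditionMessage(e))
print(matched)
print(failure)
```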

Root Cause

While PR #6638 partially fixed this by adding check.names = FALSE to the build_table() function, it missed three additional locations where data.frame() and read.csv() calls were made without this parameter.

The most critical one was at line 134, where coldata is created: this call directly sets the sample names that become column names in all output matrices.

Changes Made

Added check.names = FALSE to three function calls in tximport.r:

  1. Line 76: read.csv() when reading transcript info
  2. Line 79: data.frame() when creating extra transcript info rows
  3. Line 134: data.frame() when creating coldata (main bug fix)
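
The three kinds of call listed above might look roughly like the following. This is only a sketch of the pattern: the column headers, transcript IDs, and quant.sf paths are assumptions made for the example, and the actual code in tximport.r may be structured differently.

```r
# 1. Reading transcript info: keep headers verbatim (hypothetical headers)
tx_text <- "tx-id\tgene-id\ntx1\tgeneA\ntx2\tgeneA"
transcript_info <- read.csv(text = tx_text, sep = "\t", header = TRUE,
                            check.names = FALSE)

# 2. Building extra transcript rows with the same, unmodified column names
extra <- data.frame("tx-id" = "tx3", "gene-id" = "tx3", check.names = FALSE)
transcript_info <- rbind(transcript_info, extra)

# 3. Building coldata: sample names are carried through unchanged
sample_names <- c("1A2", "D10-D")
coldata <- data.frame(files = file.path(sample_names, "quant.sf"),  # assumed layout
                      names = sample_names,
                      check.names = FALSE)
rownames(coldata) <- coldata$names

print(colnames(transcript_info))
print(rownames(coldata))
```

Without check.names = FALSE in step 2, the extra rows would carry the names tx.id and gene.id, and rbind() would refuse to combine them with the table from step 1 because the column names would not match.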

Testing

This fix ensures that sample names are preserved exactly as provided in the input, preventing mismatches downstream. Users can now safely use:

  • Sample names starting with numbers (e.g., 1A2, 5B2)
  • Sample names with hyphens (e.g., sample-1, D10-D)
  • Any other valid sample name format
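
One way to check this kind of behaviour by hand is a write/read round trip, since read.delim() applies the same check.names default as data.frame(). The sketch below is not part of the module's test suite; it assumes a tab-separated count matrix with made-up zero counts and uses the sample names listed above.

```r
# Sample names from the list above
tricky_names <- c("1A2", "5B2", "sample-1", "D10-D")

# Made-up count table with one column per sample, names kept verbatim
counts <- data.frame(matrix(0, nrow = 2, ncol = length(tricky_names),
                            dimnames = list(NULL, tricky_names)),
                     check.names = FALSE)

# Write to a temporary TSV and read it back both ways
tmp <- tempfile(fileext = ".tsv")
write.table(counts, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

read_default   <- read.delim(tmp)                       # names are rewritten
read_preserved <- read.delim(tmp, check.names = FALSE)  # names survive
unlink(tmp)

print(colnames(read_default))    # "X1A2" "X5B2" "sample.1" "D10.D"
print(colnames(read_preserved))  # "1A2" "5B2" "sample-1" "D10-D"
```

The round trip also shows that any later script reading these files needs check.names = FALSE as well, otherwise the names are rewritten again at read time.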

PR checklist

  • This comment contains a description of changes (with reason)
  • If you've fixed a bug or added code that should be tested, add tests!
  • Ensure the test suite passes (nf-test test path/to/test.nf.test)
  • Usage Documentation in docs/usage.md is updated
  • Output Documentation in docs/output.md is updated
  • CHANGELOG.md is updated
  • README.md is updated (including new tool citations and authors/contributors)

Addresses nf-core/rnaseq#1445

R's data.frame() function automatically modifies column names when
check.names=TRUE (the default), which causes issues with sample names that:
- Start with numbers (prepends "X": "1A2" -> "X1A2")
- Contain special characters like hyphens (converts to dots: "D10-D" -> "D10.D")

This caused the downstream summarizedexperiment script to fail when trying
to match sample IDs from count matrices against the samplesheet metadata,
as the names no longer matched.

PR #6638 partially fixed this issue by adding check.names=FALSE to the
build_table() function, but missed three additional data.frame() and
read.csv() calls that also needed this parameter.

This commit adds check.names=FALSE to:
1. Line 76: read.csv() when reading transcript info
2. Line 79: data.frame() when creating extra transcript info rows
3. Line 134: data.frame() when creating coldata (the main bug)

The coldata fix (line 134) is the most critical as it directly affects
sample names that become column names in the output matrices.
Member

@JoseEspinosa JoseEspinosa left a comment

LGTM

@pinin4fjords pinin4fjords added this pull request to the merge queue Nov 14, 2025
@pinin4fjords
Member Author

Thanks @JoseEspinosa !

Contributor

@SPPearce SPPearce left a comment

Is this actually going to fix the problem downstream?
Any other R package used is likely to fail in this manner.

Merged via the queue into master with commit d205ebc Nov 14, 2025
14 checks passed
@pinin4fjords pinin4fjords deleted the fix-tximport-sample-names branch November 14, 2025 15:00
@pinin4fjords
Member Author

> Is this actually going to fix the problem downstream? Any other R package used is likely to fail in this manner.

I've already spent some time nobbling other instances, hoping this will catch the remainder

Development

Successfully merging this pull request may close these issues.

Error in findColumnWithAllEntries(ids, metadata) : No column contains all vector entries

4 participants