Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to incorporate additional metadata from provided SAM header file #95

Merged
merged 27 commits into from
Jul 12, 2024

Conversation

tkchafin
Copy link
Contributor

@tkchafin tkchafin commented Jun 28, 2024

Added a new step just before CRAM generation which adds additional metadata to the output alignment files from the optional --header argument. The default behaviour is that this replaces the @hd and @sq lines (i.e., keeping any defined read groups etc), which can be defined using ext.args.

Note the extra header file for now is stored in assets/ until I can get the hosted test_data updated (added in Jira as a separate ticket)

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@tkchafin tkchafin linked an issue Jun 28, 2024 that may be closed by this pull request
@tkchafin tkchafin requested a review from muffato July 1, 2024 09:17
Copy link
Member

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the test profile, this is how the BAM file is modified:

$ samtools view -H mMelMel3.merge.bam
@HD     VN:1.6  SO:coordinate
@SQ     SN:OV277441.1   LN:7999920
@RG     ID:m64094_200910_173211 PL:PACBIO       SM:mMelMel3
@RG     ID:m64094_200911_174739 PL:PACBIO       SM:mMelMel3
@PG     ID:minimap2     PN:minimap2     VN:2.24-r1122   CL:minimap2 -ax map-hifi --cs=short -R @RG\tID:m64094_200910_173211\tPL:PACBIO\tSM:mMelMel3 -I1.0G -t 2 -a GCA_922984935.2.subset.unmasked.fasta mMelMel3_T3.fastq.gz
@PG     ID:samtools     PN:samtools     PP:minimap2     VN:1.18 CL:samtools sort -@ 2 -o mMelMel3_T3.bam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.17 CL:samtools merge --threads 0 -c -p mMelMel3.merge.bam 1/mMelMel3_T3.bam 2/mMelMel3_T4.bam
@PG     ID:samtools.2   PN:samtools     PP:samtools.1   VN:1.20 CL:/usr/local/bin/samtools view -H mMelMel3.merge.bam
$ samtools view -H mMelMel3.reheader.bam
@HD     VN:1.0  SO:unsorted
@SQ     SN:OV277441.1   LN:7999920      M5:0457acf8690429f1c98ee545cb8573b8     UR:https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/assembly/release/mMelMel3.1_paternal_haplotype/GCA_922984935.2.subset.fasta.gz    AS:GCA_922984935.2      AN:SUPER_1      SP:Meles meles
@PG     ID:samtools     PN:samtools     VN:1.17 CL:samtools reheader GCA_922984935.2.subset.header.sam -
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.20 CL:/usr/local/bin/samtools view -H mMelMel3.reheader.bam

The issues are:

  • @HD SO is turned from coordinate to unsorted even though the file should still be sorted
  • @HD VN is downgraded from 1.6 to 1.0. Is that necessary ? I'd be concerned that other parts of the headers aren't 1.0 compatible, which might break parsers.
  • The @RG and original @PG are gone

By the way, check the description of TOLIT-1916. It has some sample code from Rob Davies (Sanger staff, in the Samtools team) to update the @SQ headers of a BAM file. He makes a new *.header.sam from the original BAM headers and the user-supplied headers, and then samtools reheader is called just once on that.

conf/modules.config Outdated Show resolved Hide resolved
@tkchafin
Copy link
Contributor Author

tkchafin commented Jul 4, 2024

The issues are:

  • @HD SO is turned from coordinate to unsorted even though the file should still be sorted
  • @HD VN is downgraded from 1.6 to 1.0. Is that necessary ? I'd be concerned that other parts of the headers aren't 1.0 compatible, which might break parsers.
  • The @RG and original @PG are gone

Ah I should have caught that... The new version only replaces @sq lines, and also maintains the input order of header entries from the bam file:

(nf-core) tc25@mib119777s 6389fdba930830d60cd6183622b089 % samtools view -H mMelMel3.merge.bam
@HD	VN:1.6	SO:coordinate
@SQ	SN:OV277441.1	LN:7999920
@RG	ID:m64094_200910_173211	PL:PACBIO	SM:mMelMel3
@RG	ID:m64094_200911_174739	PL:PACBIO	SM:mMelMel3
@PG	ID:minimap2	PN:minimap2	VN:2.24-r1122	CL:minimap2 -ax map-hifi --cs=short -R @RG\tID:m64094_200910_173211\tPL:PACBIO\tSM:mMelMel3 -I1.0G -t 2 -a GCA_922984935.2.subset.unmasked.fasta mMelMel3_T3.fastq.gz
@PG	ID:samtools	PN:samtools	PP:minimap2	VN:1.18	CL:samtools sort -@ 2 -o mMelMel3_T3.bam
@PG	ID:samtools.1	PN:samtools	PP:samtools	VN:1.17	CL:samtools merge --threads 0 -c -p mMelMel3.merge.bam 1/mMelMel3_T3.bam 2/mMelMel3_T4.bam
@PG	ID:samtools.2	PN:samtools	PP:samtools.1	VN:1.20	CL:samtools view -H mMelMel3.merge.bam


(nf-core) tc25@mib119777s 6389fdba930830d60cd6183622b089 % samtools view -H mMelMel3.reheader.bam
@HD	VN:1.6	SO:coordinate
@SQ	SN:OV277441.1	LN:7999920	M5:0457acf8690429f1c98ee545cb8573b8	UR:https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/assembly/release/mMelMel3.1_paternal_haplotype/GCA_922984935.2.subset.fasta.gz	AS:GCA_922984935.2	AN:SUPER_1	SP:Meles meles
@RG	ID:m64094_200910_173211	PL:PACBIO	SM:mMelMel3
@RG	ID:m64094_200911_174739	PL:PACBIO	SM:mMelMel3
@PG	ID:minimap2	PN:minimap2	VN:2.24-r1122	CL:minimap2 -ax map-hifi --cs=short -R @RG\tID:m64094_200910_173211\tPL:PACBIO\tSM:mMelMel3 -I1.0G -t 2 -a GCA_922984935.2.subset.unmasked.fasta mMelMel3_T3.fastq.gz
@PG	ID:samtools	PN:samtools	PP:minimap2	VN:1.18	CL:samtools sort -@ 2 -o mMelMel3_T3.bam
@PG	ID:samtools.1	PN:samtools	PP:samtools	VN:1.17	CL:samtools merge --threads 0 -c -p mMelMel3.merge.bam 1/mMelMel3_T3.bam 2/mMelMel3_T4.bam
@PG	ID:samtools.2	PN:samtools	PP:samtools.1	VN:1.17	CL:samtools reheader .temp.sorted.header.sam mMelMel3.merge.bam
@PG	ID:samtools.3	PN:samtools	PP:samtools.2	VN:1.20	CL:samtools view -H mMelMel3.reheader.bam

workflows/readmapping.nf Outdated Show resolved Hide resolved
workflows/readmapping.nf Outdated Show resolved Hide resolved
workflows/readmapping.nf Outdated Show resolved Hide resolved
modules/local/samtools_replaceheader.nf Outdated Show resolved Hide resolved
modules/local/samtools_replaceheader.nf Outdated Show resolved Hide resolved
subworkflows/local/align_short.nf Outdated Show resolved Hide resolved
subworkflows/local/align_short.nf Outdated Show resolved Hide resolved
subworkflows/local/convert_stats.nf Outdated Show resolved Hide resolved
subworkflows/local/convert_stats.nf Outdated Show resolved Hide resolved
modules/local/samtools_replaceheader.nf Outdated Show resolved Hide resolved
modules/local/samtools_replaceheader.nf Outdated Show resolved Hide resolved
tkchafin and others added 3 commits July 11, 2024 16:49
Co-authored-by: Matthieu Muffato <cortexspam-github@yahoo.fr>
Co-authored-by: Matthieu Muffato <cortexspam-github@yahoo.fr>
Co-authored-by: Matthieu Muffato <cortexspam-github@yahoo.fr>
Copy link
Member

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super, thank you !

@tkchafin tkchafin closed this Jul 12, 2024
@muffato muffato reopened this Jul 12, 2024
@tkchafin tkchafin merged commit 67badb4 into sanger-tol:dev Jul 12, 2024
12 checks passed
@tkchafin tkchafin deleted the samtools_reheader branch July 12, 2024 14:25
@tkchafin tkchafin mentioned this pull request Aug 20, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Embed the BAM headers from the insdcdownload pipeline
2 participants