Skip to content

Conversation

@heuermh
Copy link
Contributor

@heuermh heuermh commented Sep 11, 2019

ADAM fails unit tests after bumping the htsjdk dependency version from 2.19.x to 2.20.x, see
bigdatagenomics/adam#2195

This pull request demonstrates that Hadoop-BAM shares the same issue.

$ mvn test -Dtest=TestSAMInputFormat
...
[ERROR] Tests run: 4, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 2.77 s <<< FAILURE! - in org.seqdoop.hadoop_bam.TestSAMInputFormat
[ERROR] testCorruptedHeaderSAM(org.seqdoop.hadoop_bam.TestSAMInputFormat)  Time elapsed: 0.829 s  <<< ERROR!
htsjdk.samtools.SAMFormatException:
Error parsing text SAM file. RNAME '75M' not found in any SQ record; Line 3
Line: 1	60	75M	*	0	0	GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA	*	NM:i:0	AS:i:75	XS:i:0
	at org.seqdoop.hadoop_bam.TestSAMInputFormat.testCorruptedHeaderSAM(TestSAMInputFormat.java:151)

[ERROR] testCorruptedReadNameSAM(org.seqdoop.hadoop_bam.TestSAMInputFormat)  Time elapsed: 0.033 s  <<< FAILURE!
org.junit.ComparisonFailure: expected:<[simread:1:2]6472783:false> but was:<[]6472783:false>
	at org.seqdoop.hadoop_bam.TestSAMInputFormat.testCorruptedReadNameSAM(TestSAMInputFormat.java:131)

[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   TestSAMInputFormat.testCorruptedReadNameSAM:131 expected:<[simread:1:2]6472783:false> but was:<[]6472783:false>
[ERROR] Errors:
[ERROR]   TestSAMInputFormat.testCorruptedHeaderSAM:151 » SAMFormat Error parsing text S...
[INFO]
[ERROR] Tests run: 4, Failures: 1, Errors: 1, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

@heuermh
Copy link
Contributor Author

heuermh commented Sep 17, 2019

@lbergelson @cmnbroad This pull request demonstrates an issue we see in ADAM and in Hadoop-BAM but not in disq. Might you be able to help me make a reproducing case in htsjdk, to see if the problem is there or in Hadoop-BAM?

@cmnbroad
Copy link
Collaborator

@heuermh - that seems bizarre. I'll have a look in the next day or so and see if I can track it down or repro.

@cmnbroad
Copy link
Collaborator

@heuermh It looks like these failures are caused by the combination of a change in htsjdk and an invalid assumption inherent in this code in WorkaroundingStream. That code block in WorkaroundingStream probably never executed before, but does now because SAMFileHeader.getTextHeader() has been deprecated and was changed to always return null (I believe that change was motivated by a performance issue with very large headers).

The code in WorkaroundingStream assumes that the length of the string resulting from re-encoding the header object will be the same as the original header's text length, and therefore can be used as an offset to find the end of the header in the input stream. Both of the failing test cases are missing the optional @HD line, so in those cases the re-encoded text is longer than the original because the encoder inserts an @HD line during encode. You can verify this by seeing that the tests pass if the @HD line is added to the input files. Not sure what they real fix should be though, the WorkaroundingStream seems pretty fragile.

@lbergelson
Copy link
Contributor

@cmnbroad Wow, nice work tracking that down.

@heuermh
Copy link
Contributor Author

heuermh commented Sep 18, 2019

Ah right, I mistakenly waded into the SAMFileHeader.getTextHeader() mess once before. ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants