-
Notifications
You must be signed in to change notification settings - Fork 260
Description
bcftools mpileup doesn't seem to check if there were errors while reading read data in. This means that bcftools mpileup can appear to have run successfully, but there may have been an underlying data reading problem. As an example, I created 3 bams and crams as follows:
good.{bam,cram} : these are a small valid bam/cram.
no_eof.{bam,cram}: these are good.{bam,cram} but with the eof markers removed from the end of the file.
truncated_corrupted.{bam,cram}: these are good.{bam,cram} but with a random set of bytes (including the eof bytes) removed from the end of the file.
I then ran these files through both bcftools mpileup and samtools view, using the following script.
#!/usr/bin/env bash
call_bcftools_mpileup() {
echo "bcftools mpileup " $1
bcftools mpileup -f ~/references/Homo_sapiens_assembly38.fasta -o /dev/null $1
echo "returns " $?
yes '' | head -n2
}
call_samtools_view() {
echo "samtools view " $1
samtools view -t ~/references/Homo_sapiens_assembly38.fasta -o /dev/null $1
echo "returns " $?
yes '' | head -n2
}
call_bcftools_mpileup good.cram
call_samtools_view good.cram
call_bcftools_mpileup good.bam
call_samtools_view good.bam
call_bcftools_mpileup no_eof.cram
call_samtools_view no_eof.cram
call_bcftools_mpileup no_eof.bam
call_samtools_view no_eof.bam
call_bcftools_mpileup truncated_corrupted.cram
call_samtools_view truncated_corrupted.cram
call_bcftools_mpileup truncated_corrupted.bam
call_samtools_view truncated_corrupted.bamthis results in the following output:
bcftools mpileup good.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns 0
samtools view good.cram
returns 0
bcftools mpileup good.bam
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns 0
samtools view good.bam
returns 0
bcftools mpileup no_eof.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
[W::hts_close] EOF marker is absent. The input is probably truncated
returns 0
samtools view no_eof.cram
[W::hts_close] EOF marker is absent. The input is probably truncated
returns 0
bcftools mpileup no_eof.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns 0
samtools view no_eof.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
returns 0
bcftools mpileup truncated_corrupted.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns 0
samtools view truncated_corrupted.cram
samtools view: error reading file "truncated_corrupted.cram"
returns 1
bcftools mpileup truncated_corrupted.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
[E::bgzf_read_block] Failed to read BGZF block data at offset 289804 expected 9098 bytes; hread returned 8902
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
returns 0
samtools view truncated_corrupted.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read_block] Failed to read BGZF block data at offset 289804 expected 9098 bytes; hread returned 8902
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
samtools view: error reading file "truncated_corrupted.bam"
samtools view: error closing "truncated_corrupted.bam": -1
returns 1A few things to note.
- When the EOF marker is missing, but the input data is otherwise fine, both
samtools viewandbcftools mpileupwill print a warning, but return a code of 0, indicating success. This feels a little counterintuitive to me (I would expect an error return code here), but certainly I see the argument for this behavior. - Much more problematically, IMO, when the truncated cram is used, bcftools will not print any warning or error at all, and will still return a code of 0. This is much more of a problem, imo, as this data is clearly corrupted. On this file, samtools does print an error and return a non-zero error code.
- The truncated bam behaves a little bit better, in that at least an error message is printed, though bcftools still returns a return code of 0.
My main concern about this is that this mimics the behavior we would see if a totally valid cram or bam was being streamed from a google bucket, and the google endpoint decided to return a 429 (or 404, or some other error code) at some point in the middle of streaming the data. I think that in this scenario bcftools would output a valid pileup file, with no indication that anything was amiss.