Skip to content

bcftools mpileup handling of corrupted/truncated data/streaming errors #2177

@kachulis

Description

@kachulis

bcftools mpileup doesn't seem to check if there were errors while reading read data in. This means that bcftools mpileup can appear to have run successfully, but there may have been an underlying data reading problem. As an example, I created 3 bams and crams as follows:
good.{bam,cram} : these are a small valid bam/cram.
no_eof.{bam,cram}: these are good.{bam,cram} but with the eof markers removed from the end of the file.
truncated_corrupted.{bam,cram}: these are good.{bam,cram} but with a random set of bytes (including the eof bytes) removed from the end of the file.

I then ran these files through both bcftools mpileup and samtools view, using the following script.

#!/usr/bin/env bash

call_bcftools_mpileup() {
	echo "bcftools mpileup " $1
	bcftools mpileup -f ~/references/Homo_sapiens_assembly38.fasta -o /dev/null $1
	echo "returns " $?
	yes '' | head -n2
}

call_samtools_view() {
	echo "samtools view " $1
	samtools view -t ~/references/Homo_sapiens_assembly38.fasta -o /dev/null $1
	echo "returns " $?
	yes '' | head -n2
}

call_bcftools_mpileup good.cram

call_samtools_view good.cram

call_bcftools_mpileup good.bam

call_samtools_view good.bam

call_bcftools_mpileup no_eof.cram

call_samtools_view no_eof.cram

call_bcftools_mpileup no_eof.bam

call_samtools_view no_eof.bam

call_bcftools_mpileup truncated_corrupted.cram

call_samtools_view truncated_corrupted.cram

call_bcftools_mpileup truncated_corrupted.bam

call_samtools_view truncated_corrupted.bam

this results in the following output:

bcftools mpileup  good.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns  0


samtools view  good.cram
returns  0


bcftools mpileup  good.bam
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns  0


samtools view  good.bam
returns  0


bcftools mpileup  no_eof.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
[W::hts_close] EOF marker is absent. The input is probably truncated
returns  0


samtools view  no_eof.cram
[W::hts_close] EOF marker is absent. The input is probably truncated
returns  0


bcftools mpileup  no_eof.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns  0


samtools view  no_eof.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
returns  0


bcftools mpileup  truncated_corrupted.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns  0


samtools view  truncated_corrupted.cram
samtools view: error reading file "truncated_corrupted.cram"
returns  1


bcftools mpileup  truncated_corrupted.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
[E::bgzf_read_block] Failed to read BGZF block data at offset 289804 expected 9098 bytes; hread returned 8902
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
returns  0


samtools view  truncated_corrupted.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read_block] Failed to read BGZF block data at offset 289804 expected 9098 bytes; hread returned 8902
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
samtools view: error reading file "truncated_corrupted.bam"
samtools view: error closing "truncated_corrupted.bam": -1
returns  1

A few things to note.

  1. When the EOF marker is missing, but the input data is otherwise fine, both samtools view and bcftools mpileup will print a warning, but return a code of 0, indicating success. This feels a little counterintuitive to me (I would expect an error return code here), but certainly I see the argument for this behavior.
  2. Much more problematically, IMO, when the truncated cram is used, bcftools will not print any warning or error at all, and will still return a code of 0. This is much more of a problem, imo, as this data is clearly corrupted. On this file, samtools does print an error and return a non-zero error code.
  3. The truncated bam behaves a little bit better, in that at least an error message is printed, though bcftools still returns a return code of 0.

My main concern about this is that this mimics the behavior we would see if a totally valid cram or bam was being streamed from a google bucket, and the google endpoint decided to return a 429 (or 404, or some other error code) at some point in the middle of streaming the data. I think that in this scenario bcftools would output a valid pileup file, with no indication that anything was amiss.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions