Empty output "Insufficiently many confident reads for aggregating across runs" #27

Open
marromesc opened this issue Feb 6, 2024 · 3 comments

@marromesc

Hi,

This is my first time using QUILT. I am trying to impute genotypes at the positions of ~300 SNVs.

To preprocess the BAM files, I aligned reads to GRCh38 with BWA, marked and removed PCR duplicates with Picard, and filtered out reads with MAPQ < 37 with samtools.

It works well for most of them, but one specific location gives me an empty output.

With buffer=1000, it prints the following message: "Insufficiently many confident reads for aggregating across runs"

With buffer=500000, the output is still empty, but the message changes to: "There are 5 out of 18 regions that have been flipped by consensus"

  1. What do these error messages mean?
  2. What criteria should I use to pick an appropriate buffer size?
  3. Is it good practice to remove duplicates and low-quality reads before running QUILT?

Thank you.

Best,
Maria

@Zilong-Li
Collaborator

Hi,

I think QUILT reads BAM files directly and takes care of them itself. Normally you don't need to preprocess the BAM files, and in particular there is no need to filter them on quality scores. The message appears because too few reads are left after your preprocessing. Increasing the buffer won't help with imputing regions that have no reads at all. A buffer size of 250000 is big enough for most regions.
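To see how many reads a MAPQ filter leaves at a given position, something like the following could be used. This is only a minimal sketch and not part of QUILT: the function names `reference_end` and `count_overlapping_reads` are hypothetical, the SAM records are made up for illustration, and it assumes plain-text SAM lines as input.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
_REF_CONSUMING = set("MDN=X")
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_end(pos, cigar):
    """Return the 1-based inclusive end of an alignment that starts at pos."""
    span = sum(int(n) for n, op in _CIGAR_RE.findall(cigar) if op in _REF_CONSUMING)
    # A missing CIGAR ("*") gives span 0; treat the read as covering one base.
    return pos + max(span, 1) - 1


def count_overlapping_reads(sam_lines, chrom, start, end, min_mapq=0):
    """Count mapped, non-duplicate reads overlapping chrom:start-end (1-based)."""
    n = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        pos, mapq, cigar = int(fields[3]), int(fields[4]), fields[5]
        if flag & 0x4 or flag & 0x400:
            continue  # unmapped, or marked as a PCR/optical duplicate
        if rname != chrom or mapq < min_mapq:
            continue
        if pos <= end and reference_end(pos, cigar) >= start:
            n += 1
    return n


# Made-up records, for illustration only.
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000000",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t120\t20\t50M\t*\t0\t0\t*\t*",
    "r3\t1024\tchr1\t130\t60\t50M\t*\t0\t0\t*\t*",
    "r4\t0\tchr1\t900\t60\t50M\t*\t0\t0\t*\t*",
]

print(count_overlapping_reads(toy_sam, "chr1", 140, 160))               # 2
print(count_overlapping_reads(toy_sam, "chr1", 140, 160, min_mapq=37))  # 1
```

On real data the lines could come from `samtools view` on an indexed BAM with a region argument, and comparing the unfiltered and filtered counts shows how much the MAPQ < 37 cutoff removes at the problem site.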

@marromesc
Author

Hi,

I checked the BAM file in IGV and it does indeed have only one read covering that region, but I guess that is expected with shallow sequencing data. Is there a minimum number of reads needed to impute a genotype?

Thanks for your answer. Your clarifications about BAM preprocessing and the buffer parameter are very useful.

Best,
Maria

@rwdavies
Owner

rwdavies commented Feb 9, 2024

The messages from your first post relate to how QUILT tries to do phasing.

Basically, it tries multiple starts (normally 7) and gets read assignments from each of them. Then, in the final phasing round, it tries to build a best set from them and proceeds with that. That's what the "There are 5 out of 18 regions that have been flipped by consensus" message means: the consensus process is trying to come up with a best read phasing. Similarly, "Insufficiently many confident reads for aggregating across runs" means it can't do this process, as there are too few "confident" reads (reads that map confidently to one haplotype or the other). I wouldn't consider either of these to be error messages. The program should still run; they are just meant to be informative.

I wouldn't say there's a minimum number of reads or depth needed to impute a sample. With some mouse samples I've seen excellent results with less than 0.1X. It really depends on how related the samples are and how long the LD blocks are.

Hope that helps,
Robbie
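
The idea of combining read assignments from several starts can be pictured with a toy sketch. This is an illustration only and is not QUILT's actual algorithm: the function names, the whole-run label flipping, the majority vote, and the `min_fraction` and `min_confident` thresholds are all assumptions made up for this example.

```python
def align_to_reference(run, reference):
    """Flip a run's haplotype labels (0 <-> 1) if that matches reference better.

    Haplotype labels are arbitrary in each run, so a run that assigns reads
    the same way but with swapped labels should count as agreeing.
    """
    agree = sum(a == b for a, b in zip(run, reference))
    if agree * 2 < len(run):
        return [1 - a for a in run], True
    return list(run), False


def consensus_assignments(runs, min_fraction=0.8, min_confident=3):
    """Majority-vote read labels across runs.

    Returns (consensus, n_flipped). consensus is None when fewer than
    min_confident reads get the same label in at least min_fraction of runs.
    """
    aligned, n_flipped = [], 0
    for run in runs:
        fixed, flipped = align_to_reference(run, runs[0])
        aligned.append(fixed)
        n_flipped += int(flipped)

    consensus, n_confident = [], 0
    for labels in zip(*aligned):  # one tuple of labels per read
        frac_one = sum(labels) / len(labels)
        consensus.append(1 if frac_one >= 0.5 else 0)
        if max(frac_one, 1 - frac_one) >= min_fraction:
            n_confident += 1

    if n_confident < min_confident:
        return None, n_flipped  # too few confident reads to aggregate
    return consensus, n_flipped


# Five reads, three runs; the second run has its labels swapped.
runs = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 1], [0, 0, 1, 1, 1]]
print(consensus_assignments(runs))               # ([0, 0, 1, 1, 0], 1)

# A single read, as in the region discussed above: nothing to aggregate.
print(consensus_assignments([[0], [1], [0]]))    # (None, 1)
```

In this toy version, a region covered by a single read can never reach the confident-read threshold, which is the kind of situation the "Insufficiently many confident reads" message describes.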
