
Losing information after update from R9 to R10 flowcell #160

Closed
Lipinski-B opened this issue Jan 13, 2023 · 18 comments

Comments

@Lipinski-B

Hi,

I'm facing a curious problem. I sequenced the same sample twice: first with the R9 flowcell version and then with the R10 flowcell version, and it looks like I'm losing some information in the Clair3 results for a particular variant of interest. Here are some results:

Both runs were processed with:

- Guppy: ont_guppy_for_gridion 6.3.8
- Clair3: Clair3 v0.1-r11, run on a seqtk subsample of 1000 reads (quality higher than Q7) from the initial FASTQ
1. Command line for the R9.4.1 flowcell with Kit 12 chemistry:
/opt/bin/run_clair3.sh \
    --bam_fn="run_minimap2_ont.bam" --ref_fn="hg19.fasta" --threads=6 --platform="ont" \
    --model_path="r941_prom_sup_g5014" --output="run" --bed_fn="gene.bed" --enable_phasing

Result: I found my variant in 100% of my VCF outputs (among 100 tests).

2. Command line for the R10.4.1 flowcell with Kit 14 chemistry:
/opt/bin/run_clair3.sh \
    --bam_fn="run_minimap2_ont.bam" --ref_fn="hg19.fasta" --threads=6 --platform="ont" \
    --model_path="r1041_e82_400bps_hac_g632" --output="run" --bed_fn="gene.bed" --enable_phasing

Result: I found my variant in 91% of my VCF outputs (among 100 tests).

Do you know a potential reason why I'm losing my variant of interest in 9% of cases after upgrading the flowcell version from R9 to R10? I've also tested the R10 protocol with the r1041_e82_260bps_hac_g632 model, and none of the VCF outputs contained my variant of interest.

Many thanks in advance.
Best regards,
Boris

@aquaskyline
Member

Can you send over the BAM of one or a few failed R10 tests? Just the reads covering your variant of interest will be enough.

@Lipinski-B
Author

Sure! Is there a way to send it to you without going through GitHub?

@aquaskyline
Member

Please send to the two email addresses at https://github.com/HKU-BAL/Clair3#readme

@zhengzhenxian
Collaborator

Hi,
It seems you are using the hac model for the R10 data. Was your R10 data basecalled with the Guppy hac model or the sup model?

@Lipinski-B
Author

Hi, thanks for the answers, I'll send the BAM ASAP.
Our data were basecalled with the Guppy hac model.

@Lipinski-B
Author

I've just sent you one of the negative BAMs by email.
Many thanks in advance.

@Lipinski-B
Author

Lipinski-B commented Jan 17, 2023

A little update regarding this issue:

It looks like I manage to catch my variant of interest 100% of the time (among 100 tests) when I set the option --ref_pct_full to a threshold of 0.7.

As it's an experimental option, I don't know how risky it is to play with it. Could raising the threshold from 0.1 to 0.7 introduce false information into my VCF output in this context? I'll continue to investigate which threshold makes me find or lose my variant of interest.

In the meantime, I have to say that I don't clearly understand what this option does in the process.

--ref_pct_full=FLOAT      EXPERIMENTAL: Specify an expected percentage of low quality 0/0 variants called in the pileup mode for full-alignment mode calling, default: 0.3 for ilmn and hifi, 0.1 for ont.

Can you explain what the purpose of this option is, and why it changes my results so much?

Best regards,
Boris

@aquaskyline
Member

Changing --ref_pct_full to 0.7 is a good try. It runs slower because more variants go into the slower full-alignment calling, but the performance is monotonically better, if not much better, most of the time. If the slowdown is acceptable, or you value accuracy far more than speed, setting both --var_pct_full and --ref_pct_full to a higher value is suggested.
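For reference, a sketch of what that suggestion looks like applied to the R10.4.1 command posted earlier in this thread (same BAM, reference, BED, and model name; the only change is the two added candidate-selection flags):

```shell
# Variant of the R10.4.1 command from earlier in this thread, with both
# candidate-selection fractions raised. Paths and model name are the ones
# used above; adjust to your own setup.
/opt/bin/run_clair3.sh \
    --bam_fn="run_minimap2_ont.bam" --ref_fn="hg19.fasta" --threads=6 --platform="ont" \
    --model_path="r1041_e82_400bps_hac_g632" --output="run" --bed_fn="gene.bed" \
    --var_pct_full=0.7 --ref_pct_full=0.7 --enable_phasing
```

The values 0.7 here match the threshold discussed above; higher values (up to 1) trade more runtime for more candidates sent to full-alignment calling.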

@Lipinski-B
Author

Lipinski-B commented Jan 18, 2023

Very interesting, many thanks for this answer.

I don't really care about speed; what I really need is to be sure to see my variants of interest every time I run Clair3, no matter which sample is sequenced. So here I have two questions:

- Is it good practice to set both variables, --var_pct_full and --ref_pct_full, to 1 in order to catch all my variants of interest? Could that be problematic for the information in my final VCF?

- Why didn't this problem seem to appear when I ran Clair3 on an R9-sequenced sample, especially with the same pipeline protocol applied? Is it a question of read quality, or a potential mismatch between the chemistry and the Clair3 model on R10?

@aquaskyline
Member

The full-alignment model runs slower but is more sensitive and accurate than the pileup model in repetitive and low-complexity regions. Setting --var_pct_full and --ref_pct_full to 1 means sending all reference-call and variant-call candidates to the full-alignment model. If the 100 tests you've done are at relatively more challenging genomic positions, setting --var_pct_full and --ref_pct_full to 1 certainly leads to more correct answers. The default parameters work best for whole-genome calling. In your use case, if you are only genotyping the 100 positions of interest, setting both --var_pct_full and --ref_pct_full to a higher value, or even 1, is reasonable. Regarding why R9 got more correct answers than R10 in your tests, there are a few possible explanations; let me look into the case you sent and get back to you later.
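To make the effect of these fractions concrete, here is an illustrative sketch (not Clair3's actual code) of how a percentage parameter like --ref_pct_full can be turned into a selection of candidates: the lowest-quality fraction of pileup calls is handed to the slower full-alignment model, so raising the fraction pulls more borderline calls into re-examination.

```python
# Illustrative sketch, NOT Clair3's implementation: select the
# lowest-quality `pct_full` fraction of pileup calls for re-calling
# with the full-alignment model.

def select_low_quality_calls(qualities, pct_full):
    """Return the set of indices of the lowest-quality fraction of calls."""
    if pct_full <= 0:
        return set()
    n = max(1, round(len(qualities) * pct_full))
    order = sorted(range(len(qualities)), key=lambda i: qualities[i])
    return set(order[:n])

# Hypothetical pileup qualities for ten candidate positions
quals = [14.3, 25.0, 12.9, 30.1, 8.7, 19.4, 11.2, 27.5, 9.9, 16.0]
print(sorted(select_low_quality_calls(quals, 0.1)))  # bottom 10%: [4]
print(sorted(select_low_quality_calls(quals, 0.7)))  # bottom 70%
print(len(select_low_quality_calls(quals, 1.0)))     # everything: 10
```

A borderline RefCall just above the default cutoff is skipped at 0.1 but re-examined at 0.7, which matches the behavior reported above.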

@philres

philres commented Jan 18, 2023

Hi Boris,

Quick question: what's the depth of your datasets? Is it amplicon data with high depth (much higher than 60X, for example)?

Thanks,
Philipp

@mproberts99

Hi All,
Thank you for providing the tip to tweak those parameters. I am also running the R10 Clair3 models on R10 data and was seeing worse performance than with R9, despite using the same samples. The data itself looks very clean in IGV, so it seems to be something wrong with the caller. Our data is amplicon data (around 1000x), so could it be the subsampling? Is it different for R10 vs. R9?

Thank you!

@aquaskyline
Member

@mproberts99 Thank you for the information. Working with colleagues from ONT, we noticed that the R10 models were trained with coverage relatively lower than that of the R9 models. For high-depth amplicon data, an immediate solution is to set --var_pct_full and --ref_pct_full to 1 to send all variant candidates to the full-alignment model, which always subsamples the input to at most 89-fold. In the long run, we are working with colleagues from ONT to either increase the training coverage or fine-tune the code to handle high-depth amplicon input. Will post more updates here.
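The depth cap mentioned above can be sketched as follows; this is a simplified stand-in for illustration, not Clair3's actual subsampling code, and the 89-fold value is the one stated in this thread:

```python
import random

# Illustrative sketch of depth capping: if more reads than `max_depth`
# cover a candidate position, keep only a random subset of `max_depth`
# reads. NOT Clair3's actual implementation.

def cap_depth(read_ids, max_depth=89, seed=0):
    if len(read_ids) <= max_depth:
        return list(read_ids)
    rng = random.Random(seed)
    return rng.sample(list(read_ids), max_depth)

reads = [f"read_{i}" for i in range(1000)]  # e.g. ~1000x amplicon coverage
kept = cap_depth(reads)
print(len(kept))  # → 89
```

This is why very high-depth amplicon input effectively behaves like ~89x input in full-alignment mode, regardless of the original coverage.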

@Lipinski-B
Author

Lipinski-B commented Jan 19, 2023

Hi everyone,

@philres Prior to the variant calling, I take a subsample of 1000 reads. Given that I keep reads with quality higher than Q7, what I see is that in every case I get a depth higher than 950X, as calculated with the mosdepth software.

@mproberts99 I had this hypothesis too, that the subsampling was skewing our results by losing information. What I saw from many bootstraps is that I can always recover all the requested variant information when running on R9, compared to the R10 tests (where I lose information in 9% of cases, without touching any options). Curiously, I had exactly the opposite problem for another sample: I was able to catch my variants of interest on R10, but not on R9. Given that in my case I'm sequencing only one gene, I suppose it is more related to pipeline options/models than to the subsampling, even if the subsampling can still introduce variability in the results by picking reads of higher or lower quality, which could perhaps play a role in the Clair3 process.

@aquaskyline Thanks again. I'm happy to confirm that by setting --var_pct_full, --ref_pct_full and --var_pct_phasing to 1, I'm able to catch 100% of my variants of interest in the VCF output using the R10 flowcells.

@Lipinski-B
Author

Hi,

I would like to share a problem I have when doing variant calling in the BRCA1 gene. I've run 2 iterations of my pipeline on the same sequenced sample, and I get 2 different results.

Let me explain:

My pipeline takes a subsample of 1000 reads above Q15 quality from the raw FASTQ files, and then does variant calling with Clair3 with --var_pct_full, --ref_pct_full and --var_pct_phasing set to 1.
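The subsampling step described above can be sketched like this (illustrative, with hypothetical helper names; a real pipeline would use a tool like seqtk, but the logic is the same):

```python
import random

# Illustrative sketch of the pipeline's subsampling step: keep reads
# whose mean Phred quality is above Q15, then randomly draw 1000 of them.
# Hypothetical helper names, not part of any real pipeline.

def mean_phred(qual_string):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in qual_string) / len(qual_string)

def subsample_reads(reads, min_q=15, n=1000, seed=0):
    """reads: list of (name, sequence, quality_string) tuples."""
    passing = [r for r in reads if mean_phred(r[2]) > min_q]
    rng = random.Random(seed)
    return passing if len(passing) <= n else rng.sample(passing, n)

# 'A' encodes Phred 32 (passes Q15), '#' encodes Phred 2 (fails)
good = [("r%d" % i, "ACGT", "AAAA") for i in range(1500)]
bad = [("b%d" % i, "ACGT", "####") for i in range(200)]
sub = subsample_reads(good + bad)
print(len(sub))  # → 1000
```

Note that the draw is random: two iterations see slightly different read sets, which is one source of run-to-run variability in any downstream calling.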

In the first iteration, I got this result for the variant:

17	43068303	rs8176228;209352	C	A	25.09	PASS	...

It appears in pileup.vcf.gz :

17      43068303        .       C       .       12.90   RefCall P       GT:GQ:DP:AF     0/0:12:858:0.5035

It appears in full_alignment.vcf.gz :

17      43068303        .       C       A       25.09   PASS    F       GT:GQ:DP:AF     0/1:25:858:0.4895

I can see in the log that this variant appears as "RefCall" in pileup mode, and is then found again as 0/1 in the full pipeline. Here is the Clair3 log:

[INFO] 3/7 Phase VCF file using Whatshap
This is WhatsHap 1.4 running under Python 3.9.0
Working on 1 samples from 1 family
======== Working on chromosome '17'
---- Processing individual SAMPLE
Using maximum coverage per sample of 15X
Number of variants skipped due to missing genotypes: 0
Number of remaining heterozygous variants: 3
Reading alignments on chromosome 17 and detecting alleles ...
Found 926 reads covering 3 variants
Kept 793 reads that cover at least two variants each
Reducing coverage to at most 15X by selecting most informative reads ...
Selected 15 reads covering 3 variants
Variants covered by at least one phase-informative read in at least one individual after read selection: 3
Phasing 1 sample by solving the MEC problem ...
MEC cost: 60
No. of phased blocks: 1
Largest block contains 3 variants (100.0% of accessible variants) between position 43065857 and 43067787
======== Writing VCF
Done writing VCF

== SUMMARY ==
Maximum memory usage: 0.209 GB
Time spent reading BAM/CRAM:                    1.2 s
Time spent parsing VCF:                         0.0 s
Time spent selecting reads:                     0.1 s
Time spent phasing:                             0.0 s
Time spent writing VCF:                         0.0 s
Time spent finding components:                  0.0 s
Time spent on rest:                             0.6 s
Total elapsed time:                             1.9 s

real	0m2.817s
user	0m0.921s
sys	0m0.521s

[INFO] 5/7 Select candidates for full-alignment calling
[INFO] Set variants quality cutoff 12.0
[INFO] Set reference calls quality cutoff 14.0
[INFO] Low quality reference calls to be processed in 17: 12
[INFO] Low quality variants to be processed in 17: 6

real	0m0.851s
user	0m0.305s
sys	0m0.215s

[INFO] 6/7 Call low-quality variants using full-alignment model
Calling variants ...
Total processed positions in 17 (chunk 1/1) : 18
Total time elapsed: 19.45 s

real	0m24.078s
user	0m3.960s
sys	0m0.877s

[INFO] 7/7 Merge pileup VCF and full-alignment VCF
[INFO] Pileup variants processed in 17: 1
[INFO] Full-alignment variants processed in 17: 16

real	0m2.127s
user	0m0.403s
sys	0m0.264s
[INFO] 7/7 Phasing VCF output in parallel using WhatsHap
This is WhatsHap 1.4 running under Python 3.9.0
Working on 1 samples from 1 family
======== Working on chromosome '17'
---- Processing individual SAMPLE
Using maximum coverage per sample of 15X
Number of variants skipped due to missing genotypes: 0
Number of remaining heterozygous variants: 10
Reading alignments on chromosome 17 and detecting alleles ...
Found 936 reads covering 10 variants
Kept 936 reads that cover at least two variants each
Reducing coverage to at most 15X by selecting most informative reads ...
Selected 15 reads covering 10 variants
Best-case phasing would result in 1 non-singleton phased blocks (1 in total)
... after read selection: 1 non-singleton phased blocks (1 in total)
Variants covered by at least one phase-informative read in at least one individual after read selection: 10
Phasing 1 sample by solving the MEC problem ...
MEC cost: 390
No. of phased blocks: 1
Largest block contains 10 variants (100.0% of accessible variants) between position 43063808 and 43068303
======== Writing VCF
Done writing VCF

But in the second iteration of my pipeline, this variant doesn't appear in the full-alignment VCF file.

It appears in pileup.vcf.gz :

17      43068303        .       C       .       14.31   RefCall P       GT:GQ:DP:AF     0/0:14:870:0.5046

and the Clair3 log is here:

[INFO] 3/7 Phase VCF file using Whatshap
This is WhatsHap 1.4 running under Python 3.9.0
Working on 1 samples from 1 family
======== Working on chromosome '17'
---- Processing individual SAMPLE
Using maximum coverage per sample of 15X
Number of variants skipped due to missing genotypes: 0
Number of remaining heterozygous variants: 3
Reading alignments on chromosome 17 and detecting alleles ...
Found 932 reads covering 3 variants
Kept 773 reads that cover at least two variants each
Reducing coverage to at most 15X by selecting most informative reads ...
Selected 15 reads covering 3 variants
Variants covered by at least one phase-informative read in at least one individual after read selection: 3
Phasing 1 sample by solving the MEC problem ...
MEC cost: 60
No. of phased blocks: 1
Largest block contains 3 variants (100.0% of accessible variants) between position 43065857 and 43067787
======== Writing VCF
Done writing VCF

== SUMMARY ==
Maximum memory usage: 0.208 GB
Time spent reading BAM/CRAM:                    1.3 s
Time spent parsing VCF:                         0.0 s
Time spent selecting reads:                     0.1 s
Time spent phasing:                             0.0 s
Time spent writing VCF:                         0.0 s
Time spent finding components:                  0.0 s
Time spent on rest:                             0.2 s
Total elapsed time:                             1.6 s

real	0m2.320s
user	0m0.624s
sys	0m0.208s

[INFO] 5/7 Select candidates for full-alignment calling
[INFO] Set variants quality cutoff 10.0
[INFO] Set reference calls quality cutoff 14.0
[INFO] Low quality reference calls to be processed in 17: 12
[INFO] Low quality variants to be processed in 17: 5

real	0m0.634s
user	0m0.240s
sys	0m0.078s

[INFO] 6/7 Call low-quality variants using full-alignment model
Calling variants ...
Total processed positions in 17 (chunk 1/1) : 17
Total time elapsed: 4.60 s

real	0m7.262s
user	0m3.035s
sys	0m0.451s

[INFO] 7/7 Merge pileup VCF and full-alignment VCF
[INFO] Pileup variants processed in 17: 1
[INFO] Full-alignment variants processed in 17: 15

real	0m0.546s
user	0m0.282s
sys	0m0.106s
[INFO] 7/7 Phasing VCF output in parallel using WhatsHap
This is WhatsHap 1.4 running under Python 3.9.0
Working on 1 samples from 1 family
======== Working on chromosome '17'
---- Processing individual SAMPLE
Using maximum coverage per sample of 15X
Number of variants skipped due to missing genotypes: 0
Number of remaining heterozygous variants: 9
Reading alignments on chromosome 17 and detecting alleles ...
Found 946 reads covering 9 variants
Kept 946 reads that cover at least two variants each
Reducing coverage to at most 15X by selecting most informative reads ...
Selected 15 reads covering 9 variants
Best-case phasing would result in 1 non-singleton phased blocks (1 in total)
... after read selection: 1 non-singleton phased blocks (1 in total)
Variants covered by at least one phase-informative read in at least one individual after read selection: 9
Phasing 1 sample by solving the MEC problem ...
MEC cost: 120
No. of phased blocks: 1
Largest block contains 9 variants (100.0% of accessible variants) between position 43063808 and 43068206
======== Writing VCF
Done writing VCF

I don't understand. Do you know why I lose my variant from one iteration to the other in the full-alignment pipeline?

Many thanks in advance.
Best regards,
Boris

@Lipinski-B Lipinski-B reopened this Jun 30, 2023
@aquaskyline
Member

Weird. I see log differences at the phasing step. Could you run a third iteration and see if the log changes again?

@Lipinski-B
Author

Hi,

I've taken the time to run several double iterations.
I'm surprised to see this kind of mistake several times: sometimes a variant called in the first iteration is missed in the second iteration. I've always run this on BRCA1, always from the same dataset, but I see this mistake for several variants, not only the one at 43068303 that I showed you above (e.g. 43063808 and 43067787).

We stay in touch,
Best regards,
Boris

@aquaskyline
Member

aquaskyline commented Jul 20, 2023

If possible, could you please share the data with me for troubleshooting?
