Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to get p values for significant LD? #51

Open
beausoleilmo opened this issue Aug 10, 2023 · 5 comments
Open

How to get p values for significant LD? #51

beausoleilmo opened this issue Aug 10, 2023 · 5 comments

Comments

@beausoleilmo
Copy link

beausoleilmo commented Aug 10, 2023

I ran something like this:

ngsLD --geno $SUBBEAGLE/$SUBBEAGLE.gz \
      --n_ind $nind \
      --n_sites $nsites \
      --pos $POSFILE \
      --max_kb_dist 0 \
      --max_snp_dist 0 \
      --min_maf 0.001 \
      --extend_out \
      --n_threads $SLURM_CPUS_PER_TASK \
      --outH ${SUBBEAGLE.}/$SUBBEAGLE.ld
==> Input Arguments:
        geno: $SUBBEAGLE/$SUBBEAGLE.gz
        probs: false
        log_scale: false
        n_ind: 482
        n_sites: 15
        pos: $SUBBEAGLE/$SUBBEAGLE.pos.gz (WITHOUT header)
        max_kb_dist (kb): 0
        max_snp_dist: 0
        min_maf: 0.001000
        ignore_miss_data: false
        call_geno: false
        N_thresh: 0.000000
        call_thresh: 0.000000
        rnd_sample: 1.000000
        seed: 1691677158
        extend_out: true
        out: $SUBBEAGLE/$SUBBEAGLE.ld (WITH header)
        n_threads: 1
        verbose: 1
        version: 1.1.1 (Sep 22 2022 @ 15:06:57)

==> GZIP input file (not BINARY)
> Reading data from file...
> Header found! Skipping line...
==> Calculating MAF for all sites...
==> Getting sites coordinates
==> Launching threads...
==> Waiting for all threads to finish...
==> Freeing memory...

And now have a .ld file with the chi2 column.

I loaded the file, in R, and got:

datalist = lapply(all.ld.files,
                  function(x) read.table(x, 
                                         header=F, 
                                         col.names = c("site1", "site2",
                                                       "dist", "r^2_ExpG",
                                                       "D", "D'", "r^2",
                                                       "sample_size", "maf1", "maf2", 
                                                       "hap00", "hap01", "hap10", "hap11", 
                                                       "hap_maf1", "hap_maf2", "chi2", "loglike", "nIter"
                                                       ))) 

# Calculate p values from chi square values
datalist$pvals = pchisq(datalist$chi2, df = 1, lower.tail = TRUE)

# Adjusting p-values
datalist$pvals.adj = p.adjust(p = datalist$pvals, method = "fdr")

Basically, all the positions that have significant LD have r2 = 0. Not sure if I'm doing something incorrect here.

Here's a glimpse at the output

site1               site2 dist r.2_ExpG         D       D.      r.2 sample_size     maf1     maf2    hap00    hap01    hap10    hap11 hap_maf1 hap_maf2     chi2 loglike nIter       pvals pvals.adj sign r.2.rd dp.rd
1  chr1:96001606 chr2:46021459  Inf 0.037464  0.007195 0.392740 0.040863         482 0.070539 0.019710 0.918336 0.011125 0.061954 0.008585 0.070539 0.019710 0.040863       0     5 0.160197402 0.2306321        0.04  0.39
2  chr1:96001606 chr3:37364972  Inf 0.359250  0.044935 0.697829 0.387171         482 0.070539 0.087137 0.893405 0.036055 0.019458 0.051082 0.070539 0.087137 0.387171       0     5 0.466209795 0.4799218        0.39  0.70
3  chr1:96001606 chr4:17465385  Inf 0.020404  0.004569 0.338483 0.022247         482 0.070539 0.014523 0.920531 0.008929 0.064946 0.005593 0.070539 0.014523 0.022247       0     5 0.118568106 0.2215031        0.02  0.34
4  chr1:96001606 chr5:19425299  Inf 0.071550  0.016190 0.329240 0.079785         482 0.070539 0.052905 0.896477 0.032983 0.050618 0.019921 0.070539 0.052905 0.079785       0     6 0.222411016 0.2730302        0.08  0.33
5  chr1:96001606 chr5:19434057  Inf 0.000381  0.001102 0.026575 0.000434         482 0.070539 0.044606 0.889103 0.040358 0.066291 0.004248 0.070539 0.044606 0.000434       0     7 0.016620861 0.1437960    *  0.00* 0.03*
6  chr1:96001606 chr6:1161490   Inf 0.000607  0.000665 0.068970 0.000657         482 0.070539 0.010373 0.920484 0.008977 0.069143 0.001397 0.070539 0.010373 0.000657       0     5 0.020449147 0.1437960    *  0.00* 0.07*
7  chr1:96001606 chr7:16108831  Inf 0.004083  0.004049 0.077763 0.004728         482 0.070539 0.056017 0.881444 0.048016 0.062539 0.008000 0.070539 0.056017 0.004728       0     7 0.054819703 0.1801618        0.00  0.08
8  chr1:96001606 chr8:33199177  Inf 0.037418  0.008408 0.335390 0.041084         482 0.070539 0.026971 0.912800 0.016661 0.060229 0.010310 0.070539 0.026971 0.041084       0     5 0.160624133 0.2306321        0.04  0.34
[...]
22 chr10:46021459 chr3:33200027  Inf 0.000022  0.000095 0.004952 0.000020         482 0.019710 0.023859 0.956997 0.023293 0.019144 0.000566 0.019710 0.023859 0.000020       0     3 0.003568236 0.1088302    *  0.00*  0.00*
30 chr9:37364972 chr4:19434057  Inf 0.004494 -0.003790 0.975188 0.004238         482 0.087137 0.044606 0.868354 0.044509 0.087040 0.000096 0.087137 0.044606 0.004238       0    19 0.051905558 0.1801618          0.00 0.98

Also, why sometimes D' is =1 when r2 is =0 ? And at other times, D' is =0 and r2 is =0... see lines 6, 22, and 30 for example.

I'm wondering if there is not something odd going on here:

ngsLD/ngsLD.cpp

Line 305 in 7001165

Dp = D / ( D < 0 ? -min(maf[0]*maf[1], (1-maf[0])*(1-maf[1])) : min(maf[0]*(1-maf[1]), (1-maf[0])*maf[1]) );

Also, why is the loglike hard coded as '0.0'?

ngsLD/ngsLD.cpp

Line 348 in 7001165

0.0,

@fgvieira
Copy link
Owner

Basically, all the positions that have significant LD have r2 = 0. Not sure if I'm doing something incorrect here.

It might be because of low allele frequencies; did you check sites with higher frequencies?

Also, why sometimes D' is =1 when r2 is =0 ? And at other times, D' is =0 and r2 is =0... see lines 6, 22, and 30 for example.

I'm wondering if there is not something odd going on here:

ngsLD/ngsLD.cpp

Line 305 in 7001165

Dp = D / ( D < 0 ? -min(maf[0]*maf[1], (1-maf[0])*(1-maf[1])) : min(maf[0]*(1-maf[1]), (1-maf[0])*maf[1]) );

I can see what you mean, but it is possible for Dprime to be 1 while r^2 is close to 0 (see e.g. last slide). I quickly doubled checked the formulas and they seem correct (see here), but let me know if you find an error.

Also, why is the doglike hard coded as '0.0'?

ngsLD/ngsLD.cpp

Line 348 in 7001165

0.0,

Outputting the final log likelihood has not been fully implemented yet, so that is just a place holder (for now).

@beausoleilmo
Copy link
Author

beausoleilmo commented Aug 17, 2023

Just noted that pchisq(datalist$chi2, df = 1, lower.tail = TRUE), might be more appropriate with lower.tail = FALSE. By 'low allele frequencies' do you mean that both maf1 and maf2 should be high for significance to be achieved or you mean freq_A and freq_B?

dfdf$freq_A = dfdf$hap00 + dfdf$hap01
dfdf$freq_B = dfdf$hap00 + dfdf$hap10

These frequencies are actually quite high in this data.
0.929461 0.980290

I'm not that familiar with C code.

But, actually, if I look here: https://github.com/fgvieira/ngsLD/blob/afb35a355cc28926c1588ab8a643c0e65941f579/ngsLD.cpp#L330C1-L332C1

Shouldn't the frequency be maf1 mfa2 and 1-maf1 and 1-maf2? This is an attempt to calculate the expected heterozygotes and from the R^2

## r^2 = D^2/(PA*PB*Pa*Pb)
# r2.stat = (D.stat / sqrt(datalist$maf1 * datalist$maf2 * (1-datalist$maf1) * (1-datalist$maf2)))^2

So it seems that maf1 and maf2 are PA, PB, Pa and Pb. But later in the frequency calculation, it uses the genotype frequencies (h00, h01, h10, h11). Maybe I'm misunderstanding something here, but it seems that there is a variable misplaced here. Is that correct?

Edit: No it's the same! so no problem here!

@beausoleilmo
Copy link
Author

beausoleilmo commented Aug 17, 2023

I was curious in looking at the raw beagle file and try to get a sense of the allele frequency in the file.

Basically, I wrote a naive (without model) script that finds the maximum probability for each genotype in each individual. When there is a 0.5 0.5 probability, I just take a random one (I only had a few). I then used the genotypes to get the number of each allele and divide by the total number of alleles (nb of individuals*2).

> all1/all.tot = 0.3568301 
> all2/all.tot = 0.6431699 

But when I look at the MAF from the ngsLD output file, I have "maf2 = 0.026971" for the same position... check line 8 for "chr8:33199177". How could that be?

@fgvieira
Copy link
Owner

Can you send a small example beagle file so I can reproduce your results?

@beausoleilmo
Copy link
Author

I sent the data over email.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants