
Adding new pruning script with faster run times #31

Merged
merged 10 commits into from
Aug 29, 2022

Conversation

@zjnolen (Contributor) commented Jul 19, 2022

Hey @fgvieira! I have finished the rewrite of your pruning method into graph-tool for Python, as we talked about in #25, and I think it will be helpful for those with run time issues. I've tested it on one of my smaller datasets of ~60,000 nodes and it runs in a little under an hour, compared to the Perl script, which took 6 hours. I've also found that this run time difference grows with the number of nodes. Overall, it seems to produce the same outcome, with the pruning returning the same sites:

prune_ngsLD.py.log

Starting with 63706 positions with 2597428 edges between them...
Pruning complete! Pruned graph contains 4042 positions.
prune_graph.pl.log

### Pruning graph (2597428 edges between 63706 nodes)
### Run finished (final graph with 4042 nodes)

Both used similar amounts of RAM, which unfortunately is still a bit high (usually approaching the size of the uncompressed ngsLD output), so I still tend to make beagle files in chunks, prune them, then merge them. This is also good for run time: even though the new script scales better with node count, large datasets can still take quite a long time.

I also tried to include most of the options found in your script in this one. I've tested them all, though not thoroughly, and I'm sure some could be coded better, but they at least seem to work. As far as the defaults go, I've been using them in a pipeline I'm working on for a while now with no problems.

Let me know what your thoughts on this are and if you have any suggestions or tests you'd like done!

@g-pacheco commented

Hello @zjnolen,

Thanks once again for sharing your Python version. Filipe asked me to test it a bit against his Perl version. So, a few things:

  1. I agree that it is indeed much faster (~12 h vs ~1 h 25 min in my case);
  2. The number of retained nodes is the same.

So far so good. However, I have noticed that the retained nodes per se are not the same. In my case, I have 20,639 retained nodes, but 2,565 are only found in the Perl output and 2,568 are only found in the Python output. Thus, we suspect there is some difference regarding which nodes the scripts are retaining. Could you please confirm that you also observe this difference? And do you think you could maybe adjust your script to retain nodes in the same way as the Perl version?
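The set comparison being described can be sketched like this (the file paths and function names below are made up for illustration, not part of either script):

```python
# Count retained nodes unique to each script's output. Paths are placeholders.
def load_nodes(path):
    """Read one retained node name per line into a set."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

def count_mismatches(perl_path, python_path):
    """Return (count only in the Perl output, count only in the Python output)."""
    perl_nodes = load_nodes(perl_path)
    py_nodes = load_nodes(python_path)
    return len(perl_nodes - py_nodes), len(py_nodes - perl_nodes)
```

Equal totals with nonzero mismatch counts, as reported here, would point toward a difference in how ties are broken rather than in how edges are filtered.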

Another thing is that I cannot run your script with the quoting=csv.QUOTE_NONE option, so my output gets some quotes like this: "NC_046966.1"_5792. May I ask if you would know what could be happening? I have tried different options (e.g. quoting=csv.QUOTE_NONE, quotechar="", escapechar="\\" OR quoting=3 from here), but I could not solve the issue.
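For reference, a minimal pandas sketch (the column name and node values are made up) that writes such names unquoted; with quoting=csv.QUOTE_NONE, pandas generally needs an explicit escapechar in case a field ever contains the separator or quote character:

```python
import csv
import io

import pandas as pd

# Made-up node names in the same chr_pos style as the ngsLD output.
pruned_df = pd.DataFrame({"node": ["NC_046966.1_5792", "NC_046966.1_6010"]})

buf = io.StringIO()
# QUOTE_NONE disables quoting entirely; escapechar covers the case where a
# field contains the separator or the quote character.
pruned_df.to_csv(buf, header=False, index=False, sep="\t",
                 quoting=csv.QUOTE_NONE, escapechar="\\")
print(buf.getvalue())
```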

Thanks a lot, George.

@zjnolen (Contributor, Author) commented Jul 21, 2022

Hi George,

Thanks for testing and catching that! Looking at my outputs, I see the same thing: some sites are unique to each. I imagine this relates to how the scripts decide which node to prune when two nodes have equal weight. That could produce the same number of nodes at the end, but with some of them being different sites. Looking back at the Perl script, there is more going on with the sorting when finding the heaviest node than in mine (mine finds the heaviest and prunes the first one the graph reports back). I tried having it sort by name and then pruning the first, but that just gave a different number of mismatching sites.

I'll have to do a bit of testing to figure out if this is really the cause. If the Perl script breaks ties by the index it assigns to nodes, matching it may not be possible if graph-tool doesn't index nodes the same way. But if the tie-break uses some metric that can be kept constant between the two scripts, there should be a way. If that's not it at all, I'll have to dig some more.
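A toy sketch of the tie-break idea being discussed (function and node names are made up): among nodes tied for the maximum summed edge weight, sort by name and prune the first:

```python
# Hypothetical tie-break: deterministic choice among equally heavy nodes.
def pick_node_to_prune(weights):
    """weights: dict mapping node name -> summed edge weight."""
    max_w = max(weights.values())
    # Sort the tied node names so the choice is reproducible across runs
    # and, in principle, across implementations.
    tied = sorted(name for name, w in weights.items() if w == max_w)
    return tied[0]

print(pick_node_to_prune({"chr1_500": 3.0, "chr1_120": 3.0, "chr1_900": 1.5}))
# -> chr1_120
```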

For the quoting, do you mean that you have to remove the quoting=csv.QUOTE_NONE in the pruned_df.to_csv line to get the code to run? Or is it still outputting with quotes even with that line there?

I'll be out of office for a week or so, so won't be able to get to this right away, but I'll update you once I've had a chance to look closer.

Best,
Zach

Moved setting edge weights to 1 for edge count based pruning until
after edges are filtered by weight, otherwise they don't get filtered.
Adding conversion of edge weights to integers at a certain precision to
match how prune_graph.pl handles it, which makes the max weight
calculated in this script match what it would find.

Resolve ties by sorting the tied nodes alphabetically and
picking the first, as in the other script.

Add a debug option to make seeing what gets pruned easier.
@zjnolen (Contributor, Author) commented Aug 15, 2022

Hey @g-pacheco,

I think this should now return the same sites the original does; it's been working on my data in tests so far. There were two things I missed from the original: one was that edge weights were converted to integers and truncated, and the other was that when there's a tie for the heaviest node, the first one alphabetically is pruned.
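The truncation step can be sketched roughly like this (the precision of 4 decimal places is an assumption for illustration, not necessarily what prune_graph.pl uses):

```python
# Hypothetical helper: convert a float edge weight to a truncated integer at
# a fixed decimal precision, so weights that only differ beyond that
# precision compare as equal, as described above.
def truncate_weight(w, precision=4):
    return int(w * 10 ** precision)

print(truncate_weight(0.123456))  # -> 1234
```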

Best,
Zach

@g-pacheco commented

Hej @zjnolen,

I hope you had a great break. Thanks a lot -- I will test it quickly with my data and I will let you know the results.

As for the "" issue, I think it has to do with the loaded Python version, because sometimes I get them and sometimes I do not. So I would not bother with that, to be honest. I will try to find out which Python version works well and let you know.

Best regards, George.

@g-pacheco commented

Hello @zjnolen,

Sorry for the time it took me to test this. Unfortunately, I am still getting a difference in the retained (pruneIN) nodes, although the count is again the same.

Perl: 142 nodes
Python: 142 nodes

But there are 25 nodes that are not shared. I am attaching here my test file (1K SNPs). Could you please maybe try to replicate this behaviour?

BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD.gz

Here is how I ran your script:

python New.py --input BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD.gz --max_dist 100000 --min_weight 0.2 --output BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT

Many thanks in advance, George.

Nodes were written to excluded list after removal, leading to wrong
nodes being put on the list.
@zjnolen (Contributor, Author) commented Aug 19, 2022

Hey @g-pacheco,

Thanks for the testing and the test data! Oddly, I could not replicate this: running both scripts on my end returned BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN with the same sites. To check, I used:

comm -3 <(sort BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.python) <(sort BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.perl)

That gave me no lines unique to either file. (I also switched to : as the chr:pos separator, both to make comparing simpler and to better handle contigs with _ in the name; that may also be why some Python versions quoted fields when no escape character was given.) I did find that the exclude files were different, though, and realized that I should have been writing nodes to the dropped list before dropping them; otherwise another node would be written. I've fixed that, and now the exclude files also match on my end. I've attached the outputs here; are they similar to either the Python or Perl outputs you get?

pruning_outs.tar.gz

Thanks,
Zach

@g-pacheco commented

Hello @zjnolen,

Thanks for the latest tweaks. I still get a difference of 25 nodes. May I ask precisely how you run your script? Maybe I am missing some flag?

Best, George.

@zjnolen (Contributor, Author) commented Aug 23, 2022

Hi @g-pacheco,

I think I see what's different here: the prune_graph.pl script has had an update since the v1.1.1 release, and the two versions actually return different sites due to implementation changes. Using v1.1.1, I also get a difference of 25 sites against the Python script; using the most recent commit, the difference is 0. I based the Python script on the most recent commit, so it makes sense that it matches that one. Do you see the same difference between the Perl versions? These were the commands I used:

./prune_ngsLD.py --input BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD --max_dist 100000 --min_weight 0.2 --output BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.python --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT.python

./prune_graph.pl --in_file BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD --max_kb_dist 100 --min_weight 0.2 --out BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.perl --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT.perl

Best,
Zach

@fgvieira fgvieira merged commit 9e3a716 into fgvieira:master Aug 29, 2022
@fgvieira (Owner) commented

@zjnolen thanks a lot for your contribution and patience testing it out! 😄

@g-pacheco commented

Hello @zjnolen,

I apologise for the confusion -- I did not know there was a new version of the prune_graph.pl script. I can now confirm that the results are indeed identical (I also tried with 10K SNPs). I believe Filipe might be accepting your pull request soon.

Thanks once again for sharing your code.

Best regards, George.

@zjnolen (Contributor, Author) commented Aug 30, 2022

Thanks for all the testing help; I think it's in much better shape now! I had not realized the Perl script had changed either until the last test 😅 I only noticed when I started getting differences between my recent local install and our cluster's v1.1.1 install as well.
