
Adding new pruning script with faster run times #31

Merged
merged 10 commits into from
Aug 29, 2022

Conversation

@zjnolen (Contributor) commented Jul 19, 2022

Hey @fgvieira! I have finished the rewrite of your pruning method into graph-tool for Python, as we talked about in #25, and I think it will be helpful for those with run time issues. I've tested it on one of my smaller datasets of ~60,000 nodes and it runs in a little under an hour, compared to the Perl script, which took 6 hours. I've also found that this run time difference grows with the number of nodes. Overall, it seems to produce the same outcome, with the pruning returning the same sites:

prune_ngsLD.py.log

Starting with 63706 positions with 2597428 edges between them...
Pruning complete! Pruned graph contains 4042 positions.
prune_graph.pl.log

### Pruning graph (2597428 edges between 63706 nodes)
### Run finished (final graph with 4042 nodes)

Both used similar amounts of RAM, which unfortunately is still a bit high (usually approaching the size of the uncompressed ngsLD output), so I still tend to make beagle files in chunks, prune them, then merge them. This is also good for run time: even though the new script scales better with node count, large datasets can still take quite a long time.

I also tried to include most of the options found in your script in this one. I've tested them all, though not thoroughly, and I'm sure some could be coded better, but they at least seem to work. As far as the defaults go, I've been using them in a pipeline I'm working on for a while now with no problems.

Let me know what your thoughts on this are and if you have any suggestions or tests you'd like done!

@g-pacheco commented

Hello @zjnolen,

Thanks once again for sharing your Python version. Filipe asked me to test it a bit against his Perl version. So, a few things:

  1. I agree that it is indeed much faster (~12 h vs ~1 h 25 min in my case);
  2. The number of retained nodes is the same.

So far so good. However, I have noticed that the retained nodes per se are not the same. In my case, I have 20,639 retained nodes, but 2,565 are only found in the Perl output and 2,568 are only found in the Python output. Thus, we suspect there is some difference regarding which nodes the scripts are retaining. Could you please confirm that you also observe this difference? And do you think you could maybe adjust your script to retain nodes in the same way as the Perl version?
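The set comparison being described can be sketched like this (the file paths and function names below are made up for illustration, not part of either script):

```python
# Count retained nodes unique to each script's output. Paths are placeholders.
def load_nodes(path):
    """Read one retained node name per line into a set."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

def count_mismatches(perl_path, python_path):
    """Return (count only in the Perl output, count only in the Python output)."""
    perl_nodes = load_nodes(perl_path)
    py_nodes = load_nodes(python_path)
    return len(perl_nodes - py_nodes), len(py_nodes - perl_nodes)
```

Equal totals with nonzero mismatch counts, as reported here, would point toward a difference in how ties are broken rather than in how edges are filtered.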

Another thing is that I cannot run your script with the quoting=csv.QUOTE_NONE option, so my output gets some quotes like this: "NC_046966.1"_5792. May I ask if you would know what could be happening? I have tried different options (e.g. quoting=csv.QUOTE_NONE, quotechar="", escapechar="\\" OR quoting=3 from here), but I could not solve the issue.
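For reference, a minimal pandas sketch (the column name and node values are made up) that writes such names unquoted; with quoting=csv.QUOTE_NONE, pandas generally needs an explicit escapechar in case a field ever contains the separator or quote character:

```python
import csv
import io

import pandas as pd

# Made-up node names in the same chr_pos style as the ngsLD output.
pruned_df = pd.DataFrame({"node": ["NC_046966.1_5792", "NC_046966.1_6010"]})

buf = io.StringIO()
# QUOTE_NONE disables quoting entirely; escapechar covers the case where a
# field contains the separator or the quote character.
pruned_df.to_csv(buf, header=False, index=False, sep="\t",
                 quoting=csv.QUOTE_NONE, escapechar="\\")
print(buf.getvalue())
```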

Thanks a lot, George.

@zjnolen (Contributor, Author) commented Jul 21, 2022

Hi George,

Thanks for testing and catching that! Looking at my outputs, I see the same thing: some sites are unique to each. I imagine this relates to how the scripts decide which node to prune when two nodes have equal weight. That could produce the same number of nodes at the end, but with some of them being different sites. Looking back at the Perl script, there is more going on with the sorting when finding the heaviest node than in mine (mine finds the heaviest and prunes the first one the graph reports back). I tried having it sort by name and then pruning the first, but that just gave a different number of mismatching sites.

I'll have to do a bit of testing to figure out if this is really the cause. If the Perl script breaks ties by the index it assigns to nodes, matching it may not be possible if graph-tool doesn't index nodes the same way. But if the tie-break uses some metric that can be kept constant between the two scripts, there should be a way. If that's not it at all, I'll have to dig some more.
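A toy sketch of the tie-break idea being discussed (function and node names are made up): among nodes tied for the maximum summed edge weight, sort by name and prune the first:

```python
# Hypothetical tie-break: deterministic choice among equally heavy nodes.
def pick_node_to_prune(weights):
    """weights: dict mapping node name -> summed edge weight."""
    max_w = max(weights.values())
    # Sort the tied node names so the choice is reproducible across runs
    # and, in principle, across implementations.
    tied = sorted(name for name, w in weights.items() if w == max_w)
    return tied[0]

print(pick_node_to_prune({"chr1_500": 3.0, "chr1_120": 3.0, "chr1_900": 1.5}))
# -> chr1_120
```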

For the quoting, do you mean that you have to remove the quoting=csv.QUOTE_NONE in the pruned_df.to_csv line to get the code to run? Or is it still outputting with quotes even with that line there?

I'll be out of office for a week or so, so won't be able to get to this right away, but I'll update you once I've had a chance to look closer.

Best,
Zach

Moved setting edge weights to 1 for edge count based pruning until
after edges are filtered by weight, otherwise they don't get filtered.
Adding conversion of edge weights to integers at a certain precision to
match how prune_graph.pl handles it, which makes the max weight
calculated in this script match what it would find.

Resolve ties by sorting the tied nodes alphabetically and
picking the first, as in the other script.

Add a debug option to make seeing what gets pruned easier.
@zjnolen (Contributor, Author) commented Aug 15, 2022

Hey @g-pacheco,

I think this should now return the same sites the original does; it's been working on my data in tests so far. There were two things I missed from the original: one was that edge weights were converted to integers and truncated, and the other was that when there's a tie for the heaviest node, the first one alphabetically is pruned.
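The truncation step can be sketched roughly like this (the precision of 4 decimal places is an assumption for illustration, not necessarily what prune_graph.pl uses):

```python
# Hypothetical helper: convert a float edge weight to a truncated integer at
# a fixed decimal precision, so weights that only differ beyond that
# precision compare as equal, as described above.
def truncate_weight(w, precision=4):
    return int(w * 10 ** precision)

print(truncate_weight(0.123456))  # -> 1234
```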

Best,
Zach

@g-pacheco commented

Hej @zjnolen,

I hope you had a great break. Thanks a lot -- I will test it quickly with my data and I will let you know the results.

As for the "" issue, I think it has to do with the loaded Python version, because sometimes I get them and sometimes I do not. So I would not bother with that, to be honest. I will try to find out which Python version works well and let you know.

Best regards, George.

@g-pacheco commented

Hello @zjnolen,

Sorry for the time it took me to test this. Unfortunately, I am still getting a difference in the retained (pruneIN) nodes, although the count is again the same.

Perl: 142 nodes
Python: 142 nodes

But there are 25 nodes that are not shared. I am attaching here my test file (1K SNPs). Could you please maybe try to replicate this behaviour?

BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD.gz

Here is how I ran your script:

python New.py --input BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD.gz --max_dist 100000 --min_weight 0.2 --output BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT

Many thanks in advance, George.

Nodes were written to excluded list after removal, leading to wrong
nodes being put on the list.
@zjnolen (Contributor, Author) commented Aug 19, 2022

Hey @g-pacheco,

Thanks for the testing and the test data! Oddly, I could not replicate this: running both scripts on my end returned BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN with the same sites. To check, I used:

comm -3 <(sort BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.python) <(sort BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.perl)

That gave me no lines unique to either file. (I also switched to : as the chr:pos separator, both to make comparing simpler and to better handle contigs with _ in the name; that may also be why some Python versions quoted fields when no escape character was given.) I did find that the exclude files were different, though, and realized that I should have been writing nodes to the dropped list before dropping them; otherwise another node would be written. I've fixed that, and now the exclude files also match on my end. I've attached the outputs here; are they similar to either the Python or Perl outputs you get?

pruning_outs.tar.gz

Thanks,
Zach

@g-pacheco commented

Hello @zjnolen,

Thanks for the latest tweaks. I still get a difference of 25 nodes. May I ask precisely how you run your script? Maybe I am missing some flag?

Best, George.

@zjnolen (Contributor, Author) commented Aug 23, 2022

Hi @g-pacheco,

I think I see what's different here: the prune_graph.pl script has had an update since the v1.1.1 release, and the two versions actually return different sites due to implementation changes. Using v1.1.1, I also get a difference of 25 sites against the Python script; using the most recent commit, the difference is 0. I based the Python script on the most recent commit, so it makes sense that it matches that one. Do you see the same difference between the Perl versions? These were the commands I used:

./prune_ngsLD.py --input BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD --max_dist 100000 --min_weight 0.2 --output BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.python --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT.python

./prune_graph.pl --in_file BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD --max_kb_dist 100 --min_weight 0.2 --out BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.perl --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT.perl

Best,
Zach

@fgvieira fgvieira merged commit 9e3a716 into fgvieira:master Aug 29, 2022
@fgvieira (Owner) commented

@zjnolen thanks a lot for your contribution and patience testing it out! 😄

@g-pacheco commented

Hello @zjnolen,

I apologise for the confusion -- I did not know there was a new version of the prune_graph.pl script. I can now confirm that the results are indeed identical (I also tried with 10K SNPs). I believe Filipe might be accepting your pull request soon.

Thanks once again for sharing your code.

Best regards, George.

@zjnolen (Contributor, Author) commented Aug 30, 2022

Thanks for all the testing help; I think it's in much better shape now! I had not realized the Perl script had changed either until the last test 😅 I only noticed when I started getting differences between my recent local install and our cluster's v1.1.1 install as well.
