Adding new pruning script with faster run times #31
Conversation
Adding a script based on Filipe's pruning method, rewritten with graph-tool in Python, which runs faster in this case. It does not yet have all the options available in the original pruning script.
Hello @zjnolen, Thanks once again for sharing your Python version. Filipe asked me to test it a bit against his Perl version. So, a few things:

So far so good. However, I have noticed that the retained nodes per se are not the same. In my case, I have 20,639 retained nodes, but 2,565 are found only in the Perl output and 2,568 only in the Python output. Thus, we suspect there is some difference in which nodes the two scripts retain. Could you please confirm that you also observe this difference? And do you think you could maybe adjust your script to retain nodes in the same way as the Perl version?

Another thing is that I cannot run your script with the

Thanks a lot, George.
Hi George, Thanks for testing and catching that! Looking at my outputs, I see the same: some sites are unique to each. I imagine this relates to how the scripts decide which node to prune when two nodes have equal weight. I could imagine ending up with the same number of nodes this way, but with some of them being different sites. Looking back at the Perl script, there is more going on with the sorting when finding the heaviest node than in mine (which finds the heaviest and prunes the first node the graph reports back). I've tried again, sorting by name and then pruning the first, but this just results in a different number of mismatching sites, so I'll have to do a bit of testing to figure out if this is really the cause. If the Perl script breaks ties by the index it assigns, matching it may not be possible, since graph-tool may not index nodes the same way the Perl script does. But if ties are broken by some metric that can be kept constant between the two scripts, there should be a way. If that's not it at all, then I'll have to dig some more.

For the quoting, do you mean that you have to remove the

I'll be out of office for a week or so, so I won't be able to get to this right away, but I'll update you once I've had a chance to look closer. Best,
Moved setting edge weights to 1 (for edge-count based pruning) until after edges are filtered by weight; otherwise they don't get filtered.
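A plain-Python sketch (illustrative, not the actual graph-tool code; the threshold and edge data are made up) of the ordering bug this commit fixes: if the weights are overwritten with 1 before the minimum-weight filter runs, every edge passes the filter and nothing is dropped.

```python
MIN_WEIGHT = 0.2  # assumed filtering threshold

edges = [("a", "b", 0.5), ("b", "c", 0.1), ("a", "c", 0.3)]

# Correct order: filter on the original weights first...
kept = [(u, v, w) for u, v, w in edges if w >= MIN_WEIGHT]
# ...then set every weight to 1 for edge-count based pruning.
kept = [(u, v, 1) for u, v, _ in kept]

print(kept)  # -> [('a', 'b', 1), ('a', 'c', 1)]
```

Reversing the two steps would keep all three edges, since every overwritten weight of 1 passes the threshold.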
Adding conversion of edge weights to integers at a set precision to match how prune_graph.pl handles them, so that the max weight calculated in this script matches what it would find. Resolve ties by sorting the tied nodes alphabetically and picking the first, as in the other script. Add a debug option to make it easier to see what gets pruned.
Hey @g-pacheco, I think this should now return the same sites the original does; it's been working on my data in tests so far. There were two things I missed from the original. One was that edge weights were converted to integers and truncated; the other was that when there's a tie for the heaviest node, the first one alphabetically is pruned. Best,
Hej @zjnolen, I hope you had a great break. Thanks a lot -- I will test it quickly with my data and I will let you know the results. As for the

Best regards, George.
Hello @zjnolen, Sorry for the time it took me to test this. Unfortunately, I am still getting a difference in the nodes pruned (the number is again the same).

Perl: 142 nodes

But there are 25 nodes that are not shared. I am attaching my test file (1K SNPs) here. Could you please try to replicate this behaviour? BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD.gz

Here is how I ran your script:

Many thanks in advance, George.
Nodes were written to the excluded list after removal, leading to the wrong nodes being put on the list.
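A minimal sketch of the bug this commit fixes, using a plain list rather than graph-tool (the node names and `prune` helper are illustrative): removing a node first and writing its name afterwards can pick up the wrong node, because removal shifts the remaining indices. The fix is to record the name before the removal.

```python
nodes = ["a", "b", "c", "d"]
excluded = []

def prune(idx):
    excluded.append(nodes[idx])  # capture the name first...
    del nodes[idx]               # ...then remove the node

prune(1)  # prunes "b"
prune(1)  # after the shift, index 1 now points at "c"
print(excluded)  # -> ['b', 'c']
```

Appending to `excluded` after the `del` instead would record whichever node slid into the freed index, producing an excluded list that doesn't match what was actually pruned.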
Hey @g-pacheco, Thanks for the testing and test data! Oddly, I could not replicate this. I ran it through both scripts on my end and it returned

Which gave me no lines unique to either file (I also switched over to

Thanks,
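A hedged sketch (not the exact commands from the thread, and the site names are made up) of the kind of check being described here: compare the two pruned-site lists and count lines unique to either output, analogous to running `comm -3` on sorted files.

```python
# Stand-ins for the site lists produced by the Perl and Python scripts:
perl_sites = {"chr1:100", "chr1:200", "chr1:300"}
python_sites = {"chr1:100", "chr1:200", "chr1:300"}

only_perl = perl_sites - python_sites      # sites only the Perl script kept
only_python = python_sites - perl_sites    # sites only the Python script kept

print(len(only_perl), len(only_python))  # -> 0 0
```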
Hello @zjnolen, Thanks for the last tweaks. I still get a difference of 25 nodes. May I ask how precisely you run your script? Maybe I am missing some flag? Best, George.
Hi @g-pacheco, I think I see what's different here: the prune_graph.pl script has had an update since the v1.1.1 release, and the two versions actually return different sites due to implementation changes. Using v1.1.1, I also get a difference of 25 sites between it and the Python script. However, if I use the most recent commit, I get a difference of 0 from the Python script. I based the Python one on the most recent commit, so it makes sense that it matches that one. Do you see the same difference between script versions? These were the commands I used:
Best,
@zjnolen thanks a lot for your contribution and patience testing it out! 😄
Hello @zjnolen, I apologise for this confusion -- I did not know that there was a new version of the

Thanks once again for sharing your code. Best regards, George.
Thanks for all the testing help; I think it's in much better shape now! I had not realized it had changed either until the last test 😅 I only noticed when I started getting differences between my recent local install and our cluster's v1.1.1 install as well.
Hey @fgvieira! I have finished the rewrite of your pruning method into graph-tool for Python, as we talked about in #25, and I think it will be helpful for those with run time issues. I've tested it on one of my smaller datasets of ~60,000 nodes and it runs in a little under an hour, compared to the Perl script, which took 6 hours. I've also found that this run time difference can increase with the number of nodes. Overall, it seems to result in the same outcome, with the pruning returning the same sites:
Both used similar amounts of RAM, which unfortunately is still a bit high (usually approaching the size of the uncompressed ngsLD output), so I still tend to make beagle files in chunks, prune them, then merge them. This also helps with run time, as even though the script can be faster with many nodes, it can still take quite a long time.
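The chunk-then-merge workflow described above can be sketched roughly as follows (the file names and site format are illustrative assumptions, not actual outputs): prune each chunk's ngsLD output separately, then concatenate the per-chunk lists of retained sites.

```python
chunk_files = ["chunk1.pruned.sites", "chunk2.pruned.sites"]

# Stand-ins for the per-chunk pruning outputs:
with open("chunk1.pruned.sites", "w") as f:
    f.write("chr1:100\nchr1:500\n")
with open("chunk2.pruned.sites", "w") as f:
    f.write("chr2:300\n")

# Merge the retained-site lists from all chunks into one file:
with open("all.pruned.sites", "w") as out:
    for path in chunk_files:
        with open(path) as f:
            out.write(f.read())

print(open("all.pruned.sites").read())
```

Because pruning is quadratic-ish in practice, working on smaller chunks both caps the memory footprint and keeps each pruning run short, at the cost of not pruning across chunk boundaries.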
I also tried to include most of the options found in your script in this one. I've tested them all, though not thoroughly, and I'm sure some could be coded better, but they at least seem to work. As far as the defaults go, I've been using them in a pipeline I'm working on for a while now with no problems.
Let me know what your thoughts on this are and if you have any suggestions or tests you'd like done!