crossMapSeries vs crossMapParallel #60

Closed
slambrechts opened this issue Jun 29, 2021 · 6 comments
Labels: documentation (Additional documentation required), question (Further information is requested)

Comments

@slambrechts

Hi,

I tried bash metaGEM.sh -t crossMap -c 48 --local, but it didn't work. I see that there are now 2 different versions of crossMap in the metaGEM core workflow instead. I'm not entirely sure what the difference is between these 2 methods, and was wondering if there is any documentation detailing the differences?

Kind regards,
Sam

@franciscozorrilla
Owner

Hi Sam,

Thanks for raising the issue. This topic does indeed need some additional documentation in the wiki; I will add it to the to-do list!

At the moment, you can see some documentation in the Snakefile comment/message sections, for example:

metaGEM/Snakefile

Lines 390 to 406 in d81186a

rule crossMapSeries:
    input:
        contigs = rules.megahit.output,
        reads = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}'
    output:
        concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
        metabat = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'),
        maxbin = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')
    benchmark:
        f'{config["path"]["root"]}/{config["folder"]["benchmarks"]}/{{IDs}}.crossMapSeries.benchmark.txt'
    message:
        """
        Cross map in series:
        Use this approach to provide all 3 binning tools with cross-sample coverage information.
        Will likely provide superior binning results, but may not be feasible for datasets with
        many large samples such as the tara oceans dataset.
        """

metaGEM/Snakefile

Lines 538 to 552 in d81186a

rule crossMapParallel:
    input:
        index = f'{config["path"]["root"]}/{config["folder"]["kallistoIndex"]}/{{focal}}/index.kaix',
        R1 = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}/{{IDs}}/{{IDs}}_R1.fastq.gz',
        R2 = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}/{{IDs}}/{{IDs}}_R2.fastq.gz'
    output:
        directory(f'{config["path"]["root"]}/{config["folder"]["kallisto"]}/{{focal}}/{{IDs}}')
    benchmark:
        f'{config["path"]["root"]}/{config["folder"]["benchmarks"]}/{{focal}}.{{IDs}}.crossMapParallel.benchmark.txt'
    message:
        """
        This rule is an alternative implementation of crossMapSeries, using kallisto
        instead of bwa for mapping operations. This implementation is recommended for
        large datasets.
        """

In short, the crossMapSeries rule will launch 1 job per sample, and in each of those jobs it will run a for loop to map each set of qfiltered reads against one assembly. On the other hand, the crossMapParallel rule will submit one job per mapping operation (e.g. mapping reads from sample X against assembly Y). On a more practical note, within each job crossMapSeries requires you to temporarily store one bam file per sample in your dataset before generating the contig coverage across samples; this can quickly become impractical for large datasets.
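
As a rough illustration of the two job layouts described above, the following sketch enumerates the jobs each rule would submit. It is not code from metaGEM: the sample IDs and the helper names `series_jobs` and `parallel_jobs` are made up, and it assumes every set of reads is mapped against every assembly, including the sample's own.

```python
from itertools import product


def series_jobs(samples):
    """crossMapSeries layout: one job per assembly, each looping over all read sets."""
    return {assembly: [(reads, assembly) for reads in samples] for assembly in samples}


def parallel_jobs(samples):
    """crossMapParallel layout: one job per (reads, assembly) mapping operation."""
    return [(reads, assembly) for assembly, reads in product(samples, repeat=2)]


samples = ["sampleA", "sampleB", "sampleC"]  # hypothetical sample IDs
series = series_jobs(samples)
parallel = parallel_jobs(samples)

# Same total number of mapping operations, split into jobs differently.
print(f"series: {len(series)} jobs x {len(samples)} mappings each")
print(f"parallel: {len(parallel)} jobs x 1 mapping each")
```

Under these assumptions both layouts perform the same number of mapping operations; the difference is whether they are grouped into a few long jobs or many short ones.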

As you can see, crossMapSeries is the default option because it can generate contig coverage for all three binners, whereas crossMapParallel is better at scaling mapping operations for large datasets but only generates contig coverage across samples for CONCOCT. In the metaGEM paper we used crossMapParallel for the TARA Oceans dataset of 246 paired-end samples. If you have not already, I would recommend reading the methods section of the metaGEM preprint, in particular the "Contig coverage estimation and binning" subsection, which describes the differences between the two methods.

You may also find the discussion in issue #57 relevant.
Let me know if you have further questions regarding this topic!

Best wishes,
Francisco

franciscozorrilla self-assigned this on Jun 29, 2021
franciscozorrilla added the question (Further information is requested) and documentation (Additional documentation required) labels on Jun 29, 2021
franciscozorrilla pinned this issue on Jun 29, 2021
@slambrechts
Author

Hi Francisco,

Ok I understand now.

> On a more practical note, within each job crossMapSeries requires you to temporarily store one bam file per sample in your dataset before generating the contig coverage across samples; this can quickly become impractical for large datasets.

=> This was going to be my next question. In the scratch intermediate files folder I noticed these temporary crossMap files take up 226 GB for the one sample that has already finished. And indeed, since I have 43 samples, this will become impractical. After crossMapSeries has finished running for a specific sample, can I delete the corresponding folder for that sample from the scratch intermediate files folder?

Best Wishes,
Sam

@franciscozorrilla
Owner

Absolutely, you can (and probably should) delete most folders in scratch/ after jobs have finished running. This is especially true for the crossMap folder since it will be storing N^2 sorted bam files after finishing, where N = number of samples in your dataset.
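
To put numbers on this, here is a back-of-the-envelope sketch using the figures reported earlier in the thread (43 samples, 226 GB of crossMap files for the one finished sample). It assumes every sample produces roughly the same amount of temporary data as that one, which may not hold in practice; the function names are illustrative only.

```python
def bam_file_count(n_samples):
    """Sorted bam files left behind if nothing is deleted: one per (reads, assembly) pair."""
    return n_samples ** 2


def crossmap_scratch_gb(n_samples, gb_per_finished_sample):
    """Total crossMap scratch usage, assuming every sample matches the reported size."""
    return n_samples * gb_per_finished_sample


n_samples = 43       # dataset size reported in the thread
gb_per_sample = 226  # crossMap files for the one finished sample

total_gb = crossmap_scratch_gb(n_samples, gb_per_sample)
print(f"{bam_file_count(n_samples)} bam files, about {total_gb} GB ({total_gb / 1000:.1f} TB)")
# prints "1849 bam files, about 9718 GB (9.7 TB)"
```

Deleting each sample's crossMap folder as soon as its job finishes keeps the peak usage closer to the per-sample figure than to this total.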

Best,
Francisco

@slambrechts
Author

ok great, thanks.

So I can also safely delete the assemblies folder from the scratch/ intermediate files folder?

@franciscozorrilla
Owner

Yes, you can safely delete the assemblies/ and all other subfolders in scratch/ after the jobs have finished running. The files remaining in the scratch/ subfolders are mostly useful for troubleshooting if jobs fail, or if you want to extract any intermediate result files that are not used directly by metaGEM.
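
One possible way to script this cleanup is sketched below. It is not part of metaGEM: the `clean_scratch` helper is hypothetical, and the `scratch` path and the `crossMap` and `assemblies` subfolder names are assumptions based on the folders mentioned in this thread, so adjust them to match your own configuration. It defaults to a dry run and should only be used after the relevant jobs have finished.

```python
import shutil
from pathlib import Path


def clean_scratch(scratch_dir, subfolders, dry_run=True):
    """Remove the named subfolders of scratch_dir and return the paths affected.

    With dry_run=True (the default) nothing is deleted; the paths that would
    be removed are only collected and returned.
    """
    affected = []
    for name in subfolders:
        target = Path(scratch_dir) / name
        if target.is_dir():
            if not dry_run:
                shutil.rmtree(target)
            affected.append(target)
    return affected


if __name__ == "__main__":
    # Hypothetical layout: scratch/crossMap and scratch/assemblies.
    for path in clean_scratch("scratch", ["crossMap", "assemblies"]):
        print("would remove", path)
```

Calling `clean_scratch(..., dry_run=False)` performs the actual deletion, so it is worth checking the dry-run listing first.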

@slambrechts
Author

ok clear, thank you
