crossMapseries vs crossMapParallel #60

slambrechts · 2021-06-29T00:07:33Z

Hi,

I tried running bash metaGEM.sh -t crossMap -c 48 --local, but it didn't work. I see that the metaGEM core workflow now has two different versions of crossMap instead. I'm not entirely sure what the difference between these two methods is, and was wondering whether there is any documentation detailing it?

Kind regards,
Sam


franciscozorrilla · 2021-06-29T10:20:21Z

Hi Sam,

Thanks for raising the issue. This topic does indeed need some additional documentation in the wiki; I will add it to the to-do list!

At the moment, you can see some documentation in the Snakefile comment/message sections, for example:

metaGEM/Snakefile

Lines 390 to 406 in d81186a

    
    rule crossMapSeries:
        input:
            contigs = rules.megahit.output,
            reads = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}'
        output:
            concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
            metabat = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'),
            maxbin = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')
        benchmark:
            f'{config["path"]["root"]}/{config["folder"]["benchmarks"]}/{{IDs}}.crossMapSeries.benchmark.txt'
        message:
            """
            Cross map in seies:
            Use this approach to provide all 3 binning tools with cross-sample coverage information.
            Will likely provide superior binning results, but may no be feasible for datasets with
            many large samples such as the tara oceans dataset.
            """

metaGEM/Snakefile

Lines 538 to 552 in d81186a

    
    rule crossMapParallel:
        input:
            index = f'{config["path"]["root"]}/{config["folder"]["kallistoIndex"]}/{{focal}}/index.kaix',
            R1 = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}/{{IDs}}/{{IDs}}_R1.fastq.gz',
            R2 = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}/{{IDs}}/{{IDs}}_R2.fastq.gz'
        output:
            directory(f'{config["path"]["root"]}/{config["folder"]["kallisto"]}/{{focal}}/{{IDs}}')
        benchmark:
            f'{config["path"]["root"]}/{config["folder"]["benchmarks"]}/{{focal}}.{{IDs}}.crossMapParallel.benchmark.txt'
        message:
            """
            This rule is an alternative implementation of crossMapSeries, using kallisto
            instead of bwa for mapping operations. This implementation is recommended for
            large datasets.
            """

In short, the crossMapSeries rule launches one job per sample, and each of those jobs runs a for loop that maps every set of qfiltered reads against that sample's assembly. The crossMapParallel rule, on the other hand, submits one job per mapping operation (e.g. mapping reads from sample X against assembly Y). On a more practical note, crossMapSeries requires each job to temporarily store one bam file per sample in your dataset before generating the contig coverage across samples; this can quickly become impractical for large datasets.
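To make the difference in job structure concrete, here is a minimal Python sketch. It is not metaGEM code: the function names and sample IDs are hypothetical, and it only enumerates which mapping operations end up in which job, assuming one assembly per sample as described above.

```python
from itertools import product


def series_jobs(samples):
    """crossMapSeries-style layout: one job per focal assembly,
    each job looping over every sample's reads."""
    return {focal: [(reads, focal) for reads in samples] for focal in samples}


def parallel_jobs(samples):
    """crossMapParallel-style layout: one job per (reads, assembly) pair."""
    return [(reads, focal) for focal, reads in product(samples, samples)]


samples = ["sampleA", "sampleB", "sampleC"]  # hypothetical sample IDs
series = series_jobs(samples)
parallel = parallel_jobs(samples)

print(len(series), "series jobs, each with", len(samples), "mappings")
print(len(parallel), "parallel jobs, each with 1 mapping")
```

Both layouts cover the same N x N mapping operations; they differ only in how those operations are grouped into jobs, which is why the parallel layout spreads more easily across a cluster.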

As you can see, crossMapSeries is the default option because it can generate contig coverage for all three binners, whereas crossMapParallel is better at scaling mapping operations for large datasets but only generates contig coverage across samples for CONCOCT. In the metaGEM paper we used crossMapParallel for the TARA Oceans dataset of 246 paired-end samples. In case you have not already, I would recommend reading the methods section of the metaGEM preprint, in particular the Contig coverage estimation and binning subsection, which describes the differences between the methods.

You may also find the discussion in issue #57 relevant.
Let me know if you have further questions regarding this topic!

Best wishes,
Francisco

slambrechts · 2021-06-29T12:50:08Z

Hi Francisco,

Ok I understand now.

On a more practical note, crossMapSeries requires each job to temporarily store one bam file per sample in your dataset before generating the contig coverage across samples; this can quickly become impractical for large datasets.

=> This was going to be my next question. In the scratch intermediate files folder I noticed that these temporary crossMap files take up 226 GB for the one sample that has already finished. And indeed, since I have 43 samples, this will become impractical. After crossMapSeries has finished running for a specific sample, can I delete the corresponding folder for that sample from the scratch intermediate files folder?

Best Wishes,
Sam

franciscozorrilla · 2021-06-29T13:16:41Z

Absolutely, you can (and probably should) delete most folders in scratch/ after jobs have finished running. This is especially true for the crossMap folder since it will be storing N^2 sorted bam files after finishing, where N = number of samples in your dataset.
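As a rough back-of-the-envelope check of the numbers in this thread, the sketch below computes the N^2 file count and the scratch usage that would accumulate if nothing were deleted. It assumes, purely for illustration, that every sample's crossMap folder is about as large as the 226 GB reported for the first finished sample; real sizes will vary with read depth and assembly size. The function name is hypothetical.

```python
def crossmap_scratch_estimate(n_samples, gb_per_sample_job):
    """Return (number of sorted bam files, total GB) left in scratch
    if no crossMap folder is deleted after its job finishes."""
    n_bam_files = n_samples ** 2  # N read sets mapped against N assemblies
    total_gb = n_samples * gb_per_sample_job  # assumes equal-sized jobs
    return n_bam_files, total_gb


n_bam, total_gb = crossmap_scratch_estimate(n_samples=43, gb_per_sample_job=226)
print(f"{n_bam} bam files, roughly {total_gb / 1000:.1f} TB of scratch space")
```

Under that assumption, 43 samples would leave 1849 bam files and close to 10 TB behind, which is why clearing each sample's folder once its job has finished is worthwhile.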

Best,
Francisco

slambrechts · 2021-06-29T13:24:39Z

ok great, thanks.

So I can also safely delete the assemblies folder from the scratch/ intermediate files folder?

franciscozorrilla · 2021-06-29T13:54:43Z

Yes, you can safely delete the assemblies/ folder and all other subfolders in scratch/ after the jobs have finished running. The files remaining in the scratch/ subfolders are mostly useful for troubleshooting if jobs fail, or if you want to extract any intermediate result files that are not used directly by metaGEM.
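For anyone who wants to script this cleanup, a minimal sketch follows. It is not part of metaGEM: the helper name and its dry_run parameter are hypothetical, the scratch path has to be supplied by you, and it should only be run once no jobs are still writing to scratch/. By default it only reports what it would remove.

```python
import shutil
from pathlib import Path


def clean_scratch(scratch_dir, dry_run=True):
    """Remove every subfolder of scratch_dir and return their names.

    With dry_run=True (the default) nothing is deleted; the names of the
    subfolders that would be removed are returned for inspection.
    """
    handled = []
    for sub in sorted(Path(scratch_dir).iterdir()):
        if sub.is_dir():
            if not dry_run:
                shutil.rmtree(sub)  # deletes the folder and all its contents
            handled.append(sub.name)
    return handled
```

Calling clean_scratch(path) first and checking the returned list before repeating the call with dry_run=False is a cheap safeguard against pointing the helper at the wrong directory.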

slambrechts · 2021-06-29T14:04:36Z

ok clear, thank you

franciscozorrilla self-assigned this Jun 29, 2021

franciscozorrilla added the question (Further information is requested) and documentation (Additional documentation required) labels Jun 29, 2021

franciscozorrilla pinned this issue Jun 29, 2021

slambrechts closed this as completed Jun 29, 2021

yhbae6022 mentioned this issue Mar 18, 2022

Question about using crossMapParallel #102

Closed
