Dear Matt Olm,
I have a dataset of 169k MAGs. With the default dRep dereplicate parameters the run exceeds the available memory, so I used the following parameters instead:
dRep dereplicate f__Oscillospiraceae.res -g f__Oscillospiraceae.list -comp 70 -con 5 -d --genomeInfo quality_report_without_0.csv -p 96 --skip_plots -pa 0.8 --low_ram_primary_clustering --primary_chunksize 20000
When running with these parameters, I get the following error:
..:: dRep dereplicate Step 1. Filter ::..
Will filter the genome list
Loading genomes from a list
169,453 genomes were input to dRep
Calculating genome info of genomes
100.00% of genomes passed length filtering
409.86% of genomes passed checkM filtering
..:: dRep dereplicate Step 2. Cluster ::..
Running primary clustering
Running pair-wise MASH clustering
Will split genomes into 9 groups for primary clustering
Traceback (most recent call last):
File "/usr/local/bin/dRep", line 32, in <module>
Controller().parseArguments(args)
File "/usr/local/lib/python3.10/dist-packages/drep/controller.py", line 100, in parseArguments
self.dereplicate_operation(**vars(args))
File "/usr/local/lib/python3.10/dist-packages/drep/controller.py", line 48, in dereplicate_operation
drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
File "/usr/local/lib/python3.10/dist-packages/drep/d_workflows.py", line 37, in dereplicate_wrapper
drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/drep/d_cluster/controller.py", line 184, in d_cluster_wrapper
GenomeClusterController(workDirectory, **kwargs).main()
File "/usr/local/lib/python3.10/dist-packages/drep/d_cluster/controller.py", line 32, in main
self.run_primary_clustering()
File "/usr/local/lib/python3.10/dist-packages/drep/d_cluster/controller.py", line 100, in run_primary_clustering
Mdb, Cdb, cluster_ret = drep.d_cluster.compare_utils.all_vs_all_MASH(self.Bdb, self.wd.get_dir('MASH'), **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/drep/d_cluster/compare_utils.py", line 110, in all_vs_all_MASH
genome_chunks = run_mash_on_genome_chunks(genome_chunks, mash_exe, sketch_folder, MASH_folder, logdir, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/drep/d_cluster/compare_utils.py", line 180, in run_mash_on_genome_chunks
drep.thread_cmds(cmds, logdir=logdir, t=int(p))
File "/usr/local/lib/python3.10/dist-packages/drep/__init__.py", line 56, in thread_cmds
pool.map(thread_cmd_wrapper, tups)
File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.10/dist-packages/drep/__init__.py", line 51, in thread_cmd_wrapper
run_cmd(*tup)
File "/usr/local/lib/python3.10/dist-packages/drep/__init__.py", line 47, in run_cmd
call(cmd,stdout=sto, stderr=ste)
File "/usr/lib/python3.10/subprocess.py", line 345, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: '/home/mash-Linux64-v2.3/mash'
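For context, Errno 7 (E2BIG) is raised by the operating system when the combined size of a command's arguments and environment exceeds the kernel limit, so my guess is that one mash command per 20,000-genome chunk is simply too long. Below is a minimal sketch I would use to check that guess. It is not part of dRep; the function names and the example paths are hypothetical, and the byte count is only an approximation of what the kernel actually accounts for.

```python
import os

POINTER_SIZE = 8  # assumption: 64-bit system, one pointer per argument


def argv_bytes(args):
    """Approximate bytes needed to pass `args` to a new process.

    Each argument is stored NUL-terminated, and the kernel also keeps
    one pointer per argument.
    """
    return sum(len(os.fsencode(a)) + 1 + POINTER_SIZE for a in args)


def fits_in_arg_max(args, env=None, limit=None):
    """Return True if `args` plus the environment fit under the limit.

    `limit` defaults to the value reported by the system (SC_ARG_MAX);
    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    limit = os.sysconf("SC_ARG_MAX") if limit is None else limit
    env_bytes = argv_bytes(f"{k}={v}" for k, v in env.items())
    return argv_bytes(args) + env_bytes < limit


if __name__ == "__main__":
    # Hypothetical chunk of 20,000 genome paths, mimicking --primary_chunksize 20000.
    chunk = ["/hypothetical/path/to/genomes/genome_%06d.fa" % i for i in range(20000)]
    print("ARG_MAX:", os.sysconf("SC_ARG_MAX"))
    print("chunk bytes:", argv_bytes(["mash", "sketch"] + chunk))
    print("fits:", fits_in_arg_max(["mash", "sketch"] + chunk))
```

If the check reports that a chunk does not fit, a smaller --primary_chunksize or shorter genome paths might bring each command under the limit, though I have not confirmed how dRep builds the mash command internally.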
How can I resolve this error? I am also unsure how to scale up: if I had 700,000 MAGs, how should I pass them all to dRep for dereplication in a single run?
Best Regards!