how to support lazy/streaming load with standard sourmash functions: standalone manifests #3023

Closed
@ctb

Description

@AnneliektH asked me yesterday how to best provide a list of metagenome sketches to mgmanysearch (see https://github.com/sourmash-bio/sourmash_plugin_containment_search/). I realized I wasn't 100% sure of the answer, despite having written this:

so the question becomes, how do you provide collections of large metagenomes to manysearch and fastmultigather in a single filename?
And the answer is: manifests. Manifests are a sourmash filetype that contains information about sketches without containing the actual sketch content, and they can be used as "catalogs" of sketch content.

(Part of my confusion was that the text above describes loading done through Rust functionality, not through the standard Python loading functions.)

mgmanysearch uses the standard sourmash loading functions, so I thought an investigation would be useful and would lead to some additional sourmash documentation too!
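
To make the "catalog" idea concrete, here is a minimal sketch that uses only the Python standard library, not sourmash itself. The column names, file paths, and md5 values are assumptions made up for illustration (loosely modeled on a CSV manifest). The point is only that selection by ksize can happen on the metadata alone, leaving a list of locations to open later.

```python
import csv
import io

# Hypothetical manifest-like content. The columns and rows are invented for
# this example and are not guaranteed to match a real sourmash manifest.
MANIFEST_TEXT = """\
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,ksize,moltype,scaled,name
sketches/139_1.sig.gz,aaaa,21,DNA,1000,139_1
sketches/139_1.sig.gz,bbbb,31,DNA,1000,139_1
sketches/139_2.sig.gz,cccc,21,DNA,1000,139_2
"""


def read_catalog(text):
    """Parse manifest-like CSV text into a list of row dicts."""
    # Skip the leading comment line before handing the rest to the CSV reader.
    lines = [ln for ln in io.StringIO(text) if not ln.startswith("#")]
    return list(csv.DictReader(lines))


def select(rows, ksize):
    """Filter catalog rows on metadata only; no sketch file is opened."""
    return [r for r in rows if int(r["ksize"]) == ksize]


rows = read_catalog(MANIFEST_TEXT)
k21 = select(rows, ksize=21)
locations = [r["internal_location"] for r in k21]
print(locations)
```

Only the entries in `locations` would need to be read from disk, and only when a consumer actually asks for them.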

tl;dr don't use pathlists, use manifests.

the script

I wrote the following Python script:

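#! /usr/bin/env python
# Time each stage of loading sketches from the collection named in sys.argv[1].
import sys
import sourmash
import time

# Stage 1: open the collection (pathlist, manifest, ...).
print(f'opening {sys.argv[1]}')
sys.stdout.flush()
mark = time.time()
idx = sourmash.load_file_as_index(sys.argv[1])
print(f'{time.time() - mark:.3f}s')
sys.stdout.flush()

# Stage 2: restrict the index to k=21 sketches.
print(f'selecting {sys.argv[1]}')
sys.stdout.flush()
mark = time.time()
idx = idx.select(ksize=21)
print(f'{time.time() - mark:.3f}s')
sys.stdout.flush()

print("starting...")
sys.stdout.flush()

# Stage 3: iterate over the sketches, reporting how long each one took to arrive.
mark = time.time()
for ss in idx.signatures():
    print(f'loaded {ss.name}')
    print(f'{time.time() - mark:.3f}s')
    sys.stdout.flush()
    # Reset the timer so the next measurement covers only the next sketch.
    mark = time.time()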
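
The per-item timing loop in the script can also be factored into a small reusable generator. This is a sketch only: `timed` and `slow_source` are hypothetical names, and `slow_source` is a stand-in for an iterator such as `idx.signatures()`. Unlike the script, it measures only the time spent producing each item, not the time spent printing.

```python
import time


def timed(iterable):
    """Yield (item, seconds), where seconds is the time spent producing item."""
    it = iter(iterable)
    while True:
        mark = time.perf_counter()
        try:
            item = next(it)
        except StopIteration:
            return
        yield item, time.perf_counter() - mark


def slow_source(names, delay=0.01):
    """Hypothetical stand-in for a lazily loading iterator of sketches."""
    for name in names:
        time.sleep(delay)  # pretend each item is expensive to load
        yield name


results = list(timed(slow_source(["139_1", "139_2"])))
for name, secs in results:
    print(f"loaded {name}")
    print(f"{secs:.3f}s")
```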

the execution

and then ran it on a pathlist containing a list of filenames:

opening pathlist.txt
23.715s
selecting pathlist.txt
0.000s
starting...
loaded 139_2
0.000s
loaded 139_1
0.000s
loaded 139_3
0.000s
loaded 139_4
0.000s

and on a manifest generated with sourmash sig collect $(cat pathlist.txt) -o mf.csv -F csv:

opening mf.csv
0.009s
selecting mf.csv
0.000s
starting...
loaded 139_1
4.404s
loaded 139_2
6.663s
loaded 139_3
5.751s
loaded 139_4
6.798s

results

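When using pathlists, all sketches are loaded at once at the beginning (the ~23.7s spent in load_file_as_index above), consuming All The Memory.

When using manifests, opening is nearly instant and each sketch is loaded on demand during iteration (roughly 4-7s apiece above), not consuming All The Memory.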
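
The difference between the two behaviors can be illustrated without sourmash at all. In this minimal sketch, `fake_load`, `eager_index`, and `lazy_index` are hypothetical stand-ins (not sourmash functions) and the file names are made up. The eager version mimics what was observed for pathlists, and the generator version mimics what was observed for manifests.

```python
loads = []  # records which (hypothetical) files have been read so far


def fake_load(path):
    """Stand-in for an expensive sketch load; not a sourmash function."""
    loads.append(path)
    return f"sketch<{path}>"


def eager_index(paths):
    # Pathlist-like behavior: read everything up front and hold it in memory.
    return [fake_load(p) for p in paths]


def lazy_index(paths):
    # Manifest-like behavior: read each sketch only when the consumer asks.
    for p in paths:
        yield fake_load(p)


paths = ["139_1.sig.gz", "139_2.sig.gz", "139_3.sig.gz", "139_4.sig.gz"]

eager = eager_index(paths)
eager_loads_at_open = len(loads)     # every file has already been read

loads.clear()
lazy = lazy_index(paths)
lazy_loads_at_open = len(loads)      # nothing has been read yet
first = next(lazy)
lazy_loads_after_first = len(loads)  # only the first file has been read

print(eager_loads_at_open, lazy_loads_at_open, lazy_loads_after_first)
```

With the lazy version, the cost of each load moves from "open" time into the iteration loop, which matches the shape of the timings above.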

other thoughts

This is another reason to use .zip files to store sketches, instead of sig.gz files; sig collect will need to load the actual sketches in sig.gz files in order to build the manifest, while the manifest is already available in .zip files.

tl;dr

  • if you have a bunch of big metagenomes to search using (e.g.) mgmanysearch,
  • and you want to make them into a list to search,
  • store them in zip files,
  • and use sig collect to build a manifest across some or all of them,
  • and then use those manifests.

TODO: verify that sig collect loads things on the command line progressively 😅
