High memory use when using Python and threads #855

@cjw85

The program align.py uses mappy to align reads in Python using multiple worker threads. After loading the index, memory usage quickly jumps to >20 GB and then continues to climb steadily through 40 GB and beyond.

This issue was first discovered in bonito and isolated to mappy. The data flow in the example mirrors that in bonito, but is reduced to use only Python stdlib functionality.
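
As a rough illustration of that kind of data flow, here is a simplified sketch. It is not align.py itself: `run_workers`, `align_fn` and the queue layout are names made up for this example, and `align_fn` is a placeholder for the per-read alignment call into mappy.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker thread to exit


def run_workers(reads, align_fn, n_threads=4):
    """Align (name, seq) pairs using n_threads worker threads.

    align_fn is a hypothetical stand-in for the per-read alignment call.
    """
    # Bounded input queue, so unprocessed reads cannot pile up in memory.
    in_q = queue.Queue(maxsize=2 * n_threads)
    out_q = queue.Queue()

    def worker():
        while True:
            item = in_q.get()
            if item is _STOP:
                break
            name, seq = item
            out_q.put((name, align_fn(seq)))

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for read in reads:
        in_q.put(read)
    for _ in threads:
        in_q.put(_STOP)  # one sentinel per worker
    for t in threads:
        t.join()

    # All workers have finished, so the output queue can be drained safely.
    results = {}
    while not out_q.empty():
        name, hits = out_q.get()
        results[name] = hits
    return results
```

With a structure like this, the Python-side queues hold only a bounded number of pending reads, so a steady climb in memory would have to come from somewhere other than the input backlog.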

mappy: v2.24
pysam: v0.18 (just for optionally reading fastq inputs)
python: v3.8.6

Run the program, creating query sequences from the index on the fly:

python align.py GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi --threads 48

or using a directory containing *.fastq* files:

python align.py GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi --fastq_dir FAQ32498 --threads 48

The inputs I am using are available in the AWS S3 bucket at:

s3://ont-research/misc/mappy-mem/FAQ32498.tar
s3://ont-research/misc/mappy-mem/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi

I've not fully ascertained whether using lots of threads exacerbates the problem or simply makes the symptom apparent more quickly.
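
One way to tell those two possibilities apart would be to record peak memory for otherwise identical runs at different `--threads` values. A minimal sketch using the stdlib `resource` module (Unix only; `peak_rss_gb` is a made-up helper name and is not part of align.py):

```python
import resource
import sys


def peak_rss_gb():
    """Peak resident set size of this process so far, in GB (10**9 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1e9
```

The value is a high-water mark that never decreases within a process, so each thread count would need its own fresh run, with the helper called once at the end.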
