
How to run MultiZarrToZarr efficiently? #200

Closed

Description

@emfdavid

I have observed that I only get about 3-10% CPU use per Python process while running MultiZarrToZarr.

Tasks:  10 total,   1 running,   9 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   5943.5 total,   2256.4 free,   1163.2 used,   2524.0 buff/cache
MiB Swap:   1024.0 total,    785.8 free,    238.2 used.   4332.5 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
10393 dstuebe   20   0  671772 145984  36432 S   2.7   2.4   0:05.08 python3
  340 dstuebe   20   0 7275124 676712  18412 S   0.3  11.1   3:13.20 java
10850 dstuebe   20   0    7188   3448   2896 R   0.3   0.1   0:00.02 top
    1 dstuebe   20   0    1020      4      0 S   0.0   0.0   0:01.45 docker-init

Looking over the MultiZarrToZarr code and watching its behavior in a profiler, I am a bit stumped about how to speed things up (Full logs).

2022-07-19T21:44:11.973Z P:MainProcess T:MainThread INFO:service.py:transform:All done aggregations!
         2150132 function calls (2148840 primitive calls) in 183.545 seconds

   Ordered by: internal time
   List reduced from 869 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2588  154.792    0.060  154.792    0.060 {method 'acquire' of '_thread.lock' objects}
        6    2.850    0.475    2.850    0.475 {method 'read' of '_io.BufferedReader' objects}
    18379    1.509    0.000    5.152    0.000 /Users/dstuebe/.cache/bazel/_bazel_dstuebe/76bbf57da584b86027104686797623fa/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/logging/__init__.py:283(__init__)
   245265    1.342    0.000    2.643    0.000 {built-in method builtins.isinstance}
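
For context, the listing above is standard pstats output. A minimal sketch of how such a profile might be captured (the wrapper function is hypothetical; `mzz` is assumed to be an already-constructed MultiZarrToZarr instance):

```python
import cProfile
import pstats

def profile_translate(mzz):
    # Hypothetical wrapper: profile a single translate() run and print the
    # 50 functions with the highest internal time, as in the listing above.
    profiler = cProfile.Profile()
    profiler.enable()
    result = mzz.translate()
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats("tottime")  # "Ordered by: internal time"
    stats.print_stats(50)        # "List reduced ... due to restriction <50>"
    return result
```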

It seems that the for loops over the files in each pass of translate are intrinsically serial: each iteration updates a stateful object in order, as sketched below.
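
For illustration only (this is NOT kerchunk's actual implementation), the shape of the problem looks roughly like this, which is why the loop body cannot simply be handed to a thread or process pool:

```python
def translate_pass(inputs, update_state):
    # Illustrative sketch: each pass walks the inputs in order and mutates
    # shared state, so iteration N depends on everything seen in 0..N-1.
    state = {}  # accumulated references/coordinates, updated in place
    for refs in inputs:
        update_state(state, refs)
    return state
```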

I am not complaining; I just want to make sure I am not missing an opportunity to run this more efficiently.

With #194 working better (at least the tmp files are gone; I still need to assess the memory usage), I should be able to run a large number of processes at once using dask, as sketched below; each one will just take a long time.
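
A minimal sketch of that approach (assuming kerchunk's MultiZarrToZarr and dask.delayed; `all_refs`, the batch size of 100, and concat_dims=["time"] are placeholders):

```python
import dask
from kerchunk.combine import MultiZarrToZarr

def combine_batch(ref_paths):
    # One MultiZarrToZarr run per batch; still serial internally, but many
    # batches can run at once as independent dask tasks.
    mzz = MultiZarrToZarr(ref_paths, concat_dims=["time"])
    return mzz.translate()

batches = [all_refs[i:i + 100] for i in range(0, len(all_refs), 100)]
tasks = [dask.delayed(combine_batch)(batch) for batch in batches]
partial_refs = dask.compute(*tasks)
```

The partial reference sets would presumably still need a final serial combine pass at the end.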

Is that the best way forward here for the foreseeable future?
