
How to run MultiZarrToZarr efficiently? #200

Closed

Description

@emfdavid

I have observed that I only get about 3-10% CPU use per Python process while running MultiZarrToZarr.

Tasks:  10 total,   1 running,   9 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   5943.5 total,   2256.4 free,   1163.2 used,   2524.0 buff/cache
MiB Swap:   1024.0 total,    785.8 free,    238.2 used.   4332.5 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
10393 dstuebe   20   0  671772 145984  36432 S   2.7   2.4   0:05.08 python3
  340 dstuebe   20   0 7275124 676712  18412 S   0.3  11.1   3:13.20 java
10850 dstuebe   20   0    7188   3448   2896 R   0.3   0.1   0:00.02 top
    1 dstuebe   20   0    1020      4      0 S   0.0   0.0   0:01.45 docker-init

Looking over the MultiZarrToZarr code and watching its behavior in a profiler, I am a bit stumped about how to speed things up (Full logs).

2022-07-19T21:44:11.973Z P:MainProcess T:MainThread INFO:service.py:transform:All done aggregations!
         2150132 function calls (2148840 primitive calls) in 183.545 seconds

   Ordered by: internal time
   List reduced from 869 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2588  154.792    0.060  154.792    0.060 {method 'acquire' of '_thread.lock' objects}
        6    2.850    0.475    2.850    0.475 {method 'read' of '_io.BufferedReader' objects}
    18379    1.509    0.000    5.152    0.000 /Users/dstuebe/.cache/bazel/_bazel_dstuebe/76bbf57da584b86027104686797623fa/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/logging/__init__.py:283(__init__)
   245265    1.342    0.000    2.643    0.000 {built-in method builtins.isinstance}
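
For context, the listing above is standard pstats output. A minimal sketch of how such a profile might be captured (the wrapper function is hypothetical; `mzz` is assumed to be an already-constructed MultiZarrToZarr instance):

```python
import cProfile
import pstats

def profile_translate(mzz):
    # Hypothetical wrapper: profile a single translate() run and print the
    # 50 functions with the highest internal time, as in the listing above.
    profiler = cProfile.Profile()
    profiler.enable()
    result = mzz.translate()
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats("tottime")  # "Ordered by: internal time"
    stats.print_stats(50)        # "List reduced ... due to restriction <50>"
    return result
```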

It seems that the for loops over the files in each pass of translate are intrinsically serial: each iteration updates a stateful object in order, as sketched below.
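
For illustration only (this is NOT kerchunk's actual implementation), the shape of the problem looks roughly like this, which is why the loop body cannot simply be handed to a thread or process pool:

```python
def translate_pass(inputs, update_state):
    # Illustrative sketch: each pass walks the inputs in order and mutates
    # shared state, so iteration N depends on everything seen in 0..N-1.
    state = {}  # accumulated references/coordinates, updated in place
    for refs in inputs:
        update_state(state, refs)
    return state
```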

I am not complaining; I just want to make sure I am not missing an opportunity to run this more efficiently.

With #194 working better (at least the tmp files are gone; I still need to assess the memory usage), I should be able to run a large number of processes at once using dask, as sketched below; each one will just take a long time.
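
A minimal sketch of that approach (assuming kerchunk's MultiZarrToZarr and dask.delayed; `all_refs`, the batch size of 100, and concat_dims=["time"] are placeholders):

```python
import dask
from kerchunk.combine import MultiZarrToZarr

def combine_batch(ref_paths):
    # One MultiZarrToZarr run per batch; still serial internally, but many
    # batches can run at once as independent dask tasks.
    mzz = MultiZarrToZarr(ref_paths, concat_dims=["time"])
    return mzz.translate()

batches = [all_refs[i:i + 100] for i in range(0, len(all_refs), 100)]
tasks = [dask.delayed(combine_batch)(batch) for batch in batches]
partial_refs = dask.compute(*tasks)
```

The partial reference sets would presumably still need a final serial combine pass at the end.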

Is that the best way forward here for the foreseeable future?
