I have observed that each Python process uses only about 3-10% CPU while running MultiZarrToZarr. Output from top:
Tasks: 10 total, 1 running, 9 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 99.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 5943.5 total, 2256.4 free, 1163.2 used, 2524.0 buff/cache
MiB Swap: 1024.0 total, 785.8 free, 238.2 used. 4332.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10393 dstuebe 20 0 671772 145984 36432 S 2.7 2.4 0:05.08 python3
340 dstuebe 20 0 7275124 676712 18412 S 0.3 11.1 3:13.20 java
10850 dstuebe 20 0 7188 3448 2896 R 0.3 0.1 0:00.02 top
1 dstuebe 20 0 1020 4 0 S 0.0 0.0 0:01.45 docker-init
Looking over the MultiZarrToZarr code and at its behavior in the profiler, I am a bit stumped on how to speed things up (Full logs).
2022-07-19T21:44:11.973Z P:MainProcess T:MainThread INFO:service.py:transform:All done aggregations!
2150132 function calls (2148840 primitive calls) in 183.545 seconds
Ordered by: internal time
List reduced from 869 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
2588 154.792 0.060 154.792 0.060 {method 'acquire' of '_thread.lock' objects}
6 2.850 0.475 2.850 0.475 {method 'read' of '_io.BufferedReader' objects}
18379 1.509 0.000 5.152 0.000 /Users/dstuebe/.cache/bazel/_bazel_dstuebe/76bbf57da584b86027104686797623fa/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/logging/__init__.py:283(__init__)
245265 1.342 0.000 2.643 0.000 {built-in method builtins.isinstance}
It seems like the for loops over the files in each pass of translate are intrinsically serial, updating a stateful object in order.
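One common reading of a profile dominated by `{method 'acquire' of '_thread.lock' objects}` is that the profiled thread is mostly blocked waiting on work happening elsewhere (for example I/O in another thread) rather than computing, which would also fit the low CPU numbers above. The sketch below is not kerchunk code. `slow_io` is a hypothetical stand-in that sleeps instead of reading anything. It only shows how a waiting main thread looks to cProfile: long wall time, little CPU time.

```python
import cProfile
import io
import pstats
import threading
import time


def slow_io(done: threading.Event) -> None:
    # Hypothetical stand-in for a blocking remote read: sleeps, does no real I/O.
    time.sleep(0.3)
    done.set()


def main_thread_work() -> None:
    done = threading.Event()
    worker = threading.Thread(target=slow_io, args=(done,))
    worker.start()
    done.wait()    # the main thread blocks here and uses almost no CPU
    worker.join()


def profile_waiting() -> tuple[float, float, str]:
    """Return (wall seconds, CPU seconds, profile report) for main_thread_work."""
    profiler = cProfile.Profile()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    profiler.enable()
    main_thread_work()
    profiler.disable()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    report = io.StringIO()
    # Sort by internal time, the same ordering as the profile in this issue.
    pstats.Stats(profiler, stream=report).sort_stats("tottime").print_stats(5)
    return wall, cpu, report.getvalue()


if __name__ == "__main__":
    wall, cpu, report = profile_waiting()
    print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
    print(report)
```

If the real run behaves like this, the time is going to waiting rather than to the serial Python loops themselves, so a faster CPU would not help much.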
I am not complaining, I just want to make sure I am not missing an opportunity to run this more efficiently.
With #194 working better (at least the tmp files are gone; I still need to assess the memory usage), I should be able to run a large number of processes at once using dask. Each one will just take a long time.
Is that the best way forward here for the foreseeable future?
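To make the idea concrete, here is a minimal sketch of running several independent aggregation jobs side by side. It uses the standard library's `concurrent.futures` as a stand-in for dask so that it runs anywhere. `aggregate_batch` is hypothetical and just sleeps to imitate a job that spends most of its time waiting. In the real setup each job would be one serial MultiZarrToZarr run, submitted through dask or a process pool instead of threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def aggregate_batch(batch_id: int) -> int:
    # Hypothetical stand-in for one serial aggregation job that mostly waits.
    time.sleep(0.2)
    return batch_id


def run_batches(batch_ids: list[int], max_workers: int) -> tuple[list[int], float]:
    """Run every batch concurrently; return the results and elapsed wall seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with batch_ids.
        results = list(pool.map(aggregate_batch, batch_ids))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_batches(list(range(8)), max_workers=8)
    print(results, f"{elapsed:.2f}s")
```

Each job stays as slow as before, but because the jobs are independent the total wall time is closer to that of one job than to the sum of all of them, as long as memory per job allows that many to run at once.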