Description
Is your feature request related to a problem? Please describe.
I've noticed that long-running scripts that iteratively write data using amuse.io.write_set_to_file tend to become noticeably slower the further into the run they get, even though the workload in each iteration should take constant time. After some digging, I've found that this is a known problem in h5py when it writes with the older HDF5 file format versions it uses by default. The solution in that Stack Overflow thread suggests adding the parameter libver='latest' to the instantiation of h5py.File so that HDF5 uses its most recent file format, where this issue is fixed.
This can be tested using a simple script:
import numpy as np
from amuse.io import write_set_to_file
import amuse.lab as al
import amuse.units.units as au
import time
particles = al.Particles(5)
particles.x = np.arange(5) | au.pc
t = time.time()
for i in range(1, 10001):
    # Emulate some kind of constant-time process that changes particles.x
    time.sleep(0.001)
    particles.x = (np.arange(5) + i) | au.pc
    write_set_to_file(particles.savepoint(i | au.yr), "test.amuse", "amuse", append_to_file=True)
    if i % 1000 == 0:
        t2 = time.time()
        print(f"{i-1000}-{i}: {1000/(t2-t):.1f} iterations per second")
        t = t2
On my unedited version of AMUSE, this will output the following:
0-1000: 417.8 iterations per second
1000-2000: 366.1 iterations per second
2000-3000: 327.4 iterations per second
3000-4000: 294.0 iterations per second
4000-5000: 267.4 iterations per second
5000-6000: 245.5 iterations per second
6000-7000: 220.0 iterations per second
7000-8000: 207.4 iterations per second
8000-9000: 200.0 iterations per second
9000-10000: 186.9 iterations per second
However, if I edit amuse.io.store_v2.py to add libver='latest' on lines 743, 745 and 747, following the Stack Overflow solution, I get the following:
0-1000: 448.1 iterations per second
1000-2000: 439.7 iterations per second
2000-3000: 434.9 iterations per second
3000-4000: 440.3 iterations per second
4000-5000: 429.4 iterations per second
5000-6000: 434.1 iterations per second
6000-7000: 427.1 iterations per second
7000-8000: 439.2 iterations per second
8000-9000: 432.8 iterations per second
9000-10000: 422.9 iterations per second
This is a significant speed improvement: the write rate stays roughly constant over the whole run instead of dropping steadily. The output file is also almost half the size: 60 MB compared with the 111 MB of the first run.
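To put a number on the difference, here is a small sketch that computes how much of the initial throughput is left in the last block of each run. The rates are copied from the two outputs above; the helper name retained_fraction is only for illustration.

```python
# Throughput figures (iterations per second) copied from the two runs above.
unpatched = [417.8, 366.1, 327.4, 294.0, 267.4, 245.5, 220.0, 207.4, 200.0, 186.9]
patched = [448.1, 439.7, 434.9, 440.3, 429.4, 434.1, 427.1, 439.2, 432.8, 422.9]


def retained_fraction(rates):
    """Fraction of the first block's throughput that is still reached in the last block."""
    return rates[-1] / rates[0]


# File sizes in MB, as reported above.
size_ratio = 60 / 111

print(f"unpatched run keeps {retained_fraction(unpatched):.1%} of its initial rate")
print(f"patched run keeps   {retained_fraction(patched):.1%} of its initial rate")
print(f"patched file is {size_ratio:.1%} of the unpatched file size")
```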
It is important to note the reason the Stack Overflow solution gives for this not being the default in h5py: compatibility. An unedited read_set_from_file still seems to read both output files fine, but I cannot rule out breaking changes.
Describe the solution you'd like
Given the clear positive effect, but unknown probability of breaking changes, I think adding a flag (e.g. h5py_use_latest_libver) to write_set_to_file would be the correct way to go, where the default (False) would keep the current behaviour and True would add libver='latest' to any underlying h5py.File instantiations. Alternatively, a parameter (e.g. h5py_libver) could be added which directly passes its value to libver in h5py.File so that other features of libver can also be used in case someone would want that.
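A minimal sketch of the second variant, to show what I have in mind. This is not the actual AMUSE code: h5py_libver is the proposed (not yet existing) parameter, and h5py_file_kwargs and open_hdf5 are hypothetical helpers standing in for wherever store_v2.py instantiates h5py.File. The only real API used is the libver keyword of h5py.File.

```python
def h5py_file_kwargs(h5py_libver=None):
    """Translate the proposed h5py_libver option into keyword arguments for h5py.File.

    None keeps the current behaviour (no libver argument is passed at all).
    Any other value, e.g. "latest" or a (low, high) tuple, is forwarded unchanged.
    """
    if h5py_libver is None:
        return {}
    return {"libver": h5py_libver}


def open_hdf5(filename, mode, h5py_libver=None):
    """Hypothetical stand-in for the h5py.File calls in store_v2.py."""
    import h5py  # imported lazily so the helper above works without h5py installed

    return h5py.File(filename, mode, **h5py_file_kwargs(h5py_libver))
```

With this, open_hdf5("test.amuse", "a", h5py_libver="latest") would reproduce my manual edit, while leaving the parameter out keeps today's behaviour. The boolean flag variant would simply map True to "latest" and False to None before calling the same helper.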
Additional context
Operating system version: Linux 6.5.9-arch2-1
Compiler version: GCC 13.2.1
Python version: 3.11.5
AMUSE version: commit 6510b63 (24th of September, 2023); effectively latest for the entirety of amuse.io
H5py version: 3.10.0
Numpy version: 1.25.2