Description
What happened?
I first reported "juliacall.JuliaError: TaskFailedException" errors in #759. Having done further tests, I strongly suspect those were actually two separate issues. The domain error that was occurring with sin or other "unsafe" functions was, I believe, indeed solved by the fix applied for that ticket.
However, I keep getting crashes. I tested without using sin, and it still crashed. Furthermore, I applied the fix from #759 and made sure Julia re-compiled the relevant package etc., but I still get crashes after that.
Looking further into the log files and scrolling up a bit from the stack trace, I see:
run: error: hpcslurm-computenodeset-1: task 13: Out Of Memory
Worker 15 terminated.
or similar; the task number and worker number differ for each crash.
And the julia-xxx-xxxxxx-0000.out log says:
slurmstepd: error: Detected 1 oom_kill event in StepId=10.0. Some of the step tasks have been OOM Killed.
slurmstepd: error: *** STEP 10.0 ON hpcslurm-computenodeset-1 FAILED (non-zero exit code or other failure mode) ***
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Could it be that v1.0.0 is more prone to memory issues like this? I never had issues like this before the upgrade.
The crashes always occur within roughly the same timeframe, which would be consistent with a memory issue, since one would expect memory to "boil over" after roughly the same amount of time. In my case, that is somewhere between 8 and 11 hours each time, on a VM with 240 GB of RAM. That used to be more than enough; if anything, before the upgrade to v1.0.0 memory usage was usually so low that I was planning to switch to VMs with less RAM to avoid unnecessary costs.
Here is a memory usage graph of a run started this morning. Memory usage climbs quite steeply, and while there seems to be some garbage collection or other cleanup process, it doesn't make much of a dent before memory usage continues to climb. Note that the steeply climbing line is user space (applications), while the lines that remain flat are kernel and disk data.
And when looking at which processes consume the memory, the top consumers are all Julia workers. See the screenshot below, where the heap size is also visible.
Might this memory issue be due to changes in v1.0.0?
And/or is there an easy fix, such as assigning a different memory limit to the processes or somehow encouraging more aggressive garbage collection?
I saw something similar in #490, but my understanding is that it was fixed.
I tried using the heap_size_hint_in_bytes parameter, but it does not seem to solve the issue; see the comment with the screenshot added to this ticket.
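For reference, a hedged sketch of roughly how I passed that hint; the 12 GB per-worker value is only an illustrative placeholder, not the exact figure from my runs:

from pysr import PySRRegressor

# Sketch only: pass a heap-size hint for the Julia processes' garbage collector.
# The 12 GB value below is a placeholder, not a recommendation.
model = PySRRegressor(
    # ... same parameters as listed under "Extra Info" below ...
    heap_size_hint_in_bytes=12 * 1024**3,
)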
Version
v1.0.0
Operating System
Linux
Package Manager
pip
Interface
Script (i.e., python my_script.py)
Relevant log output
run: error: hpcslurm-computenodeset-1: task 13: Out Of Memory
Worker 15 terminated.
slurmstepd: error: Detected 1 oom_kill event in StepId=10.0. Some of the step tasks have been OOM Killed.
slurmstepd: error: *** STEP 10.0 ON hpcslurm-computenodeset-1 FAILED (non-zero exit code or other failure mode) ***
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Extra Info
I was running in distributed mode (cluster_manager='slurm') with 30 CPU cores. The dataset has around 2500 records but only two features. It's unfortunately not possible to share the full Python script I'm using, but here are the main parameters used when calling PySRRegressor (a minimal runnable sketch with placeholder data follows the list):
niterations=10000000,
binary_operators=["+", "-", "*", "/"],
unary_operators=["exp", "sin", "square", "cube", "sqrt"],
procs=30,
populations=450,
cluster_manager='slurm',
ncycles_per_iteration=20000,
batching=False,
weight_optimize=0.35,
parsimony=1,
adaptive_parsimony_scaling=1000,
maxsize=35,
parallelism='multiprocessing',
bumper=False
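For completeness, here is a minimal runnable sketch of the call with these parameters; the random two-feature dataset and the dummy target are placeholders, since the real data and script cannot be shared:

import numpy as np
from pysr import PySRRegressor

# Placeholder data standing in for the real ~2500-record, two-feature dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(2500, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])  # dummy target, illustrative only

model = PySRRegressor(
    niterations=10000000,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp", "sin", "square", "cube", "sqrt"],
    procs=30,
    populations=450,
    cluster_manager="slurm",
    parallelism="multiprocessing",
    ncycles_per_iteration=20000,
    batching=False,
    weight_optimize=0.35,
    parsimony=1,
    adaptive_parsimony_scaling=1000,
    maxsize=35,
    bumper=False,
)
model.fit(X, y)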