
[BUG]: Memory issue in version 1.0.0? #764

Closed
@GoldenGoldy

Description


What happened?

I first reported "juliacall.JuliaError: TaskFailedException" errors in #759. Having done further tests, I strongly suspect those were actually two separate issues. I believe the domain error that occurred with sin and other "unsafe" functions was indeed solved by the fix applied for that ticket.

However, I keep getting crashes. I tested without using sin, and it still crashed. Furthermore, I applied the fix from #759 and made sure Julia recompiled the relevant package, but the crashes continued after that.

Looking further into the log files and scrolling up a bit from the stack trace, I see:

run: error: hpcslurm-computenodeset-1: task 13: Out Of Memory
Worker 15 terminated.

or similar; the task number and worker number differ for each crash.

And the julia-xxx-xxxxxx-0000.out log says:

slurmstepd: error: Detected 1 oom_kill event in StepId=10.0. Some of the step tasks have been OOM Killed.
slurmstepd: error: *** STEP 10.0 ON hpcslurm-computenodeset-1 FAILED (non-zero exit code or other failure mode) ***
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

Could it be that v1.0.0 is more prone to such memory issues? I never had issues like this before the upgrade.

The crashes always occur within roughly the same timeframe, which would be consistent with a memory issue, since one would expect memory to "boil over" after roughly the same amount of time. In my case, that is somewhere between 8 and 11 hours each time, on a VM with 240 GB of RAM. This used to be more than enough; if anything, before the upgrade to v1.0.0 memory usage was usually very low, and I was planning to switch to VMs with less RAM to avoid unnecessary costs.

Here is a memory usage graph of a run started this morning. It shows that memory usage climbs quite steeply, and while there seems to be some garbage collection or other cleanup process, it doesn't make much of a dent before memory usage continues to climb. Note that the steeply climbing line is user-space (application) memory, while the flat lines are kernel and disk data.
(Screenshot: PySR_memory_usage)

When looking at which processes consume the memory, the top users are all Julia workers. See the screenshot below, where the heap size is also visible.
(Screenshot: PySR_memory_procs)

Might this memory issue be due to changes in v1.0.0?
And/or is there an easy fix, such as assigning a different memory limit to the processes or somehow encouraging more aggressive garbage collection?

I saw something similar in #490, but my understanding is that that issue was fixed.

I tried using the heap_size_hint_in_bytes parameter, but it seems this does not solve the issue; see the comment with screenshot added to this ticket.
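For reference, here is roughly how I set that parameter. This is only a minimal sketch: the 8 GB value shown is an illustrative figure rather than my exact setting, and the remaining arguments are the ones listed under "Extra Info" below.

from pysr import PySRRegressor

# Minimal sketch: the heap size value below is only an illustrative figure,
# not my exact setting; all other parameters are listed under "Extra Info".
model = PySRRegressor(
    heap_size_hint_in_bytes=8_000_000_000,  # ~8 GB hint per Julia worker process
)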

Version

v1.0.0

Operating System

Linux

Package Manager

pip

Interface

Script (i.e., python my_script.py)

Relevant log output

run: error: hpcslurm-computenodeset-1: task 13: Out Of Memory
Worker 15 terminated.


slurmstepd: error: Detected 1 oom_kill event in StepId=10.0. Some of the step tasks have been OOM Killed.
slurmstepd: error: *** STEP 10.0 ON hpcslurm-computenodeset-1 FAILED (non-zero exit code or other failure mode) ***
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

Extra Info

I was running in distributed mode (cluster_manager='slurm') with 30 CPU cores. The dataset has around 2500 records, but only two features. Unfortunately, it's not possible to share the full Python script I'm using, but here are the main parameters used when calling PySRRegressor (a sketch of the surrounding call follows the list):

niterations=10000000,
binary_operators=["+", "-", "*", "/"],
unary_operators=["exp", "sin", "square", "cube", "sqrt"],
procs=30,
populations=450,
cluster_manager='slurm',
ncycles_per_iteration=20000,
batching=False,
weight_optimize=0.35,
parsimony=1,
adaptive_parsimony_scaling=1000,
maxsize=35,
parallelism='multiprocessing',
bumper=False
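For completeness, the surrounding code is just the standard scikit-learn-style fit call, roughly as in the sketch below. The data arrays are random placeholders with the same shape as my real dataset (about 2500 rows, two features), which I can't share, and the constructor is abbreviated here to the distributed-mode settings from the list above.

import numpy as np
from pysr import PySRRegressor

# Placeholder data matching the shape of the real dataset (~2500 rows, 2 features),
# which unfortunately cannot be shared.
X = np.random.rand(2500, 2)
y = np.random.rand(2500)

# Constructor abbreviated to the distributed-mode settings; the full parameter
# list is given above.
model = PySRRegressor(
    procs=30,
    populations=450,
    cluster_manager="slurm",
    parallelism="multiprocessing",
)

model.fit(X, y)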
