‘error writing to connection’ or ‘error reading from connection’ #763
-
Hi, I have been trying to run an R script that uses parallelization on a Slurm-managed server, but the job keeps failing. I keep getting one of the following errors:
or
I have run the script successfully with both sequential and multisession plans, interactively in RStudio and as an R script, on a reduced dataset. I have also used the same code with a different, smaller dataset and it worked, so I don't think the code itself is the problem. I looked up the errors and tried increasing the memory for the job and reducing the number of workers, but that did not help. I also made sure that the objects I am using are exportable. The job tends to run for 2-4 days and then fail with the connection error. The R script that I am using is the following:
The server I am using runs Slurm, and my submission script has the following header:
I have tried using 16, 10, 8, 4, 3, and 2 workers, and all of those runs failed with one of the errors indicated above. Right now I am running the script with a sequential plan, and it has been running for 8 days so far without crashing. So it seems to be working, but it is taking a long time. This is my R session info:
Any help would be greatly appreciated.
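For reference, the kind of plan setup described above might look roughly like this; the worker count and the use of availableCores() are assumptions, since the actual script isn't shown:

```r
library(future)

## Illustrative sketch only -- not the actual script.
## availableCores() respects Slurm's allocation (e.g. SLURM_CPUS_PER_TASK),
## so the number of workers never exceeds what the job was given.
n_workers <- min(8, availableCores())

if (n_workers > 1) {
  plan(multisession, workers = n_workers)
} else {
  plan(sequential)
}
```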
-
Hi. The part of the error message that says "The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive." strongly suggests that the parallel worker was terminated for one reason or another. Before anything else, here are some observations to reason around this.

First, you're using: future::plan(multisession, workers = 8), which means you're launching 8 parallel workers on the current machine, i.e. the compute node where Slurm runs your job. This is easier to reason about than, say, a setup that parallelizes across multiple compute nodes. That you use:
fits this - it tells Slurm you want a slot on a single compute node.

Second, you're also specifying:
So, technically, you could use up to that many CPU cores.

Third, I see you request a slot with 360 GiB of memory in total and a runtime of 480 hours = 20 days;
Fourth, you're specifying: options(future.globals.onReference = "error"), which means you rule out most common cases where there is a risk that you're using objects that cannot be transferred to another R process.

Fifth, you're saying "I have run the script successfully with both sequential and multisession plans, interactively in RStudio and as an R script, on a reduced dataset", which further helps to rule out that you're using non-exportable objects.

Sixth, the information about global object sizes in the error message is there just in case there could be a large object that you didn't anticipate. Providing this information has helped others track down OOM-killing problems. In this case, you have a 6+ GiB object (‘genotype_matrix’ (6.21 GiB of class ‘numeric’)). This amount is added to the memory consumption of each parallel worker. I'm not sure if that is relevant given that you have requested 360 GiB of memory.

Assuming you're not running out of runtime (20 days), one guess is that you might be running out of memory and the Out-of-Memory (OOM) Killer terminates one or more of your parallel processes in order for your job to stay within the 360 GiB of memory it was given. If this is the case, you should be able to see it in the job log files that Slurm produces for you, cf. https://www.c4.ucsf.edu/scheduler/job-summary.html. Do you see anything suspicious in those log files? My best guess is that the Slurm logs and the Slurm accounting data (e.g. from sacct) will tell you what happened here.

You're saying "I have run the script successfully with both sequential and multisession plans, interactively in RStudio and as an R script, on a reduced dataset." If that was on the same system, I expect you should be able to run this in vanilla R as well, taking RStudio out of the equation, i.e. run the script directly from a terminal.
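To make that memory reasoning concrete, here is a rough back-of-the-envelope sketch in R; only the 6.21 GiB figure and the worker count come from the thread above, while the main-session and per-worker overheads are just assumptions:

```r
## Rough, illustrative estimate of peak memory when a large global is
## exported to every multisession worker.
global_size_gib  <- 6.21  # size reported for 'genotype_matrix' in the error
n_workers        <- 8     # as in plan(multisession, workers = 8)
main_session_gib <- 10    # assumed footprint of the main R session
overhead_gib     <- 2     # assumed per-worker overhead (R itself, temporaries)

est_peak_gib <- main_session_gib + n_workers * (global_size_gib + overhead_gib)
est_peak_gib  # ~76 GiB in this sketch, well below 360 GiB on paper
## If each worker also builds large intermediate results, the true peak can
## be much higher, which is what an OOM kill would point to.
```

Under these assumptions, the exported copies of genotype_matrix alone are unlikely to exhaust 360 GiB with 8 workers; if the OOM killer is involved, large per-task allocations inside each worker are the more likely culprit.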
I have been troubleshooting this and I finally got the job to run successfully!
I also tried using future.callr::callr instead of multisession, but that job failed too. I got the following error:
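For context on what was tried: switching to the callr backend looks roughly like the following sketch, where the worker count is just an illustrative value:

```r
library(future.callr)

## With the callr backend, each future runs in a fresh R session launched
## via callr, instead of in a persistent multisession worker.
plan(callr, workers = 4)
```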