‘error writing to connection’ or ‘error reading from connection’ #763
-
Hi, I have been trying to run an R script that uses parallelization on a Slurm-managed server, but the job keeps failing. I keep getting one of the following errors:
or
I have run the script successfully with both sequential and multisession plans, interactively in RStudio and as an R script, on a reduced dataset. I have also used the same code with a different, smaller dataset and it worked, so I don't think the code itself is the problem. I looked up the errors and tried increasing the memory for the job and reducing the number of workers, but that did not help. I also made sure that the objects I am using are exportable. The job tends to run for 2-4 days and then fail with the connection error. The R script that I am using is the following:
The server I am using runs Slurm, and my submission script has the following header:
I have tried using 16, 10, 8, 4, 3, and 2 workers, and all of those runs failed with one of the errors indicated above. Right now I am running the script with a sequential plan, and it has been running for 8 days so far without crashing. So it seems to be working, but it is taking a long time. This is my R session info:
Any help would be greatly appreciated.
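For reference, the kind of plan setup described above might look roughly like this; the worker count and the use of availableCores() are assumptions, since the actual script isn't shown:

```r
library(future)

## Illustrative sketch only -- not the actual script.
## availableCores() respects Slurm's allocation (e.g. SLURM_CPUS_PER_TASK),
## so the number of workers never exceeds what the job was given.
n_workers <- min(8, availableCores())

if (n_workers > 1) {
  plan(multisession, workers = n_workers)
} else {
  plan(sequential)
}
```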
-
Hi. The part of the error message that says "The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive." strongly suggests that the parallel worker was terminated for one reason or another. Before anything else, here are some observations to reason around this.

First, you're using: future::plan(multisession, workers = 8), which means you're launching 8 parallel workers on the current machine, i.e. the compute node where Slurm runs your job. This is easier to reason about than, say, a setup that parallelizes across multiple compute nodes. That you use:
fits this - it tells Slurm you want a slot on a single compute node.

Second, you're also specifying:
So, technically, you could use up to that many CPU cores.

Third, I see you request a slot with 360 GiB of memory in total and a runtime of 480 hours = 20 days;
Fourth, you're specifying: options(future.globals.onReference = "error"), which means you rule out most common cases where there is a risk that you're using objects that cannot be transferred to another R process.

Fifth, you're saying "I have run the script successfully with both sequential and multisession plans, interactively in RStudio and as an R script, on a reduced dataset", which further helps to rule out that you're using non-exportable objects.

Sixth, the information about global object sizes in the error message is there just in case there could be a large object that you didn't anticipate. Providing this information has helped others track down OOM-killing problems. In this case, you have a 6+ GiB object (‘genotype_matrix’ (6.21 GiB of class ‘numeric’)). This amount is added to the memory consumption of each parallel worker. I'm not sure if that is relevant given that you have requested 360 GiB of memory.

Assuming you're not running out of runtime (20 days), one guess is that you might be running out of memory and the Out-of-Memory (OOM) Killer terminates one or more of your parallel processes in order for your job to stay within the 360 GiB of memory it was given. If this is the case, you should be able to see it in the job log files that Slurm produces for you, cf. https://www.c4.ucsf.edu/scheduler/job-summary.html. Do you see anything suspicious in those log files? My best guess is that the Slurm logs and the Slurm accounting data (e.g. from sacct) will tell you what happened here.

You're saying "I have run the script successfully with both sequential and multisession plans, interactively in RStudio and as an R script, on a reduced dataset." If that was on the same system, I expect you should be able to run this in vanilla R as well, taking RStudio out of the equation, i.e. run the script directly from a terminal.
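To make that memory reasoning concrete, here is a rough back-of-the-envelope sketch in R; only the 6.21 GiB figure and the worker count come from the thread above, while the main-session and per-worker overheads are just assumptions:

```r
## Rough, illustrative estimate of peak memory when a large global is
## exported to every multisession worker.
global_size_gib  <- 6.21  # size reported for 'genotype_matrix' in the error
n_workers        <- 8     # as in plan(multisession, workers = 8)
main_session_gib <- 10    # assumed footprint of the main R session
overhead_gib     <- 2     # assumed per-worker overhead (R itself, temporaries)

est_peak_gib <- main_session_gib + n_workers * (global_size_gib + overhead_gib)
est_peak_gib  # ~76 GiB in this sketch, well below 360 GiB on paper
## If each worker also builds large intermediate results, the true peak can
## be much higher, which is what an OOM kill would point to.
```

Under these assumptions, the exported copies of genotype_matrix alone are unlikely to exhaust 360 GiB with 8 workers; if the OOM killer is involved, large per-task allocations inside each worker are the more likely culprit.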
I have been troubleshooting this and I finally got the job to run successfully!
I also tried using future.callr::callr instead of multisession, but that job failed too. I got the following error:
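For context on what was tried: switching to the callr backend looks roughly like the following sketch, where the worker count is just an illustrative value:

```r
library(future.callr)

## With the callr backend, each future runs in a fresh R session launched
## via callr, instead of in a persistent multisession worker.
plan(callr, workers = 4)
```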