Push the task back on the queue if the daemon crashes? #95
-
Hi @shikokuchuo, I am thinking ahead about my team's cloud computing migration, and it would be extremely cost-effective to use To replicate this scenario locally, I started 2 daemons, assigned one task to each, and terminated one of the daemons mid-task. If I do not use the dispatcher, then the surviving daemon completes both tasks, which is the ideal outcome for my use case. However, when the dispatcher is involved, then the task seems to stay assigned to the listener of the crashed daemon. Would it be possible to push that task back on the queue instead so a different daemon can pick it up (and update the Here is a using mirai::daemons(n = 2L, url = "ws://127.0.0.1:5700")
#> [1] 2
daemon1 <- callr::r_bg(\() mirai::daemon(url = "ws://127.0.0.1:5700/1"))
daemon2 <- callr::r_bg(\() mirai::daemon(url = "ws://127.0.0.1:5700/2"))
Sys.sleep(2)
daemon1$get_pid()
#> [1] 9143
daemon2$get_pid()
#> [1] 9144
mirai::status()
#> $connections
#> [1] 1
#>
#> $daemons
#> i online instance assigned complete
#> ws://127.0.0.1:5700/1 1 1 1 0 0
#> ws://127.0.0.1:5700/2 2 1 1 0 0
# Assign one task to each daemon.
task1 <- mirai::mirai({
Sys.sleep(10)
Sys.getpid()
})
task2 <- mirai::mirai({
Sys.sleep(10)
Sys.getpid()
})
Sys.sleep(2)
# Make the first daemon crash.
daemon1$kill()
#> [1] TRUE
Sys.sleep(1)
mirai::status()
#> $connections
#> [1] 1
#>
#> $daemons
#> i online instance assigned complete
#> ws://127.0.0.1:5700/1 1 0 1 1 0
#> ws://127.0.0.1:5700/2 2 1 1 1 0
# Only one task is dispatched:
start <- as.numeric(proc.time()["elapsed"])
while (mirai::unresolved(task1) || mirai::unresolved(task2)) {
elapsed <- as.numeric(proc.time()["elapsed"]) - start
message(
paste(
task1$data,
task2$data,
elapsed,
sep = " | "
)
)
if (elapsed > 600) {
break
}
Sys.sleep(60)
}
#> NA | NA | 0.0190000000000001
#> NA | 9144 | 60.019
#> NA | 9144 | 120.02
#> NA | 9144 | 180.02
#> NA | 9144 | 240.02
#> NA | 9144 | 300.021
#> NA | 9144 | 360.021
#> NA | 9144 | 420.021
#> NA | 9144 | 480.022
#> NA | 9144 | 540.022
#> NA | 9144 | 600.022
# One of the tasks is still assigned to the crashed daemon.
mirai::status()
#> $connections
#> [1] 1
#>
#> $daemons
#> i online instance assigned complete
#> ws://127.0.0.1:5700/1 1 0 1 1 0
#> ws://127.0.0.1:5700/2 2 1 1 1 1
task1$data
#> 'unresolved' logi NA
task2$data
#> [1] 9144
# The task completes if I relaunch daemon1:
daemon1 <- callr::r_bg(\() mirai::daemon(url = "ws://127.0.0.1:5700/1"))
daemon1$get_pid()
#> [1] 9587
Sys.sleep(2)
mirai::status()
#> $connections
#> [1] 1
#>
#> $daemons
#> i online instance assigned complete
#> ws://127.0.0.1:5700/1 1 1 2 1 0
#> ws://127.0.0.1:5700/2 2 1 1 1 1
Sys.sleep(11)
mirai::status()
#> $connections
#> [1] 1
#>
#> $daemons
#> i online instance assigned complete
#> ws://127.0.0.1:5700/1 1 1 2 1 1
#> ws://127.0.0.1:5700/2 2 1 1 1 1
task1$data
#> [1] 9587
task2$data
#> [1] 9144 |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 1 reply
-
@wlandau this seems at odds with wlandau/crew#101 and the retry mechanism you already implemented... The current behaviour is not surprising, and also seems to have remained the same throughout - I couldn't find a changelog entry. By design, the crashed task is isolated at the one daemon instance, so (assuming it's bad code) it doesn't go on and crash all 1,000 nodes in your HPC cluster! At the point it's crashed, (where you see assigned > complete and online == 0), you have the option to (i) relaunch the daemon, or (ii) use Was there something else you wanted to try / suggest at the 'mirai' level? |
Beta Was this translation helpful? Give feedback.
@wlandau this seems at odds with wlandau/crew#101 and the retry mechanism you already implemented...
The current behaviour is not surprising, and also seems to have remained the same throughout - I couldn't find a changelog entry. By design, the crashed task is isolated at the one daemon instance, so (assuming it's bad code) it doesn't go on and crash all 1,000 nodes in your HPC cluster!
At the point it's crashed, (where you see assigned > complete and online == 0), you have the option to (i) relaunch the daemon, or (ii) use
saisei(force = TRUE)
to return the task as an 'errorValue'. The consuming application e.g. targets can then contain logic to re-submit the task or handle otherwise.W…