Push the task back on the queue if the daemon crashes? #95

wlandau · 2024-01-22T14:11:52Z


I am thinking ahead about my team's cloud computing migration, and it would be extremely cost-effective to use crew.aws.batch with EC2 spot instances (or Fargate spot). Unfortunately, spot instances are prone to interruptions due to fluctuating supply and demand, and this will often cause mirai daemons to terminate at unpredictable times.

To replicate this scenario locally, I started 2 daemons, assigned one task to each, and terminated one of the daemons mid-task. Without the dispatcher, the surviving daemon completes both tasks, which is the ideal outcome for my use case. With the dispatcher, however, the task seems to stay assigned to the listener of the crashed daemon. Would it be possible to push that task back on the queue instead so a different daemon can pick it up (and update the assigned counter accordingly)? (If not, it seems like crew may have to return to continuously re-launching "backlogged workers", cf. #63 (comment).)

Here is a reproducible example using mirai 0.12.0 and nanonext 0.12.0. I see the same behavior on macOS and Ubuntu.

mirai::daemons(n = 2L, url = "ws://127.0.0.1:5700")
#> [1] 2
daemon1 <- callr::r_bg(\() mirai::daemon(url = "ws://127.0.0.1:5700/1"))
daemon2 <- callr::r_bg(\() mirai::daemon(url = "ws://127.0.0.1:5700/2"))
Sys.sleep(2)
daemon1$get_pid()
#> [1] 9143
daemon2$get_pid()
#> [1] 9144
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                       i online instance assigned complete
#> ws://127.0.0.1:5700/1 1      1        1        0        0
#> ws://127.0.0.1:5700/2 2      1        1        0        0

# Assign one task to each daemon.
task1 <- mirai::mirai({
  Sys.sleep(10)
  Sys.getpid()
})
task2 <- mirai::mirai({
  Sys.sleep(10)
  Sys.getpid()
})
Sys.sleep(2)

# Make the first daemon crash.
daemon1$kill()
#> [1] TRUE
Sys.sleep(1)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                       i online instance assigned complete
#> ws://127.0.0.1:5700/1 1      0        1        1        0
#> ws://127.0.0.1:5700/2 2      1        1        1        0

# Only one task completes:
start <- as.numeric(proc.time()["elapsed"])
while (mirai::unresolved(task1) || mirai::unresolved(task2)) {
  elapsed <- as.numeric(proc.time()["elapsed"]) - start
  message(
    paste(
      task1$data,
      task2$data,
      elapsed,
      sep = " | "
    )
  )
  if (elapsed > 600) {
    break
  }
  Sys.sleep(60)
}
#> NA | NA | 0.0190000000000001
#> NA | 9144 | 60.019
#> NA | 9144 | 120.02
#> NA | 9144 | 180.02
#> NA | 9144 | 240.02
#> NA | 9144 | 300.021
#> NA | 9144 | 360.021
#> NA | 9144 | 420.021
#> NA | 9144 | 480.022
#> NA | 9144 | 540.022
#> NA | 9144 | 600.022

# One of the tasks is still assigned to the crashed daemon.
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                       i online instance assigned complete
#> ws://127.0.0.1:5700/1 1      0        1        1        0
#> ws://127.0.0.1:5700/2 2      1        1        1        1

task1$data
#> 'unresolved' logi NA
task2$data
#> [1] 9144

# The task completes if I relaunch daemon1:
daemon1 <- callr::r_bg(\() mirai::daemon(url = "ws://127.0.0.1:5700/1"))
daemon1$get_pid()
#> [1] 9587
Sys.sleep(2)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                       i online instance assigned complete
#> ws://127.0.0.1:5700/1 1      1        2        1        0
#> ws://127.0.0.1:5700/2 2      1        1        1        1
Sys.sleep(11)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                       i online instance assigned complete
#> ws://127.0.0.1:5700/1 1      1        2        1        1
#> ws://127.0.0.1:5700/2 2      1        1        1        1
task1$data
#> [1] 9587
task2$data
#> [1] 9144

Answered by shikokuchuo

shikokuchuo · 2024-01-23T11:19:15Z

Maintainer

@wlandau this seems at odds with wlandau/crew#101 and the retry mechanism you already implemented...

The current behaviour is not surprising, and also seems to have remained the same throughout - I couldn't find a changelog entry. By design, the crashed task is isolated at the one daemon instance, so (assuming it's bad code) it doesn't go on and crash all 1,000 nodes in your HPC cluster!

Once it has crashed (where you see assigned > complete and online == 0), you have the option to (i) relaunch the daemon or (ii) use saisei(force = TRUE) to return the task as an 'errorValue'. The consuming application, e.g. targets, can then contain logic to re-submit the task or handle it otherwise.

Was there something else you wanted to try / suggest at the 'mirai' level?
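
For readers following along, option (ii) could be sketched roughly as below. This is only an illustration under stated assumptions: `stalled_daemons()` and `release_stalled()` are hypothetical helper names that do not exist in mirai, the example matrix is copied from the `status()` output in the reprex above, and the code assumes the dispatcher API of mirai 0.12.0, in which `saisei()` takes the daemon index `i`.

```r
# Hypothetical helper (not part of mirai): given the `daemons` matrix from
# mirai::status(), return the indices of daemons that are offline while
# still holding unfinished tasks (online == 0 and assigned > complete).
stalled_daemons <- function(daemons) {
  stalled <- daemons[, "online"] == 0 &
    daemons[, "assigned"] > daemons[, "complete"]
  unname(daemons[stalled, "i"])
}

# Example matrix mirroring the status() output shown after daemon1 was killed.
example_status <- matrix(
  c(1, 0, 1, 1, 0,
    2, 1, 1, 1, 0),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    c("ws://127.0.0.1:5700/1", "ws://127.0.0.1:5700/2"),
    c("i", "online", "instance", "assigned", "complete")
  )
)
stalled_daemons(example_status)
#> [1] 1

# Hypothetical wrapper, defined but not called here: force-regenerate the
# URL of each stalled daemon so its pending task resolves to an 'errorValue'.
release_stalled <- function() {
  daemons <- mirai::status()$daemons
  if (!is.matrix(daemons)) return(invisible(integer(0)))
  indices <- stalled_daemons(daemons)
  for (i in indices) mirai::saisei(i = i, force = TRUE)
  invisible(indices)
}
```

After `release_stalled()` runs, the consuming application could check `mirai::is_error_value(task1$data)` and, if that is `TRUE`, submit the same expression again with `mirai::mirai()`. Whether to re-submit at all is a policy decision for the caller, since the task itself may have caused the crash.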


wlandau Jan 23, 2024
Author

> @wlandau this seems at odds with wlandau/crew#101 and the retry mechanism you already implemented...

wlandau/crew#101 is about imposing a cap on the number of re-launches, but that issue is independent of how or when those re-launches happen in the first place.

> The current behaviour is not surprising, and also seems to have remained the same throughout - I couldn't find a changelog entry. By design, the crashed task is isolated at the one daemon instance, so (assuming it's bad code) it doesn't go on and crash all 1,000 nodes in your HPC cluster!

That's a really good point. Do we blame the platform for stopping the worker, as in a spot instance interruption, or do we blame the task for crashing it? In the former case, we might get a little performance boost from re-queuing the task. But in the latter case, that task could run rampant.

So I actually like the current behavior of mirai: to keep the task assigned to its original daemon. I will just need to fix crew so backlogged inactive workers always re-launch.
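
A minimal local sketch of that re-launch approach, in the style of the reprex above, might look like the following. It is not how crew implements it: `stalled_urls()` and `relaunch_stalled()` are hypothetical names, the example matrix is copied from the earlier `status()` output, and launching through `callr::r_bg()` simply mirrors the reprex rather than a real cloud launcher.

```r
# Hypothetical helper (not part of mirai or crew): websocket URLs of daemons
# that are offline but still have more tasks assigned than completed.
stalled_urls <- function(daemons) {
  stalled <- daemons[, "online"] == 0 &
    daemons[, "assigned"] > daemons[, "complete"]
  rownames(daemons)[stalled]
}

# Example matrix mirroring the status() output shown after daemon1 was killed.
example_status <- matrix(
  c(1, 0, 1, 1, 0,
    2, 1, 1, 1, 0),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    c("ws://127.0.0.1:5700/1", "ws://127.0.0.1:5700/2"),
    c("i", "online", "instance", "assigned", "complete")
  )
)
stalled_urls(example_status)
#> [1] "ws://127.0.0.1:5700/1"

# Hypothetical relauncher, defined but not called here: start a fresh
# background daemon at each stalled URL so the backlogged task can run.
relaunch_stalled <- function() {
  daemons <- mirai::status()$daemons
  if (!is.matrix(daemons)) return(list())
  lapply(stalled_urls(daemons), function(url) {
    callr::r_bg(function(url) mirai::daemon(url = url), args = list(url = url))
  })
}
```

Because the task stays with its original daemon URL, a task that crashes its worker can only take down the replacement at that one URL, which keeps the isolation property described above. A cap on the number of re-launches, as discussed in wlandau/crew#101, would sit on top of this.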
