Skip to content

Improve recovery time in worker failure scenarios #3184

Open
@fjetter

Description

@fjetter

We are operating our distributed clusters in a cloud environment where we need to deal with frequently failing nodes. We usually dispatch jobs automatically and are bound to certain SLAs and therefore expect our jobs to finish in a more or less well defined time.
While distributed offers resilience in terms of graph recalculation we're facing the issue that the recalculation introduces severe performance issues for us.

We are looking for something which would allow us to recover faster in scenarios where individual workers die such that we do not need to recalculate large, expensive chunks of the graph, e.g. by persisting or replicating valuable, small intermediate results.

Ideally the solution would be handled by the scheduler itself, s.t. many different applications can benefit of it (e.g. via a scheduler plugin/extension). We were thinking about milestone/snapshotting where the user can label certain results to be worthy to be repliacated (and later forgotten once another milestone passes/completes). We also discussed some kind of automatic replication based on heuristics (e.g. bytes_result < x and runtime of task > Y -> replicate result) to soften the blow in case of failures.

My questions would be:

  1. Does anybody have additional ideas we should take into account?
  2. Could anybody already gather some experience with similar situations which might help us?
  3. If we would start to implement something like this, what would be the best approach / where should we start?
  4. Are we doing something fundamentally wrong? :)

Researching existing github issues, I only found #2748 which discusses this scenario briefly but is ultimately closed without a proper resolution to this topic. The only solution which is suggested is a caching library but persisting every single result is most likely not an option.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions