-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Implement actor checkpointing #3839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
38 commits
Select commit
Hold shift + click to select a range
dd587e4
Implement Actor checkpointing
raulchen 4926385
docs
raulchen 61d85f4
fix
raulchen 0de627a
fix
raulchen c61ee1d
fix
raulchen 1a0e818
move restore-from-checkpoint to HandleActorStateTransition
raulchen fb8e9dc
Revert "move restore-from-checkpoint to HandleActorStateTransition"
raulchen 2f446dc
resubmit waiting tasks when actor frontier restored
raulchen b31f374
add doc about num_actor_checkpoints_to_keep=1
raulchen 8700b07
add num_actor_checkpoints_to_keep to Cython
raulchen 17ebcba
add checkpoint_expired api
raulchen f4d7bb3
check if actor class is abstract
raulchen b0ae7dd
change checkpoint_ids to long string
raulchen 6c5c130
implement java
raulchen e5e216f
Refactor to delay actor creation publish until checkpoint is resumed
stephanie-wang 3e5630f
debug, lint
stephanie-wang dd1f405
Erase from checkpoints to restore if task fails
stephanie-wang 683002c
fix lint
raulchen 4b11a3e
update comments
raulchen 40c8c1d
avoid duplicated actor notification log
raulchen 4c450f9
fix unintended change
raulchen a3c4397
add actor_id to checkpoint_expired
raulchen f236a47
small java updates
raulchen f3955e7
make checkpoint info per actor
raulchen b7bae09
lint
stephanie-wang 33f07ab
Remove logging
stephanie-wang e16a4bb
Remove old actor checkpointing Python code, move new checkpointing co…
stephanie-wang c80889a
Replace old actor checkpointing tests
stephanie-wang 8bf4b61
Fix test and lint
stephanie-wang 1d64d55
address comments
raulchen 10999c5
consolidate kill_actor
raulchen ee58192
Merge branch 'master' into actor_checkpoint
stephanie-wang 9c7da6d
Remove __ray_checkpoint__
stephanie-wang 821fb8c
Merge branch 'master' into actor_checkpoint
raulchen e938845
fix non-ascii char
raulchen 6e3985f
Loosen test checks
stephanie-wang c70a499
fix java
raulchen 428d8b5
fix sphinx-build
raulchen File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
package org.ray.api; | ||
|
||
import java.util.List; | ||
import org.ray.api.id.UniqueId; | ||
|
||
public interface Checkpointable { | ||
|
||
class CheckpointContext { | ||
|
||
/** | ||
* Actor's ID. | ||
*/ | ||
public final UniqueId actorId; | ||
/** | ||
* Number of tasks executed since last checkpoint. | ||
*/ | ||
public final int numTasksSinceLastCheckpoint; | ||
/** | ||
* Time elapsed since last checkpoint, in milliseconds. | ||
*/ | ||
public final long timeElapsedMsSinceLastCheckpoint; | ||
|
||
public CheckpointContext(UniqueId actorId, int numTasksSinceLastCheckpoint, | ||
long timeElapsedMsSinceLastCheckpoint) { | ||
this.actorId = actorId; | ||
this.numTasksSinceLastCheckpoint = numTasksSinceLastCheckpoint; | ||
this.timeElapsedMsSinceLastCheckpoint = timeElapsedMsSinceLastCheckpoint; | ||
} | ||
} | ||
|
||
class Checkpoint { | ||
|
||
/** | ||
* Checkpoint's ID. | ||
*/ | ||
public final UniqueId checkpointId; | ||
/** | ||
* Checkpoint's timestamp. | ||
*/ | ||
public final long timestamp; | ||
|
||
public Checkpoint(UniqueId checkpointId, long timestamp) { | ||
this.checkpointId = checkpointId; | ||
this.timestamp = timestamp; | ||
} | ||
} | ||
|
||
/** | ||
* Whether this actor needs to be checkpointed. | ||
* | ||
* This method will be called after every task. You should implement this callback to decide | ||
* whether this actor needs to be checkpointed at this time, based on the checkpoint context, or | ||
* any other factors. | ||
* | ||
* @param checkpointContext An object that contains info about last checkpoint. | ||
* @return A boolean value that indicates whether this actor needs to be checkpointed. | ||
*/ | ||
boolean shouldCheckpoint(CheckpointContext checkpointContext); | ||
|
||
/** | ||
* Save a checkpoint to persistent storage. | ||
* | ||
* If `shouldCheckpoint` returns true, this method will be called. You should implement this | ||
* callback to save actor's checkpoint and the given checkpoint id to persistent storage. | ||
* | ||
* @param actorId Actor's ID. | ||
* @param checkpointId An ID that represents this actor's current state in GCS. You should | ||
* save this checkpoint ID together with actor's checkpoint data. | ||
*/ | ||
void saveCheckpoint(UniqueId actorId, UniqueId checkpointId); | ||
|
||
/** | ||
* Load actor's previous checkpoint, and restore actor's state. | ||
* | ||
* This method will be called when an actor is reconstructed, after actor's constructor. If the | ||
* actor needs to restore from previous checkpoint, this function should restore actor's state and | ||
* return the checkpoint ID. Otherwise, it should do nothing and return null. | ||
* | ||
* @param actorId Actor's ID. | ||
* @param availableCheckpoints A list of available checkpoint IDs and their timestamps, sorted | ||
* by timestamp in descending order. Note, this method must return the ID of one checkpoint in | ||
* this list, or null. Otherwise, an exception will be thrown. | ||
* @return The ID of the checkpoint from which the actor was resumed, or null if the actor should | ||
* restart from the beginning. | ||
*/ | ||
UniqueId loadCheckpoint(UniqueId actorId, List<Checkpoint> availableCheckpoints); | ||
|
||
/** | ||
* Delete an expired checkpoint; | ||
* | ||
* This method will be called when an checkpoint is expired. You should implement this method to | ||
* delete your application checkpoint data. Note, the maximum number of checkpoints kept in the | ||
* backend can be configured at `RayConfig.num_actor_checkpoints_to_keep`. | ||
* | ||
* @param actorId ID of the actor. | ||
* @param checkpointId ID of the checkpoint that has expired. | ||
*/ | ||
void checkpointExpired(UniqueId actorId, UniqueId checkpointId); | ||
} |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.