Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unverified checkpoint marker #583

Draft
wants to merge 63 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 55 commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
6bac177
unverified marker creation and deletion. ignore unverified checkpoint…
ShuoWangNSL May 6, 2024
a2b2b88
remove_states_below protect verified checkpoints
ShuoWangNSL May 22, 2024
d37aab1
TODOs
ShuoWangNSL May 27, 2024
c8ba97c
rename
ShuoWangNSL May 28, 2024
103f4d9
use unfiltered checkpoint heights
ShuoWangNSL May 28, 2024
921dd36
remove marker after loading
ShuoWangNSL May 28, 2024
04b05aa
be more cautious
ShuoWangNSL May 28, 2024
ff49fd5
fix
ShuoWangNSL May 28, 2024
20d9ff0
do not copy the marker back because readonly files make the next crea…
ShuoWangNSL May 29, 2024
4534dcf
refine
ShuoWangNSL May 29, 2024
1af0935
test
ShuoWangNSL May 29, 2024
d0982d8
propagate errors
ShuoWangNSL May 29, 2024
ddcf44e
catch panic
ShuoWangNSL May 29, 2024
e05a66e
test both state sync and restart
ShuoWangNSL May 30, 2024
67262dd
restart
ShuoWangNSL May 30, 2024
962432e
assert
ShuoWangNSL May 30, 2024
1efe5d5
simplify
ShuoWangNSL May 30, 2024
5ae8b1e
drop state_manager
ShuoWangNSL May 30, 2024
d5c8af3
fix
ShuoWangNSL May 30, 2024
7c6dfac
change
ShuoWangNSL May 30, 2024
e0ce343
cleanup
ShuoWangNSL May 30, 2024
afc94df
cleanup
ShuoWangNSL May 30, 2024
0fdfd09
checkpoint creation test
ShuoWangNSL May 30, 2024
2b3622c
comments
ShuoWangNSL May 31, 2024
4fb05b0
accept existing marker
ShuoWangNSL May 31, 2024
86961f3
fmt
ShuoWangNSL May 31, 2024
3af979f
cleanup
ShuoWangNSL May 31, 2024
e3d6e50
comment
ShuoWangNSL May 31, 2024
31e41eb
fix
ShuoWangNSL May 31, 2024
d32aa7d
assert in test
ShuoWangNSL May 31, 2024
4c7f966
address comments
ShuoWangNSL Jun 3, 2024
3541b2c
improve comments
ShuoWangNSL Jun 3, 2024
d770246
Merge branch 'master' into shuo/checkpoint_unverified_marker
ShuoWangNSL Jun 11, 2024
58774d5
test can_handle_state_sync_and_commit_race_condition
ShuoWangNSL Jun 11, 2024
9d0a14a
debug_assert lsmt status
ShuoWangNSL Jun 12, 2024
206492a
cp_layout.is_checkpoint_verified
ShuoWangNSL Jun 12, 2024
22c15d9
debug assert marker is not in manifest
ShuoWangNSL Jun 12, 2024
6c21f64
Merge branch 'master' into shuo/unverified_marker_test
ShuoWangNSL Jun 13, 2024
0c87d95
state layout tests
ShuoWangNSL Jun 13, 2024
e63ba28
fix tests
ShuoWangNSL Jun 13, 2024
9fe16da
non-empty checkpoint
ShuoWangNSL Jun 19, 2024
329ac45
Merge branch 'master' into shuo/checkpoint_unverified_marker
ShuoWangNSL Jun 19, 2024
4c9fb00
Merge branch 'master' into shuo/checkpoint_unverified_marker
ShuoWangNSL Jun 28, 2024
961c15f
clippy
ShuoWangNSL Jun 28, 2024
d64c210
Merge branch 'master' into shuo/unverified_marker_test
ShuoWangNSL Jul 3, 2024
778a7ff
clippy
ShuoWangNSL Jul 3, 2024
9750c74
Merge branch 'master' into shuo/unverified_marker_test
ShuoWangNSL Jul 8, 2024
68201ad
Merge branch 'master' into shuo/unverified_marker_test
ShuoWangNSL Jul 10, 2024
973c392
remove pub fn checkpoint_untracked
ShuoWangNSL Jul 10, 2024
89e7b0f
add manifest_computation_including_marker_critical error
ShuoWangNSL Jul 10, 2024
ac11bdd
Merge branch 'master' into shuo/unverified_marker_test
ShuoWangNSL Jul 15, 2024
8e994c9
fix tests
ShuoWangNSL Jul 16, 2024
60d6b48
Merge branch 'master' into shuo/unverified_marker_test
ShuoWangNSL Jul 22, 2024
7c08925
add Verification marker and LoadingPolicy
ShuoWangNSL Jul 24, 2024
16921ee
explicitly specify checkpoint_verified
ShuoWangNSL Jul 24, 2024
037fa6a
fix permission
ShuoWangNSL Jul 25, 2024
5fdddf8
Consume CheckpointLayout and convert
ShuoWangNSL Jul 25, 2024
ae045af
InVerification
ShuoWangNSL Jul 25, 2024
72296a8
no simple for workaround for strict type requirements
ShuoWangNSL Jul 25, 2024
74f788e
reset tip to generic
ShuoWangNSL Jul 26, 2024
cd060ea
test cleanup checkpoint()
ShuoWangNSL Jul 26, 2024
2b7be7b
safe transit
ShuoWangNSL Jul 26, 2024
d3366fc
Merge branch 'master' into shuo/unverified_marker_test
ShuoWangNSL Jul 29, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion rs/replicated_state/src/page_map/storage.rs
Original file line number Diff line number Diff line change
Expand Up @@ -181,7 +181,7 @@ impl Storage {
let mut base_overlays = Vec::<OverlayFile>::new();
let mut overlays = Vec::<OverlayFile>::new();
for path in overlay_paths.iter() {
let overlay = OverlayFile::load(path).unwrap();
let overlay = OverlayFile::load(path)?;
let start_page_index = overlay
.index_iter()
.next()
Expand Down
10 changes: 9 additions & 1 deletion rs/state_layout/src/error.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,10 @@ pub enum LayoutError {

/// The state root doesn't have the expected layout. It's either wrong or
/// was corrupted.
CorruptedLayout { path: PathBuf, message: String },
CorruptedLayout {
path: PathBuf,
message: String,
},

/// Checkpoint at the specified height already exists.
AlreadyExists(Height),
Expand All @@ -24,6 +27,8 @@ pub enum LayoutError {

/// Trying to remove the latest checkpoint.
LatestCheckpoint(Height),

CheckpointUnverified(Height),
}

impl fmt::Display for LayoutError {
Expand Down Expand Up @@ -57,6 +62,9 @@ impl fmt::Display for LayoutError {
"Trying to remove the latest checkpoint at height @{}",
height
),
Self::CheckpointUnverified(height) => {
write!(f, "Checkpoint @{} is not verified", height)
}
}
}
}
Expand Down
123 changes: 113 additions & 10 deletions rs/state_layout/src/state_layout.rs
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,8 @@ pub struct RwPolicy<'a, Owner> {
lifetime_tag: PhantomData<&'a Owner>,
}

pub enum Verification {}
ShuoWangNSL marked this conversation as resolved.
Show resolved Hide resolved

pub trait AccessPolicy {
/// `check_dir` specifies what to do the first time we enter a
/// directory while reading/writing a checkpoint.
Expand All @@ -87,11 +89,15 @@ pub trait ReadPolicy: AccessPolicy {}
pub trait WritePolicy: AccessPolicy {}
pub trait ReadWritePolicy: ReadPolicy + WritePolicy {}

pub trait LoadingPolicy: ReadPolicy {}

impl<T> ReadWritePolicy for T where T: ReadPolicy + WritePolicy {}

impl AccessPolicy for ReadOnly {}
impl ReadPolicy for ReadOnly {}

impl LoadingPolicy for ReadOnly {}

impl AccessPolicy for WriteOnly {
/// For `WriteOnly` mode we want to ensure the directory exists
/// when we visit it for the first time as we'll certainly want to
Expand All @@ -116,6 +122,10 @@ impl<'a, T> AccessPolicy for RwPolicy<'a, T> {
impl<'a, T> ReadPolicy for RwPolicy<'a, T> {}
impl<'a, T> WritePolicy for RwPolicy<'a, T> {}

impl AccessPolicy for Verification {}
impl ReadPolicy for Verification {}
impl LoadingPolicy for Verification {}

pub type CompleteCheckpointLayout = CheckpointLayout<ReadOnly>;

/// This struct contains bits of the `ExecutionState` that are not already
Expand Down Expand Up @@ -371,6 +381,14 @@ impl TipHandler {
if path.extension() == Some(OsStr::new("pbuf")) {
// Do not copy protobufs.
CopyInstruction::Skip
} else if path == cp.unverified_checkpoint_marker() {
// With LSMT disabled, the unverified checkpoint marker is still present in the checkpoint at this point.
// We should not copy it back to the tip because it will be created later when promoting the tip as the next checkpoint.
// When we go for asynchronous checkpointing in the future, we should revisit this as the marker file will have a different lifespan.

// With LSMT enabled, the unverified checkpoint marker should already be removed at this point.
debug_assert_eq!(lsmt_storage, FlagStatus::Disabled);
CopyInstruction::Skip
} else if path.extension() == Some(OsStr::new("bin"))
&& lsmt_storage == FlagStatus::Disabled
{
Expand Down Expand Up @@ -495,7 +513,7 @@ impl StateLayout {
thread_pool: &mut Option<scoped_threadpool::Pool>,
) -> Result<(), LayoutError> {
for height in self.checkpoint_heights()? {
let path = self.checkpoint(height)?.raw_path().to_path_buf();
let path = self.checkpoint_verified(height)?.raw_path().to_path_buf();
sync_and_mark_files_readonly(&self.log, &path, &self.metrics, thread_pool.as_mut())
.map_err(|err| LayoutError::IoError {
path,
Expand Down Expand Up @@ -616,7 +634,7 @@ impl StateLayout {
layout: CheckpointLayout<RwPolicy<'_, T>>,
height: Height,
thread_pool: Option<&mut scoped_threadpool::Pool>,
) -> Result<CheckpointLayout<ReadOnly>, LayoutError> {
) -> Result<CheckpointLayout<Verification>, LayoutError> {
debug_assert_eq!(height, layout.height);
let scratchpad = layout.raw_path();
let checkpoints_path = self.checkpoints();
Expand Down Expand Up @@ -647,7 +665,7 @@ impl StateLayout {
message: "Could not sync checkpoints".to_string(),
io_err: err,
})?;
self.checkpoint(height)
self.checkpoint_in_verification(height)
}

pub fn clone_checkpoint(&self, from: Height, to: Height) -> Result<(), LayoutError> {
Expand All @@ -668,9 +686,13 @@ impl StateLayout {
Ok(())
}

// TODO: make this method private.
/// Returns the layout of the checkpoint with the given height (if
/// there is one).
pub fn checkpoint(&self, height: Height) -> Result<CheckpointLayout<ReadOnly>, LayoutError> {
pub fn checkpoint<P>(&self, height: Height) -> Result<CheckpointLayout<P>, LayoutError>
where
P: LoadingPolicy,
{
let cp_name = Self::checkpoint_name(height);
let path = self.checkpoints().join(cp_name);
if !path.exists() {
Expand Down Expand Up @@ -699,20 +721,36 @@ impl StateLayout {
}
}
}
CheckpointLayout::new(path, height, self.clone())
CheckpointLayout::<P>::new(path, height, self.clone())
}

/// Returns the untracked `CheckpointLayout` for the given height (if there is one).
pub fn checkpoint_untracked(
pub fn checkpoint_verified(
&self,
height: Height,
) -> Result<CheckpointLayout<ReadOnly>, LayoutError> {
let cp = self.checkpoint::<ReadOnly>(height)?;
// TODO: move this check to the checkpoint layout constructor.
if !cp.is_checkpoint_verified() {
return Err(LayoutError::CheckpointUnverified(height));
};
Ok(cp)
}

pub fn checkpoint_in_verification(
&self,
height: Height,
) -> Result<CheckpointLayout<Verification>, LayoutError> {
self.checkpoint::<Verification>(height)
}

pub fn checkpoint_verification_status(&self, height: Height) -> Result<bool, LayoutError> {
let cp_name = Self::checkpoint_name(height);
let path = self.checkpoints().join(cp_name);
if !path.exists() {
return Err(LayoutError::NotFound(height));
}
CheckpointLayout::new_untracked(path, height)
let cp = CheckpointLayout::<ReadOnly>::new_untracked(path, height)?;
Ok(cp.is_checkpoint_verified())
}

fn increment_checkpoint_ref_counter(&self, height: Height) {
Expand Down Expand Up @@ -760,8 +798,19 @@ impl StateLayout {
}
}

/// Returns a sorted list of `Height`s for which a checkpoint is available.
/// Returns a sorted list of `Height`s for which a checkpoint is available and verified.
pub fn checkpoint_heights(&self) -> Result<Vec<Height>, LayoutError> {
let checkpoint_heights = self
.unfiltered_checkpoint_heights()?
.into_iter()
.filter(|h| self.checkpoint_verification_status(*h).unwrap_or(false))
.collect();

Ok(checkpoint_heights)
}

/// Returns a sorted list of `Height`s for which a checkpoint is available, regardless of verification status.
pub fn unfiltered_checkpoint_heights(&self) -> Result<Vec<Height>, LayoutError> {
let names = dir_file_names(&self.checkpoints()).map_err(|err| LayoutError::IoError {
path: self.checkpoints(),
message: format!("Failed to get all checkpoints (err kind: {:?})", err.kind()),
Expand Down Expand Up @@ -883,7 +932,7 @@ impl StateLayout {
if heights.is_empty() {
error!(
self.log,
"Trying to remove non-existing checkpoint {}. The CheckpoinLayout was invalid",
"Trying to remove non-existing checkpoint {}. The CheckpointLayout was invalid",
height,
);
self.metrics
Expand Down Expand Up @@ -1416,6 +1465,60 @@ impl<Permissions: AccessPolicy> CheckpointLayout<Permissions> {
pub fn raw_path(&self) -> &Path {
&self.root
}

/// Returns if the checkpoint is marked as unverified or not.
pub fn is_checkpoint_verified(&self) -> bool {
!self.unverified_checkpoint_marker().exists()
}
}

impl<P> CheckpointLayout<P>
where
P: WritePolicy,
{
/// Creates the unverified checkpoint marker.
/// If the marker already exists, this function does nothing and returns `Ok(())`.
pub fn create_unverified_checkpoint_marker(&self) -> Result<(), LayoutError> {
let marker = self.unverified_checkpoint_marker();
if marker.exists() {
return Ok(());
}
open_for_write(&marker)?;
sync_path(&self.root).map_err(|err| LayoutError::IoError {
path: self.root.clone(),
message: "Failed to sync checkpoint directory for the creation of the unverified checkpoint marker".to_string(),
io_err: err,
})
}
}

impl CheckpointLayout<Verification> {
/// Removes the unverified checkpoint marker.
/// If the marker does not exist, this function does nothing and returns `Ok(())`.
pub fn remove_unverified_checkpoint_marker(&self) -> Result<(), LayoutError> {
let marker = self.unverified_checkpoint_marker();
if !marker.exists() {
return Ok(());
}
match std::fs::remove_file(&marker) {
Err(err) if err.kind() != std::io::ErrorKind::NotFound => {
return Err(LayoutError::IoError {
path: marker.to_path_buf(),
message: "failed to remove file from disk".to_string(),
io_err: err,
});
}
_ => {}
}

// Sync the directory to make sure the marker is removed from disk.
// This is strict prerequisite for the manifest computation.
sync_path(&self.root).map_err(|err| LayoutError::IoError {
path: self.root.clone(),
message: "Failed to sync checkpoint directory for the creation of the unverified checkpoint marker".to_string(),
io_err: err,
})
}
}

pub struct PageMapLayout<Permissions: AccessPolicy> {
Expand Down
48 changes: 48 additions & 0 deletions rs/state_layout/src/state_layout/tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -354,6 +354,54 @@ fn test_last_removal_panics_in_debug() {
});
}

#[test]
fn test_can_remove_unverified_marker_file_twice() {
with_test_replica_logger(|log| {
let tempdir = tmpdir("state_layout");
let root_path = tempdir.path().to_path_buf();
let metrics_registry = ic_metrics::MetricsRegistry::new();
let state_layout = StateLayout::try_new(log, root_path, &metrics_registry).unwrap();

let height = Height::new(1);
let state_sync_scratchpad = state_layout.state_sync_scratchpad(height).unwrap();
let scratchpad_layout =
CheckpointLayout::<RwPolicy<()>>::new_untracked(state_sync_scratchpad, height)
.expect("failed to create checkpoint layout");
// Create at least a file in the scratchpad layout. Otherwise, empty folders can be overridden without errors
// and calling "scratchpad_to_checkpoint" twice will not fail as expected.
File::create(scratchpad_layout.raw_path().join(SYSTEM_METADATA_FILE)).unwrap();

let tip_path = state_layout.tip_path();
let tip = CheckpointLayout::<RwPolicy<()>>::new_untracked(tip_path, height)
.expect("failed to create tip layout");
File::create(tip.raw_path().join(SYSTEM_METADATA_FILE)).unwrap();

// Create marker files in both the scratchpad and tip and try to promote them to a checkpoint.
scratchpad_layout
.create_unverified_checkpoint_marker()
.unwrap();
tip.create_unverified_checkpoint_marker().unwrap();

let checkpoint = state_layout
.scratchpad_to_checkpoint(scratchpad_layout, height, None)
.unwrap();
checkpoint.remove_unverified_checkpoint_marker().unwrap();

// The checkpoint already exists, therefore promoting the tip to checkpoint should fail.
// However, it can still access the checkpoint and try to remove the marker file again from its perspective.
let checkpoint_result = state_layout.scratchpad_to_checkpoint(tip, height, None);
assert!(checkpoint_result.is_err());

let res = state_layout
.checkpoint(height)
.unwrap()
.remove_unverified_checkpoint_marker();
assert!(res.is_ok());

drop(checkpoint);
});
}

#[test]
fn test_canister_id_from_path() {
assert_eq!(
Expand Down
14 changes: 9 additions & 5 deletions rs/state_manager/src/checkpoint.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,9 @@ use ic_replicated_state::{
ExecutionState, ReplicatedState, SchedulerState, SystemState,
};
use ic_replicated_state::{CheckpointLoadingMetrics, Memory};
use ic_state_layout::{CanisterLayout, CanisterStateBits, CheckpointLayout, ReadOnly};
use ic_state_layout::{
CanisterLayout, CanisterStateBits, CheckpointLayout, LoadingPolicy, ReadOnly,
};
use ic_types::batch::RawQueryStats;
use ic_types::{CanisterTimer, Height, Time};
use ic_utils::thread::parallel_map;
Expand Down Expand Up @@ -131,13 +133,15 @@ pub(crate) fn make_checkpoint(
)?
};

cp.remove_unverified_checkpoint_marker()?;

Ok((cp, state, has_downgrade))
}

/// Calls [load_checkpoint] with a newly created thread pool.
/// See [load_checkpoint] for further details.
pub fn load_checkpoint_parallel(
checkpoint_layout: &CheckpointLayout<ReadOnly>,
pub fn load_checkpoint_parallel<T: LoadingPolicy>(
checkpoint_layout: &CheckpointLayout<T>,
own_subnet_type: SubnetType,
metrics: &CheckpointMetrics,
fd_factory: Arc<dyn PageAllocatorFileDescriptor>,
Expand All @@ -155,8 +159,8 @@ pub fn load_checkpoint_parallel(

/// Loads the node state heighted with `height` using the specified
/// directory layout.
pub fn load_checkpoint(
checkpoint_layout: &CheckpointLayout<ReadOnly>,
pub fn load_checkpoint<T: LoadingPolicy>(
checkpoint_layout: &CheckpointLayout<T>,
own_subnet_type: SubnetType,
metrics: &CheckpointMetrics,
thread_pool: Option<&mut scoped_threadpool::Pool>,
Expand Down
Loading
Loading