Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance daemon states management to improve usability and stability #152

Merged
merged 13 commits into from
Sep 13, 2022

Conversation

changweige
Copy link
Member

@changweige changweige commented Aug 30, 2022

Several community users and nydus contributors have complained about nydus-snapshotter's usability and stability.

  1. nydusd is not ready so the container image can't be accessed
  2. nydus image can't be accessed after rebooting the host

To solve the issues once and for all, this PR refactors nydusd daemons states management. It decouples in-memory states cache from DB persistence storage. Snapshotter never empties the DB when restarting but to rebuild the in-memory daemon states cache to get consistent with DB persistent storage. In addition, it will request API to restarted nydusd daemon to recover the API mounts.

Several E2E cases are accompanied.

Address: #142 #125

@changweige changweige changed the title Recover after restart wip: Recover after restart Aug 30, 2022
@changweige changweige changed the title wip: Recover after restart Enhance daemon states management to improve usability and stability Aug 31, 2022
@changweige changweige force-pushed the recover-after-restart branch 5 times, most recently from 2c733e2 to 154e065 Compare September 7, 2022 02:22
@codecov-commenter
Copy link

codecov-commenter commented Sep 7, 2022

Codecov Report

Base: 30.00% // Head: 30.25% // Increases project coverage by +0.25% 🎉

Coverage data is based on head (f203e5d) compared to base (8a8d51a).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #152      +/-   ##
==========================================
+ Coverage   30.00%   30.25%   +0.25%     
==========================================
  Files          20       20              
  Lines        1650     1626      -24     
==========================================
- Hits          495      492       -3     
+ Misses       1090     1069      -21     
  Partials       65       65              
Impacted Files Coverage Δ
pkg/filesystem/fs/fs.go 1.10% <0.00%> (-0.05%) ⬇️
pkg/store/daemonstore.go 0.00% <0.00%> (ø)
pkg/filesystem/fs/blob_manager.go 44.44% <0.00%> (-11.12%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

if _, e := fs.getSharedDaemon(); e != nil {
log.G(ctx).Warnf("initialize the shared nydus daemon")
if d := fs.getSharedDaemon(); d != nil && !d.Connected {
log.G(ctx).Infof("initialize the shared nydus daemon")
fs.initSharedDaemon(ctx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error needs to be handled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if fs.mode == SharedInstance {
if d.ID != daemon.SharedNydusDaemonID && sharedDaemonConnected {
if err := d.SharedMount(); err != nil {
return nil, errors.Errorf("failed to mount rafs when recovering, %v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return nil, errors.Wrapf(err, "failed to mount rafs when recovering") should be better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}
} else {
if !d.Connected {
fs.manager.StartDaemon(d)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error needs to be handled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

s.idxBySnapshotID[daemon.SnapshotID] = daemon

if ok {
return old
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to return old? seems it doesn't to be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is just a hint to callers. It is up to the caller if they need to check the daemon is previously inserted.
By returning old daemon pointer, caller has a chance to tell it is overwitten.

deleted = victim
continue
}
ds = append(ds, victim)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the victim added to the survivor list? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The matching victim is already skipped. Please check it out on L83.

return old
}

func (s *DaemonStates) RemoveByDaemonID(id string) *daemon.Daemon {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem it not to be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It is the original semantic that daemon can be removed by its snapshot ID. So I transformed it a bit.

return daemon
}

func (s *DaemonStates) GetBySnapshotID(id string, op func(s *DaemonStates, d *daemon.Daemon)) *daemon.Daemon {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to put s as a parameter of the callback func?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. For current usage, the callback is to delete an item from daemons states cache. So it has to access the s as daemon states manager

}

listed := make([]*daemon.Daemon, len(s.daemons))
copy(listed, s.daemons)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to copy one?

Copy link
Member Author

@changweige changweige Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It is protected by the mutex lock. If not copying the slice, we can't say the returned slice is consistent.

Decouple daemon states cache from persistence storage in DB.
So it is easier to do nydusd daemon recover.

Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
Signed-off-by: Changwei Ge <gechangwei@bytedance.com>
@changweige changweige merged commit f8cfe38 into containerd:main Sep 13, 2022
@changweige changweige deleted the recover-after-restart branch September 26, 2022 03:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants