RBMC: Check again for dead sibling service #69

Closed
wants to merge 1 commit into from

Contributor

@spinler spinler commented Jan 31, 2025

During some bad path testing the sibling daemon on each BMC would make it past the existing check done to make sure it was running and then die. This would cause the wait for the sibling interface to be on D-Bus to time out. At that point each BMC became active since it thought the sibling daemon was fine and just the sibling BMC had the problem.

Fix this by checking again if the sibling daemon is running when the sibling interface still isn't on D-Bus after waiting for it. If it isn't, become passive.
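The second check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SiblingStatus`, `RoleDecision`, and `recheckSiblingService` are hypothetical names, and the real `Manager::startup()` is a coroutine that gets these values over D-Bus. The sketch only covers the new early exit; when it returns no decision, the existing role rules are assumed to run unchanged.

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins used only for this illustration.
enum class Role
{
    Active,
    Passive
};

struct SiblingStatus
{
    bool interfacePresent; // sibling interface showed up on D-Bus during the wait
    bool serviceRunning;   // sibling service state, queried again after the wait
};

struct RoleDecision
{
    Role role;
    std::string reason;
};

// Called after the wait for the sibling has finished.
// Returns a forced Passive decision when the interface never appeared and the
// sibling service turns out to have died; otherwise returns std::nullopt so
// the caller falls through to its normal role determination.
std::optional<RoleDecision> recheckSiblingService(const SiblingStatus& status)
{
    if (!status.interfacePresent && !status.serviceRunning)
    {
        return RoleDecision{Role::Passive,
                            "Sibling BMC service is not running"};
    }
    return std::nullopt;
}
```

With the interface missing and the service dead, the function returns a Passive decision with the reason string seen in the log output. With the interface missing but the service still running, it returns nothing, and the BMC is free to conclude that the sibling BMC itself has the problem.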

Tested:

This is seen on each BMC:

Waiting for sibling interface and/or heartbeat: Present = False, Heartbeat = False
Done waiting for sibling. Interface present = False, heartbeat = False
Sibling service state is failed
Role = xyz.openbmc_project.State.BMC.Redundancy.Role.Passive due to: Sibling BMC service is not running

Signed-off-by: Matt Spinler <spinler@us.ibm.com>
@@ -70,13 +69,20 @@ sdbusplus::async::task<> Manager::startup()
{
co_await sibling->waitForSiblingUp(siblingTimeout);

if (previousRole == Role::Passive)
// Sibling service may have died. Check again.
Contributor

I've looked at this for a while now, and I'm sure it's right, but it feels like we're starting to work ourselves into an if/else wormhole, with tests now needed for every path and a lot of added complexity. Is the sibling service dying a real use case, or something that was only possible with special error injection?

Contributor Author

It happened when the cfam daemon failed to access the CFAM registers, which I think is a valid failure, and right now I just have the daemon crash when that happens. I did it that way because I couldn't think of another way to alert the rbmc manager that the sibling can't get any info from it. If there were another way for the rbmc manager to know that FSI is broken, I wouldn't worry about this case here. I'll think a bit more on it.

Contributor Author

spinler commented Mar 13, 2025

Moved this over to an 1120 PR: #77. Will close this one.

@spinler spinler closed this Mar 13, 2025
@spinler spinler deleted the rbmc_dead_sibling_check branch March 13, 2025 19:02