Getting zpool information fails for faulted devices #1577

@bnaecker

This function is used to collect information about the zpools on the system, which the sled agent then manages as storage devices for things like the databases or Crucible. It collects several fields for each pool, such as the total size and the number of allocated bytes. For a faulted pool, however, these fields are not present: zpool list reports each of them as "-" rather than a number. Specifically, we'd see:

bnaecker@feldspar : ~/omicron $ zpool list -Hpo name,size,allocated,free,health oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b	-	-	-	FAULTED
bnaecker@feldspar : ~/omicron $

The code in that function attempts to parse the string "-" as a number, which fails. This can prevent the sled agent from making any further progress, as the log shows:

{"msg":"failed to start sled agent","v":0,"name":"SledAgent","level":40,"time":"2022-08-11T18:17:33.839069674Z","hostname":"feldspar","pid":2455,"component":"BootstrapAgentRssHandler","error":"ServerFailure(\"Sled agent request failed: Error starting sled agent: Could not start sled agent server: Error managing storage: Failed to get info for zpool 'oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Failed to parse output: Failed to parse field 'size': invalid digit found in string\")"}

We probably want to handle this more gracefully, for example by not trying to parse the numeric fields when the pool is faulted.
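One possible shape for the more graceful handling is sketched below. It is a minimal, self-contained illustration and not the actual omicron code: the names `ZpoolInfo`, `ZpoolHealth`, `parse_optional_u64`, and `parse_zpool_line` are hypothetical, and the sketch assumes the tab-separated, five-column output of the `zpool list -Hpo name,size,allocated,free,health` invocation shown above. The idea is to treat "-" as "no value" and represent the numeric fields as `Option<u64>`, so a faulted pool still yields a usable record.

```rust
use std::str::FromStr;

/// Pool health as printed in the last column of `zpool list`.
/// (Hypothetical type for this sketch; only the states listed here are recognized.)
#[derive(Debug, PartialEq)]
enum ZpoolHealth {
    Online,
    Degraded,
    Faulted,
    Offline,
    Removed,
    Unavailable,
}

impl FromStr for ZpoolHealth {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "ONLINE" => Ok(Self::Online),
            "DEGRADED" => Ok(Self::Degraded),
            "FAULTED" => Ok(Self::Faulted),
            "OFFLINE" => Ok(Self::Offline),
            "REMOVED" => Ok(Self::Removed),
            "UNAVAIL" => Ok(Self::Unavailable),
            other => Err(format!("unrecognized health '{other}'")),
        }
    }
}

/// One parsed row. Numeric fields are optional because a faulted pool
/// reports "-" for them.
#[derive(Debug, PartialEq)]
struct ZpoolInfo {
    name: String,
    size: Option<u64>,
    allocated: Option<u64>,
    free: Option<u64>,
    health: ZpoolHealth,
}

/// Parse a numeric column, mapping the "-" placeholder to `None`.
fn parse_optional_u64(field: &str, value: &str) -> Result<Option<u64>, String> {
    if value == "-" {
        return Ok(None);
    }
    value
        .parse::<u64>()
        .map(Some)
        .map_err(|e| format!("Failed to parse field '{field}': {e}"))
}

/// Parse a single tab-separated line of `zpool list -Hpo ...` output.
fn parse_zpool_line(line: &str) -> Result<ZpoolInfo, String> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    let [name, size, allocated, free, health] = fields.as_slice() else {
        return Err(format!("expected 5 fields, found {}", fields.len()));
    };
    Ok(ZpoolInfo {
        name: name.to_string(),
        size: parse_optional_u64("size", size)?,
        allocated: parse_optional_u64("allocated", allocated)?,
        free: parse_optional_u64("free", free)?,
        health: health.parse()?,
    })
}
```

With this approach, the faulted line from the output above parses into a record whose `size`, `allocated`, and `free` are all `None` and whose `health` is `Faulted`, while a genuinely malformed number still produces an error. Callers can then decide what to do with a faulted pool (skip it, report it, and so on) instead of failing sled agent startup.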

Metadata

Labels

Debugging: For when you want better data in debugging an issue (log messages, post mortem debugging, and more)