authz: protect disk list, disk create endpoints #661
Conversation
```rust
NexusRequest::new(
    RequestBuilder::new(client, Method::POST, &disks_url)
        .body(Some(&new_disk))
        .expect_status(Some(StatusCode::INTERNAL_SERVER_ERROR)),
```
This line will conflict with #660. Whichever change lands second will need to update this line to `StatusCode::SERVICE_UNAVAILABLE`.
I expected this test would break after merging with #660, but it didn't. This request fails like this:
```
[2022-01-31T14:09:36.100487564-08:00] INFO: 8ea62a95-b8bb-4f7b-84eb-3d1416363a93/dropshot_external/8592 on ivanova: request completed (req_id=5151e1a8-8280-41cc-87b4-94d34ca64a2e, method=POST, remote_addr=127.0.0.1:51400, local_addr=127.0.0.1:39771, error_message_external="Internal Server Error", error_message_internal="Failed to create region, unexpected state: Failed", response_code=500)
```
@smklein is that what you expected in this test?
For this specific test, I actually think this is the right error.
The test is not checking for "what if we don't have enough regions" (as reported in #659) - in that case, we would return `SERVICE_UNAVAILABLE`, as we patched in #660.
This test:
- Creates enough datasets for the allocation to succeed,
- It even checks that there is enough space for the disk allocation to succeed,
- However, it makes sure that the "crucible agents" attached to the datasets fail when sending the request to allocate.
This is the crux of what we're testing: "What if allocation succeeds, but the request to the crucible agent fails?"
As Nexus has no built-in "retry + re-allocate" mechanism right now, propagating the 500 error doesn't totally seem wrong to me. More generally - what should we do if we make an allocation to a sled that has some fault when we try actually assigning the resource there?
Does this depend on the nature of the Crucible failure we're simulating? If Crucible itself is failing with the equivalent of a 503, then I'd expect the external request to fail with a 503 as well. If Crucible is failing with its own 500, I could go either way. I guess the question is: do we expect the problem is transient and that retrying might help? If so, I think a 503 is more appropriate. It seems like that's the case here if you think Nexus might handle this by retrying.
To make it super concrete, the state of Crucible is returned through this enum: https://github.com/oxidecomputer/crucible/blob/eda2dabb91228bc20b2b21599bdc05409ecfae68/agent/src/model.rs#L6-L14
Normally, we expect to request a region and see the state transition from `Requested` to `Created`. However, from that API, `Failed` may also be returned. Frankly, I'm unsure whether I'd consider this a 500 or a 503 from Crucible, as there's no indication whether retrying could potentially fix things.
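For reference, that enum looks roughly like this (paraphrased from the linked commit; the inline comments are my gloss, not Crucible's):

```rust
// Paraphrase of crucible/agent/src/model.rs at the linked commit.
pub enum State {
    Requested,  // region creation has been requested but not finished
    Created,    // the region exists and is usable
    Tombstoned, // the region has been marked for deletion
    Destroyed,  // the region has been deleted
    Failed,     // the agent could not create the region
}
```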
I could see an argument in favor of 503 from Nexus, regardless of the response from Crucible - if the region allocation fails, we probably should retry, and if Nexus is clever enough (in the vein of RFD 237) it should deprioritize allocations to that failing dataset. I think I can actually see a parallel for instance allocation too - if an error occurs trying to set up propolis, should we tell the caller to retry the whole request again so that we allocate to a different sled?
On the other hand, doing something like this would effectively mask all 500 errors from Sleds, and avoid propagating those errors to callers. Maybe that's okay? But if a 500 error would be returned from all sleds making the allocation (e.g., due to a software bug), masking it as 503 seems dubious.
WDYT? I'm finding it kinda hard to make a call without getting a better idea of what `State::Failed` means in this context.
tl;dr: I think we may still want to change this, but the more I think about it, there are enough ambiguous cases that the choice isn't super clear. I would probably bias toward 503, but 500 also seems fine, and either way we'll be informed by operational experience.
> I could see an argument in favor of 503 from Nexus, regardless of the response from Crucible - if the region allocation fails, we probably should retry, and if Nexus is clever enough (in the vein of RFD 237) it should deprioritize allocations to that failing dataset. I think I can actually see a parallel for instance allocation too - if an error occurs trying to set up propolis, should we tell the caller to retry the whole request again so that we allocate to a different sled?
I think so. Also, I think that's true regardless of whether Nexus retries internally or not.
> On the other hand, doing something like this would effectively mask all 500 errors from Sleds, and avoid propagating those errors to callers. Maybe that's okay? But if a 500 error would be returned from all sleds making the allocation (e.g., due to a software bug), masking it as 503 seems dubious.
That seems okay to me. The problem is we have no way to know which case we're in. Assuming that the failure is equivalent to a 500 (and so: a bug in Crucible):
- Case 1: if you retry again, it definitely won't succeed. That means the request triggers a buggy code path in all currently-deployed Crucible instances. Why would this happen? Either:
  - This request [or set of conditions] has always triggered the bug.
  - There's a regression in Crucible and we didn't catch it in the canary phase.
  - Every Crucible instance has gotten into a bad state that prevents it from serving this request. For example, maybe it doesn't handle out-of-disk-space properly and they all ran out of disk space. This is really a case of "this request [or set of conditions] has always triggered this bug", in that it's not a regression.
- Case 2: if you retry again, it might succeed. That means the request triggers a buggy code path in only some Crucible instances, so a 503 makes more sense. Example: this could happen if we're updating to a Crucible version with a regression, we're still in the canary phase, and the request got unlucky.
So we don't know what case we're in, but we have to make a judgment. Case 2 (we accidentally deployed a regression) seems more likely to me than suddenly discovering some longstanding bug, but the latter definitely happens too! If the `Failed` variant doesn't necessarily mean a bug, but could also include cases equivalent to a 503, then I think that's all the more reason to err toward 503. But hopefully all of these cases are uncommon, and we'll be monitoring all of them, and it won't make a huge difference. If it does, we'll have data in hand to better understand what happened and how to treat this. So maybe just punt on this.
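If we did bias toward 503, the translation in Nexus might look something like this minimal sketch - assuming the generated client exposes the agent's enum as `crucible_agent_client::types::State` and using `omicron_common`'s external `Error` constructors; the helper itself is hypothetical, not the actual saga code:

```rust
use crucible_agent_client::types::State;
use omicron_common::api::external::Error;

// Hypothetical helper: decide what Nexus returns to the external caller
// based on the region state reported by the Crucible agent.
fn region_ensure_result(state: State) -> Result<(), Error> {
    match state {
        State::Created => Ok(()),
        // Bias toward 503: the same allocation might succeed against a
        // different dataset, so the caller (or a future retry loop in
        // Nexus, in the vein of RFD 237) can reasonably try again.
        State::Failed => Err(Error::unavail(
            "region allocation failed; retrying may succeed elsewhere",
        )),
        // Seeing Requested/Tombstoned/Destroyed here would be a bug in
        // the disk-create saga itself, so a 500 seems appropriate.
        _ => Err(Error::internal_error("unexpected region state")),
    }
}
```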
```rust
        description: String::from("sells rainsticks"),
    },
    snapshot_id: None,
    size: ByteCount::from_gibibytes_u32(1),
```
This is definitely something some tests would like to customize, but I suppose it doesn't matter in the general case.
We don't need to change this now, but I wonder if a builder pattern might accommodate both use-cases.
Agreed. It occurred to me while adding this that the analogs for Organization and Project don't take create-time arguments, but that's because there basically aren't any that are useful for testing. If (or when) we find it's useful for Disks, it should be easy to add.
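To make the builder idea above concrete, here's a sketch - all helper names and import paths are my assumption; only the `params::DiskCreate` fields visible in the snippet above come from the actual code:

```rust
use omicron_common::api::external::{ByteCount, IdentityMetadataCreateParams};
use omicron_nexus::external_api::params;
use uuid::Uuid;

// Hypothetical test-helper builder: the common case stays terse, while a
// test that cares about (say) size can override just that one field.
pub struct DiskCreateBuilder {
    name: String,
    size: ByteCount,
    snapshot_id: Option<Uuid>,
}

impl DiskCreateBuilder {
    pub fn new(name: &str) -> Self {
        Self {
            name: name.to_string(),
            // Same defaults the helper hardcodes today.
            size: ByteCount::from_gibibytes_u32(1),
            snapshot_id: None,
        }
    }

    pub fn size(mut self, size: ByteCount) -> Self {
        self.size = size;
        self
    }

    pub fn build(self) -> params::DiskCreate {
        params::DiskCreate {
            identity: IdentityMetadataCreateParams {
                name: self.name.parse().unwrap(),
                description: String::from("sells rainsticks"),
            },
            snapshot_id: self.snapshot_id,
            size: self.size,
        }
    }
}
```

A default disk would stay a one-liner (`DiskCreateBuilder::new("my-disk").build()`), while a size-sensitive test could chain `.size(...)` before `build()`.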
```diff
 }

 async fn disk_get(client: &ClientTestContext, disk_url: &str) -> Disk {
     object_get::<Disk>(client, disk_url).await
 }

 async fn disks_list(client: &ClientTestContext, list_url: &str) -> Vec<Disk> {
-    objects_list_page::<Disk>(client, list_url).await.items
+    NexusRequest::iter_collection_authn(client, list_url, "", 10)
```
nit: `10` is a magic number; could we make this a constant?
Sure. There are 5-10 other callers, and nearly all of them don't really care what value is provided here. So I've changed `iter_collection_authn` to accept an `Option<usize>` and put the named constant inside that function; it's used if you pass `None`.
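The resulting shape is roughly the following sketch (illustrative only: the real helper is a method on `NexusRequest`, the constant name is made up, and the body is elided):

```rust
use dropshot::test_util::ClientTestContext;

// Named constant replacing the magic `10`; used when a caller passes `None`.
const DEFAULT_PAGE_SIZE: usize = 10;

// Illustrative signature: iterate a paginated collection as an
// authenticated user, fetching `page_size` items per request.
async fn iter_collection_authn<T: serde::de::DeserializeOwned>(
    client: &ClientTestContext,
    collection_url: &str,
    initial_params: &str,
    page_size: Option<usize>,
) -> Vec<T> {
    let page_size = page_size.unwrap_or(DEFAULT_PAGE_SIZE);
    // ... fetch pages of `page_size` items until the collection is
    // exhausted, accumulating the results ...
    unimplemented!()
}
```

Callers that don't care, like `disks_list` above, then just pass `None`.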