
authz: protect disk list, disk create endpoints #661


Merged
davepacheco merged 11 commits into main from authz-work on Feb 3, 2022

Conversation

davepacheco
Collaborator

No description provided.

davepacheco changed the base branch from main to authz-tests on January 31, 2022 20:40
NexusRequest::new(
    RequestBuilder::new(client, Method::POST, &disks_url)
        .body(Some(&new_disk))
        .expect_status(Some(StatusCode::INTERNAL_SERVER_ERROR)),
davepacheco
Collaborator Author

This line will conflict with #660. Whichever change lands second will need to update this line to StatusCode::SERVICE_UNAVAILABLE.
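
For reference, a sketch of what the updated assertion would look like after that change lands (assuming the surrounding request-building code stays the same; only the expected status changes):

```rust
// Post-#660 version of this assertion (sketch): same request shape,
// but expecting 503 instead of 500.
NexusRequest::new(
    RequestBuilder::new(client, Method::POST, &disks_url)
        .body(Some(&new_disk))
        .expect_status(Some(StatusCode::SERVICE_UNAVAILABLE)),
)
```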

davepacheco
Collaborator Author

I expected this test would break after merging with #660, but it didn't. This request fails like this:

[2022-01-31T14:09:36.100487564-08:00]  INFO: 8ea62a95-b8bb-4f7b-84eb-3d1416363a93/dropshot_external/8592 on ivanova: request completed (req_id=5151e1a8-8280-41cc-87b4-94d34ca64a2e, method=POST, remote_addr=127.0.0.1:51400, local_addr=127.0.0.1:39771, error_message_external="Internal Server Error", error_message_internal="Failed to create region, unexpected state: Failed", response_code=500)

@smklein is that what you expected in this test?

smklein
Collaborator

For this specific test, I actually think this is the right error.

The test is not checking for "what if we don't have enough regions" (as reported in #659) - in that case, we would return SERVICE_UNAVAILABLE, as we patched in #660.

This test, on the other hand, gets at the crux of what we're testing: "What if allocation succeeds, but the request to the crucible agent fails?"

As Nexus has no built-in "retry + re-allocate" mechanism right now, propagating the 500 error doesn't totally seem wrong to me. More generally - what should we do if we make an allocation to a sled that has some fault when we actually try to assign the resource there?

davepacheco
Collaborator Author

Does this depend on the nature of the Crucible failure we're simulating? If Crucible itself is failing with the equivalent of a 503, then I'd expect the external request to fail with a 503 as well. If Crucible is failing with its own 500, I could go either way. I guess the question is: do we expect the problem is transient and that retrying might help? If so, I think a 503 is more appropriate. It seems like that's the case here if you think Nexus might handle this by retrying.

smklein
Collaborator
Feb 3, 2022

To make it super concrete, the state of Crucible is returned through this enum: https://github.com/oxidecomputer/crucible/blob/eda2dabb91228bc20b2b21599bdc05409ecfae68/agent/src/model.rs#L6-L14

Normally, we expect to request a region, and see the state transition from Requested to Created. However, from that API, Failed may also be returned. Frankly I'm unsure if I'd consider this a 500 or 503 from Crucible, as there's no indication whether retrying could potentially fix things.
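
A rough sketch of the relevant variants, paraphrased from the linked model.rs (the actual enum may contain more variants than shown here):

```rust
// Paraphrased sketch of the Crucible agent's region state; see the linked
// agent/src/model.rs for the real definition.
pub enum State {
    Requested, // region creation has been requested but not completed
    Created,   // the agent successfully created the region
    Failed,    // the agent could not create the region
}
```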

I could see an argument in favor of 503 from Nexus, regardless of the response from Crucible - if the region allocation fails, we probably should retry, and if Nexus is clever enough (in the vein of RFD 237) it should deprioritize allocations to that failing dataset. I think I can actually see a parallel for instance allocation too - if an error occurs trying to set up propolis, should we tell the caller to retry the whole request again so that we allocate to a different sled?

On the other hand, doing something like this would effectively mask all 500 errors from Sleds, and avoid propagating those errors to callers. Maybe that's okay? But if a 500 error would be returned from all sleds making the allocation (e.g., due to a software bug), masking it as 503 seems dubious.

WDYT? I'm finding it kinda hard to make a call without getting a better idea of what State::Failed means in this context.

davepacheco
Collaborator Author

tl;dr: I think we may still want to change this, but the more I think about it, there are enough ambiguous cases that the choice isn't super clear. I would probably bias toward 503, but 500 also seems fine, and either way we'll be informed by operational experience.


> I could see an argument in favor of 503 from Nexus, regardless of the response from Crucible - if the region allocation fails, we probably should retry, and if Nexus is clever enough (in the vein of RFD 237) it should deprioritize allocations to that failing dataset. I think I can actually see a parallel for instance allocation too - if an error occurs trying to set up propolis, should we tell the caller to retry the whole request again so that we allocate to a different sled?

I think so. Also, I think that's true regardless of whether Nexus retries internally or not.

> On the other hand, doing something like this would effectively mask all 500 errors from Sleds, and avoid propagating those errors to callers. Maybe that's okay? But if a 500 error would be returned from all sleds making the allocation (e.g., due to a software bug), masking it as 503 seems dubious.

That seems okay to me. The problem is we have no way to know which case we're in. Assuming that the failure is equivalent to a 500 (and so: a bug in Crucible):

  • Case 1: if you retry again, it definitely won't succeed. That means the request triggers a buggy code path in all currently-deployed Crucible instances. Why would this happen? Either:
    • This request [or set of conditions] has always triggered the bug.
    • There's a regression in Crucible and we didn't catch it in the canary phase.
    • Every Crucible instance has gotten into a bad state that prevents it from serving this request. For example, maybe it doesn't handle out-of-disk-space properly and they all ran out of disk space. This is really a case of "this request [or set of conditions] has always triggered this bug", in that it's not a regression.
  • Case 2: If you retry again, it might succeed. That means the request triggers a buggy code path in only some Crucible instances. So a 503 makes more sense. Example: this could happen if we're updating to a Crucible version with a regression, we're still in the canary phase, and the request got unlucky.

So we don't know which case we're in, but we have to make a judgment. Case 2 (we accidentally deployed a regression) seems more likely to me than suddenly discovering some longstanding bug, but the latter definitely happens too! If the Failed variant doesn't necessarily mean a bug, but could also include cases equivalent to a 503, then I think that's all the more reason to err toward 503. But hopefully all of these cases are uncommon, and we'll be monitoring all of them, and it won't make a huge difference. If it does, we'll have data in hand to better understand what happened and how to treat this. So maybe just punt on this.
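
If we did bias toward 503, I'd expect the change to be a small mapping at the point where Nexus observes the failed region, something like this (entirely hypothetical names, just to illustrate where the 500-vs-503 decision would live; not the actual Nexus or Crucible types):

```rust
// Hypothetical sketch: names are illustrative only.
enum RegionState {
    Requested,
    Created,
    Failed,
}

enum ExternalError {
    ServiceUnavailable(String), // HTTP 503: transient, caller may retry
    InternalError(String),      // HTTP 500: likely a bug somewhere
}

fn region_ensure_result(state: RegionState) -> Result<(), ExternalError> {
    match state {
        RegionState::Created => Ok(()),
        // Bias toward "retryable": a retry may be placed on a different,
        // healthy dataset (per the RFD 237-style deprioritization idea).
        RegionState::Failed => Err(ExternalError::ServiceUnavailable(
            "region allocation failed; retry may succeed elsewhere".into(),
        )),
        // A region stuck in Requested would be unexpected here.
        RegionState::Requested => Err(ExternalError::InternalError(
            "region never reached a terminal state".into(),
        )),
    }
}
```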

davepacheco requested a review from smklein on January 31, 2022 20:44
Base automatically changed from authz-tests to main on January 31, 2022 20:47
davepacheco mentioned this pull request on Jan 31, 2022
        description: String::from("sells rainsticks"),
    },
    snapshot_id: None,
    size: ByteCount::from_gibibytes_u32(1),
smklein
Collaborator

This is definitely something some tests would like to customize, but I suppose it doesn't matter in the general case.

We don't need to change this now, but I wonder if a builder pattern might accommodate both use-cases.
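
Something like this shape is what I have in mind (sketch only: DiskTestBuilder and its defaults are made up, the body of create() is elided, and ClientTestContext, ByteCount, Uuid, and Disk are assumed to be the types already imported in this test file):

```rust
// Sketch of a builder-style test helper; hypothetical names.
struct DiskTestBuilder {
    name: String,
    size: ByteCount,
    snapshot_id: Option<Uuid>,
}

impl DiskTestBuilder {
    fn new(name: &str) -> Self {
        Self {
            name: name.to_string(),
            // Defaults cover the common case; tests that care can override.
            size: ByteCount::from_gibibytes_u32(1),
            snapshot_id: None,
        }
    }

    fn size(mut self, size: ByteCount) -> Self {
        self.size = size;
        self
    }

    fn snapshot_id(mut self, id: Uuid) -> Self {
        self.snapshot_id = Some(id);
        self
    }

    async fn create(self, client: &ClientTestContext, disks_url: &str) -> Disk {
        // Build the disk-create params from the configured fields and issue
        // the POST, the same way the current helper does (body elided).
        todo!()
    }
}
```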

davepacheco
Collaborator Author

Agreed. It occurred to me while adding this that the analogs for Organization and Project don't take create-time arguments, but that's because there basically aren't any that are useful for testing. If (or when) we find it's useful for Disks it should be easy to add.


}

async fn disk_get(client: &ClientTestContext, disk_url: &str) -> Disk {
    object_get::<Disk>(client, disk_url).await
}

async fn disks_list(client: &ClientTestContext, list_url: &str) -> Vec<Disk> {
-    objects_list_page::<Disk>(client, list_url).await.items
+    NexusRequest::iter_collection_authn(client, list_url, "", 10)
smklein
Collaborator

nit: 10 is a magic number, could we make this a constant?

davepacheco
Collaborator Author

Sure. There are 5-10 other callers, and nearly all of them don't really care what value is provided here. So I've changed iter_collection_authn to accept an Option<usize> and put the named constant inside that function. It's used if you pass None.
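
Roughly, the new shape looks like this (sketch only: the constant's real name and the helper's actual return type are elided, and this free-function signature stands in for the method on NexusRequest):

```rust
// Sketch of the updated helper; the constant name and value shown here
// are illustrative.
const DEFAULT_PAGE_SIZE: usize = 10;

async fn iter_collection_authn<T>(
    client: &ClientTestContext,
    collection_url: &str,
    initial_params: &str,
    page_size: Option<usize>,
) -> Vec<T> {
    // Callers that don't care pass None and get the named default.
    let page_size = page_size.unwrap_or(DEFAULT_PAGE_SIZE);
    // ... issue authenticated, paginated GETs of at most `page_size`
    // items per page until the collection is exhausted (body elided) ...
    todo!()
}
```

Call sites that don't care about the page size, like the one above, can then just pass None.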

davepacheco merged commit 1640fca into main on Feb 3, 2022
davepacheco deleted the authz-work branch on February 3, 2022 18:00
gjcolombo added a commit that referenced this pull request Jun 18, 2024
Update to Propolis commit 50cb28f5 to obtain the following updates:

```
50cb28f server: separate statuses of inbound and outbound migrations (#661)
61e1481 Don't return None from BlockData::block_for_req when attachment is paused (#711)
79e243b Build and clippy fixes for Rust 1.79 (#712)
3c0f998 Modernize fw_cfg emulation
7a65be9 Update and prune dependencies
```

50cb28f5 changes the Propolis instance status and migration status APIs
to report both inbound and outbound migration status. Update sled agent
accordingly.

Tests: `cargo nextest`, smoke test on a dev cluster.