authz: protect disk list, disk create endpoints #661

davepacheco · 2022-01-31T20:40:03Z

No description provided.

davepacheco · 2022-01-31T20:43:42Z

nexus/tests/integration_tests/disks.rs

+    NexusRequest::new(
+        RequestBuilder::new(client, Method::POST, &disks_url)
+            .body(Some(&new_disk))
+            .expect_status(Some(StatusCode::INTERNAL_SERVER_ERROR)),


This line will conflict with #660. Whichever change lands second will need to update this line to StatusCode::SERVICE_UNAVAILABLE.

I expected this test would break after merging with #660, but it didn't. This request fails like this:

[2022-01-31T14:09:36.100487564-08:00] INFO: 8ea62a95-b8bb-4f7b-84eb-3d1416363a93/dropshot_external/8592 on ivanova: request completed (req_id=5151e1a8-8280-41cc-87b4-94d34ca64a2e, method=POST, remote_addr=127.0.0.1:51400, local_addr=127.0.0.1:39771, error_message_external="Internal Server Error", error_message_internal="Failed to create region, unexpected state: Failed", response_code=500)

@smklein is that what you expected in this test?

For this specific test, I actually think this is the right error.

The test is not checking for "what if we don't have enough regions" (as reported in #659) - in that case, we would return SERVICE_UNAVAILABLE, as we patched in #660.

This test:

Creates enough datasets for the allocation to succeed,

It even checks there is enough space to the disk allocation to succeed,

However, it makes sure that the "crucible agents" attached to the datasets fail when sending the request to allocate.

This is the crux of what we're testing: "What if allocation succeeds, but the request to the crucible agent fails?"

As Nexus has no built-in "retry + re-allocate" mechanism right now, propagating the 500 error doesn't totally seem wrong to me. More generally - what should we do if we make an allocation to a sled that has some fault when we try actually assigning the resource there?

Does this depend on the nature of the Crucible failure we're simulating? If Crucible itself is failing with the equivalent of a 503, then I'd expect the external request to fail with a 503 as well. If Crucible is failing with its own 500, I could go either way. I guess the question is: do we expect the problem is transient and that retrying might help? If so, I think a 503 is more appropriate. It seems like that's the case here if you think Nexus might handle this by retrying.

To make it super concrete, the state of Crucible is returned through this enum: https://github.com/oxidecomputer/crucible/blob/eda2dabb91228bc20b2b21599bdc05409ecfae68/agent/src/model.rs#L6-L14

Normally, we expect to request a region, and see the state transition from Requested to Created. However, from that API, Failed may also be returned. Frankly I'm unsure if I'd consider this a 500 or 503 from Crucible, as there's no indication whether retrying could potentially fix things.

I could see an argument in favor of 503 from Nexus, regardless of the response from Crucible - if the region allocation fails, we probably should retry, and if Nexus is clever enough (in the vein of RFD 237) it should deprioritize allocations to that failing dataset. I think I can actually see a parallel for instance allocation too - if an error occurs trying to set up propolis, should we tell the caller to retry the whole request again so that we allocate to a different sled?

On the other hand, doing something like this would effectively mask all 500 errors from Sleds, and avoid propagating those errors to callers. Maybe that's okay? But if a 500 error would be returned from all sleds making the allocation (e.g., due to a software bug), masking it as 503 seems dubious.

WDYT? I'm finding it kinda hard to make a call without getting a better idea of what State::Failed means in this context.

tl;r: I think we may still want to change this but the more I think about it, there are enough ambiguous cases that the choice isn't super clear. I would probably bias toward 503 but 500 also seems fine and either way we'll be informed by operational experience.

I could see an argument in favor of 503 from Nexus, regardless of the response from Crucible - if the region allocation fails, we probably should retry, and if Nexus is clever enough (in the vein of RFD 237) it should deprioritize allocations to that failing dataset. I think I can actually see a parallel for instance allocation too - if an error occurs trying to set up propolis, should we tell the caller to retry the whole request again so that we allocate to a different sled?

I think so. Also, I think that's true regardless of whether Nexus retries internally or not.

On the other hand, doing something like this would effectively mask all 500 errors from Sleds, and avoid propagating those errors to callers. Maybe that's okay? But if a 500 error would be returned from all sleds making the allocation (e.g., due to a software bug), masking it as 503 seems dubious.

That seems okay to me. The problem is we have no way to know which case we're in. Assuming that the failure is equivalent to a 500 (and so: a bug in Crucible):

Case 1: if you retry again, it definitely won't succeed. That means the request triggers a buggy code path in all currently-deployed Crucible instances. Why would this happen? Either:

This request [or set of conditions] has always triggered the bug.

There's a regression in Crucible and we didn't catch it in the canary phase.

Every Crucible instance has gotten into a bad state that prevents it from serving this request. For example, maybe it doesn't handle out-of-disk-space properly and they all ran out of disk space. This is really a case of "this request [or set of conditions] has always triggered this bug", in that it's not a regression.

Case 2: If you retry again, it might succeed. That means the request triggers a buggy code path in only some Crucible instances. So a 503 makes more sense. Example: this could happen if we're updating to a Crucible version with a regression, we're still in the canary phase, and the request got unlucky.

So we don't know what case we're in but we have to make a judgment. Case 2 (we accidentally deployed a regression) seems more likely to me than we suddenly discover some longstanding bug, but the latter definitely happens too! If the Failed variant doesn't necessarily mean a bug, but could also include cases equivalent to a 503, then I think that's all the more reason to err toward 503. But hopefully all of these cases are uncommon, and we'll be monitoring all of them, and it won't make a huge difference. If it does, we'll have data in hand to better understand what happened and how to treat this. So maybe just punt on this.

smklein · 2022-02-03T00:58:43Z

nexus/test-utils/src/resource_helpers.rs

+                description: String::from("sells rainsticks"),
+            },
+            snapshot_id: None,
+            size: ByteCount::from_gibibytes_u32(1),


This is definitely something some tests would like to customize, but I suppose it doesn't matter in the general case.

We don't need to change this now, but I wonder if a builder pattern might accommodate both use-cases.

Agreed. It occurred to me while adding this that the analogs for Organization and Project don't take create-time arguments, but that's because there basically aren't any that are useful for testing. If (or when) we find it's useful for Disks it should be easy to add.

smklein · 2022-02-03T01:00:09Z

nexus/tests/integration_tests/disks.rs

+    NexusRequest::new(
+        RequestBuilder::new(client, Method::POST, &disks_url)
+            .body(Some(&new_disk))
+            .expect_status(Some(StatusCode::INTERNAL_SERVER_ERROR)),


For this specific test, I actually think this is the right error.

The test is not checking for "what if we don't have enough regions" (as reported in #659) - in that case, we would return SERVICE_UNAVAILABLE, as we patched in #660.

This test:

Creates enough datasets for the allocation to succeed,

It even checks there is enough space to the disk allocation to succeed,

However, it makes sure that the "crucible agents" attached to the datasets fail when sending the request to allocate.

This is the crux of what we're testing: "What if allocation succeeds, but the request to the crucible agent fails?"

As Nexus has no built-in "retry + re-allocate" mechanism right now, propagating the 500 error doesn't totally seem wrong to me. More generally - what should we do if we make an allocation to a sled that has some fault when we try actually assigning the resource there?

smklein · 2022-02-03T01:13:54Z

nexus/tests/integration_tests/disks.rs

 }

 async fn disk_get(client: &ClientTestContext, disk_url: &str) -> Disk {
    object_get::<Disk>(client, disk_url).await
 }

 async fn disks_list(client: &ClientTestContext, list_url: &str) -> Vec<Disk> {
-    objects_list_page::<Disk>(client, list_url).await.items
+    NexusRequest::iter_collection_authn(client, list_url, "", 10)


nit: 10 is a magic number, could we make this a constant?

Sure. There are 5-10 other callers, and nearly all of them don't really care what value is provided here. So I've changed iter_collection_authn to accept an Option<usize> and put the named constant inside that function. It's used if you pass None.

Update to Propolis commit 50cb28f5 to obtain the following updates: ``` 50cb28f server: separate statuses of inbound and outbound migrations (#661) 61e1481 Don't return None from BlockData::block_for_req when attachment is paused (#711) 79e243b Build and clippy fixes for Rust 1.79 (#712) 3c0f998 Modernize fw_cfg emulation 7a65be9 Update and prune dependencies ``` 50cb28f5 changes the Propolis instance status and migration status APIs to report both inbound and outbound migration status. Update sled agent accordingly. Tests: `cargo nextest`, smoke test on a dev cluster.

davepacheco added 7 commits January 27, 2022 16:45

authz: make it easier to test authn/authz protection for new endpoints

4086d58

rip out now-redundant tests

2c28623

clean up and document new test

3eec4e3

self-review: clean up

af6b8d7

Merge branch 'main' into authz-tests

13e0c74

authz: protect disk list endpoint

80b0013

protect create too and update tests

01e7681

davepacheco changed the base branch from main to authz-tests January 31, 2022 20:40

fix oxapi_demo

ce5057c

davepacheco commented Jan 31, 2022

View reviewed changes

davepacheco requested a review from smklein January 31, 2022 20:44

Base automatically changed from authz-tests to main January 31, 2022 20:47

davepacheco mentioned this pull request Jan 31, 2022

tracking issue for authz work #419

Closed

71 tasks

Merge branch 'main' into authz-work

9e02487

davepacheco mentioned this pull request Feb 2, 2022

authz: update disk lookup and protect disk get/attach/detach #662

Merged

davepacheco requested a review from david-crespo February 2, 2022 19:37

smklein approved these changes Feb 3, 2022

View reviewed changes

davepacheco added 2 commits February 3, 2022 08:55

review feedback: magic number

88a08f9

Merge branch 'main' into authz-work

b642db8

davepacheco merged commit 1640fca into main Feb 3, 2022

davepacheco deleted the authz-work branch February 3, 2022 18:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

authz: protect disk list, disk create endpoints #661

authz: protect disk list, disk create endpoints #661

Uh oh!

davepacheco commented Jan 31, 2022

Uh oh!

davepacheco Jan 31, 2022

Uh oh!

davepacheco Jan 31, 2022

Uh oh!

smklein Feb 3, 2022

Uh oh!

davepacheco Feb 3, 2022

Uh oh!

smklein Feb 3, 2022 •

edited

Loading

Uh oh!

davepacheco Feb 3, 2022

Uh oh!

smklein Feb 3, 2022

Uh oh!

davepacheco Feb 3, 2022

Uh oh!

smklein Feb 3, 2022

Uh oh!

smklein Feb 3, 2022

Uh oh!

davepacheco Feb 3, 2022

Uh oh!

Uh oh!

authz: protect disk list, disk create endpoints #661

authz: protect disk list, disk create endpoints #661

Uh oh!

Conversation

davepacheco commented Jan 31, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

smklein Feb 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

smklein Feb 3, 2022 •

edited

Loading