Skip to content

Conversation

@labbott
Copy link
Collaborator

@labbott labbott commented Jun 11, 2025

No description provided.

Copy link
Contributor

@aapoalas aapoalas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Peanut gallery passing by again after a spell.

self.qspi.sector_erase(write_addr as u32);
self.poll_for_write_complete(Some(1));
self.qspi
.sector_erase(write_addr as u32)

This comment was marked as outdated.

}

enum SlotHash {
// Nothing has requested a hash of this slot yet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Doc comment maybe?

}
}

// This assumes `begin` and `end` have been bounds checked
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This sounds like it definitely should be a doc comment; should the method be unsafe as well since it has a precondition?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like misusing this can violate Rust safety, so I don't think unsafe is necessary.

As a reader it would be helpful to say how these are bounds checked, like, are these sector numbers or byte indices, and are they bounds checked against the current flash chip or a portion of it, etc.

// The only way we should get an error from this is if
// we somehow call update before we've initialized or
// after we've finished the hash.
self.hash_range_update(addr, addr + step_size).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Should these be using unwrap_lite()?

Also, is it possible for addr + step_size to overflow with a range at the end of the memory range that's not quite aligned to block sizes? Presumably that's not possible, in which case this could perhaps be a unchecked add. Otherwise, a checked add might be in order.

Can also reuse the addr + step_size result below just to make sure the compiler doesn't accidentally miss the deduplication chance.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This calculation could technically overflow if the addr value is chosen to be very large, and the end value is chosen to be within 256 of u32::MAX (if I'm doing the math right). If this happens, we'll panic.

If we want to address this, we really need to maintain some invariants over the initial value of addr and step_size, I'm not sure altering the addition operation is the right approach.

I think this is a relatively low risk though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(But, yes, these should probably be unwrap_lite)

_: &RecvMessage,
_slot: HfDevSelect,
) -> Result<[u8; SHA256_SZ], RequestError<HfError>> {
Ok(*b"mockmockmockmockmockmockmockmock")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: :awsm:

// perform transfers.
let mut out = out;
while !out.is_empty() {
let sr = self.reg.sr.read();

This comment was marked as outdated.

self.reg
.cr
.modify(|_, w| w.ftie().clear_bit().tcie().clear_bit());
self.reg.cr.modify(|_, w| {

This comment was marked as outdated.

@labbott labbott force-pushed the async_hash_and_read branch from d95a07c to d1b53db Compare June 12, 2025 14:16
@labbott labbott requested review from cbiffle, hawkw and mkeeter June 12, 2025 14:21
}

// This does a sha256 on the entire range _except_ sector0
// which is treted as `0xff`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo:

Suggested change
// which is treted as `0xff`
// which is treated as `0xff`

dev: HfDevSelect,
) -> Result<[u8; SHA256_SZ], RequestError<HfError>> {
match dev {
HfDevSelect::Flash0 => match self.cached_hash0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This SlotHash => Result<[u8; SHA256_SZ, RequestError<HfError>> repeats 4 times over; would a From impl or such perhaps be warranted to deduplicate the code?

Copy link
Collaborator

@cbiffle cbiffle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing where I can't leave a comment on lines you didn't touch is still unfortunate.

In drv/gimlet-hf-server/src/main.rs, I think you want to invalidate the cached hash at line 511, 674, 686, and possibly in the set_mux IPC to be sure. Check the similar operations in cosmo too.

I think we might want to block attempts to change the mux while an async hash is in progress. There are a couple of comments below where I noted potential problems with permitting mux changes -- in particular, it allows you to panic the hf task mid-hash. But it also seems dodgy from a TOCTOU perspective, even if the race window is exceptionally short. If we decide to do this, some of the code becomes simpler -- we no longer have to save and restore the dev setting (and instead maybe just check that it's still SP).

Cargo.toml Outdated
attest-data = { git = "https://github.com/oxidecomputer/dice-util", default-features = false, version = "0.4.0" }
dice-mfg-msgs = { git = "https://github.com/oxidecomputer/dice-util", default-features = false, version = "0.2.1" }
gateway-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", default-features = false, features = ["smoltcp"] }
gateway-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", branch = "read_host_flash", default-features = false, features = ["smoltcp"] }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make it less likely that we forget this, this should get fixed up before merge.

}
}

// This assumes `begin` and `end` have been bounds checked
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like misusing this can violate Rust safety, so I don't think unsafe is necessary.

As a reader it would be helpful to say how these are bounds checked, like, are these sector numbers or byte indices, and are they bounds checked against the current flash chip or a portion of it, etc.

.unwrap_lite();
if let Err(e) = self.hash.update(size as u32, &buf[..size]) {
ringbuf_entry!(Trace::HashUpdateError(e));
return Err(HfError::HashError);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(You don't necessarily need to change this in this PR, but I've been trying to notice cases where we're putting useful information in a ringbuf and not returning it, so that I can point out: we can return things in errors now, and that pattern is generally more useful when we're debugging over the network without full ringbuf access.)

// The only way we should get an error from this is if
// we somehow call update before we've initialized or
// after we've finished the hash.
self.hash_range_update(addr, addr + step_size).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This calculation could technically overflow if the addr value is chosen to be very large, and the end value is chosen to be within 256 of u32::MAX (if I'm doing the math right). If this happens, we'll panic.

If we want to address this, we really need to maintain some invariants over the initial value of addr and step_size, I'm not sure altering the addition operation is the right approach.

I think this is a relatively low risk though.

// The only way we should get an error from this is if
// we somehow call update before we've initialized or
// after we've finished the hash.
self.hash_range_update(addr, addr + step_size).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(But, yes, these should probably be unwrap_lite)

addr: SECTOR_SIZE_BYTES,
end: self.capacity,
};
sys_set_timer(Some(sys_get_timer().now + 1), notifications::TIMER_MASK);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB: sys_set_timer_relative provides a simpler API for this that also bakes in avoiding a bounds check and panic (which can't happen in reality but the compiler doesn't know that). In case that's useful.

let step_size = BLOCK_STEP_SIZE;

let prev = self.dev_state;
self.set_dev(dev).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone and looked at other cases where we're unwrapping set_dev, and in each case it's asserted to be safe because we've checked that the flash is muxed to the SP. They could make this assertion because the code was straight-line and didn't potentially handle other IPCs asking to e.g. switch the mux back.

I don't think this property holds in the async case, so I think this is effectively an IPC/network controlled panic, unless I missed the interlock. We could do two different things to prevent this.

  1. We could check for the Err result here and stop the hashing process because the mux has been switched. I'm a bit nervous about that since, in theory, you could switch the mux back and forth between two calls to step_hash without us noticing. (Yeah, you'd have to do that in under a millisecond, but, still.)
  2. We could block requests to switch the mux while a hash is in progress, by adding code to the set_mux path (and anywhere else that flips the mux). This appeals to me, since it ensures that the hash is hashing consistent stuff without interlopers.

Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial analysis is partially wrong. Spoke with @labbott about this directly and the problem is smaller than I'd anticipated, because the IPC handlers try to invalidate the cache where appropriate, which in turn prevents step_hash from doing anything useful (and panicking).

However, there's at least one path where this protection is incomplete rn: you can call page_program_dev with an invalid argument and arrange for it to call set_dev, then delegate to page_program, which rejects the argument and returns an error -- leaving the mux flipped.

We're poking at the code to see if there's a more central place to put cache invalidation to prevent this class of error.

},
Err(_) => (),
},
HashState::Hashing { .. } => sys_set_timer(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(consider sys_set_timer_relative here, it's shorter and cheaper)

// we somehow call update before we've initialized or
// after we've finished the hash.
self.hash_range_update(addr, addr + step_size).unwrap();
self.set_dev(prev).unwrap(); // infallible if the earlier set_dev worked
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually: why would we allow the dev to be switched to host while we're hashing? Wouldn't that potentially raise TOCTOU issues?

So do we need to be setting/restoring the mux setting each time we go through this, or should we maintain the invariant that if we're in a hashing state, we have set the mux to SP, and we'll restore it at the very end if at all?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Laura's clarified the design intent here, which is that you can switch the mux all you want, but the hash should be invalidated if you do so. I think that means this save/restore code and its unwraps can be removed.)

@cbiffle
Copy link
Collaborator

cbiffle commented Jun 12, 2025

...Laura has also pointed out that I've confused set_dev and set_mux in several places, so, I may be entirely off base here. Wheeeeee

Re-checking.

@mkeeter
Copy link
Collaborator

mkeeter commented Jun 12, 2025

Should we put the hash data into a common place? Right now, it looks like there's a decent amount of copypasta between cosmo-hf and gimlet-hf.

Something like

struct HashData {
    hash: drv_hash_api::Hash,
    cached_hash0: SlotHash,
    cached_hash1: SlotHash,
    state: HashState,
}

enum HashState {
    Done,
    Hashing {
        dev: HfDevSelect,
        addr: usize,
        end: usize,
    },
    Uninitialized,
}

enum SlotHash {
    // Nothing has requested a hash of this slot yet
    Uncalculated,
    // Something has modified this slot
    Recalculate,
    // In progress calculation
    HashInProgress,
    // A valid hash
    Hash([u8; SHA256_SZ]),
}

(then move more common code into methods on the new HashData)

@labbott
Copy link
Collaborator Author

labbott commented Jun 12, 2025

Should we put the hash data into a common place? Right now, it looks like there's a decent amount of copypasta between cosmo-hf and gimlet-hf.

I think that's reasonable. This started out as gimlet only then I realized I needed cosmo too. It does look like we can use the same approach in both places.

@cbiffle
Copy link
Collaborator

cbiffle commented Jun 12, 2025

Alright, so, there is an actual async state machine issue here, but it's not exactly what I described, as I'm paging this code back in.

The issue is that you can freely switch the mux while a hash computation is in progress. (The mux is not the dev.) This is probably bad, since it implies potentially giving the host write access, or just corrupting the hash computation.

I currently feel like the best compromise might be:

  • If the mux is flipped, cancel any ongoing hash computation.
  • But retain any previously computed result, since it represents a useful capture of the flash state at the point in time when it was computed.

This also suggests that we might want to reconsider aggressively invalidating the computed hash on other mutation operations -- since it's still an accurate description of the flash contents, even if (say) a page erase has happened since. I'm less convinced of this though, and think consistent invalidation is the safest option.

@labbott labbott force-pushed the async_hash_and_read branch 2 times, most recently from 8ea13de to ae79b7c Compare June 17, 2025 18:08
impl NotificationHandler for ServerImpl {
fn current_notification_mask(&self) -> u32 {
0
notifications::TIMER_MASK
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only request this notification if the hash is in progress? I guess it's harmless to always have the bit active...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to match what we do throughout hubris

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay!

Introduce three new commands: `ReadHostFlash`, `StartHostFlashHash` and
`GetHostFlashHash`. `ReadHostFlash` is self-explanatory.
`StartHostFlashHash` starts an asynchronous calculation of a host flash
slot and `GetHostFlashHash` retrieves it when available.
@labbott labbott force-pushed the async_hash_and_read branch from e9e99a3 to d9d7823 Compare June 19, 2025 00:23
@labbott labbott merged commit 3f419ef into master Jun 23, 2025
135 checks passed
@labbott labbott deleted the async_hash_and_read branch June 23, 2025 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants