-
Notifications
You must be signed in to change notification settings - Fork 210
Async hash and read #2089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Async hash and read #2089
Conversation
aapoalas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Peanut gallery passing by again after a spell.
| self.qspi.sector_erase(write_addr as u32); | ||
| self.poll_for_write_complete(Some(1)); | ||
| self.qspi | ||
| .sector_erase(write_addr as u32) |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
drv/cosmo-hf/src/hf.rs
Outdated
| } | ||
|
|
||
| enum SlotHash { | ||
| // Nothing has requested a hash of this slot yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Doc comment maybe?
drv/cosmo-hf/src/hf.rs
Outdated
| } | ||
| } | ||
|
|
||
| // This assumes `begin` and `end` have been bounds checked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: This sounds like it definitely should be a doc comment; should the method be unsafe as well since it has a precondition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like misusing this can violate Rust safety, so I don't think unsafe is necessary.
As a reader it would be helpful to say how these are bounds checked, like, are these sector numbers or byte indices, and are they bounds checked against the current flash chip or a portion of it, etc.
drv/cosmo-hf/src/hf.rs
Outdated
| // The only way we should get an error from this is if | ||
| // we somehow call update before we've initialized or | ||
| // after we've finished the hash. | ||
| self.hash_range_update(addr, addr + step_size).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Should these be using unwrap_lite()?
Also, is it possible for addr + step_size to overflow with a range at the end of the memory range that's not quite aligned to block sizes? Presumably that's not possible, in which case this could perhaps be a unchecked add. Otherwise, a checked add might be in order.
Can also reuse the addr + step_size result below just to make sure the compiler doesn't accidentally miss the deduplication chance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This calculation could technically overflow if the addr value is chosen to be very large, and the end value is chosen to be within 256 of u32::MAX (if I'm doing the math right). If this happens, we'll panic.
If we want to address this, we really need to maintain some invariants over the initial value of addr and step_size, I'm not sure altering the addition operation is the right approach.
I think this is a relatively low risk though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(But, yes, these should probably be unwrap_lite)
| _: &RecvMessage, | ||
| _slot: HfDevSelect, | ||
| ) -> Result<[u8; SHA256_SZ], RequestError<HfError>> { | ||
| Ok(*b"mockmockmockmockmockmockmockmock") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: :awsm:
| // perform transfers. | ||
| let mut out = out; | ||
| while !out.is_empty() { | ||
| let sr = self.reg.sr.read(); |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| self.reg | ||
| .cr | ||
| .modify(|_, w| w.ftie().clear_bit().tcie().clear_bit()); | ||
| self.reg.cr.modify(|_, w| { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
d95a07c to
d1b53db
Compare
drv/gimlet-hf-server/src/main.rs
Outdated
| } | ||
|
|
||
| // This does a sha256 on the entire range _except_ sector0 | ||
| // which is treted as `0xff` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo:
| // which is treted as `0xff` | |
| // which is treated as `0xff` |
drv/gimlet-hf-server/src/main.rs
Outdated
| dev: HfDevSelect, | ||
| ) -> Result<[u8; SHA256_SZ], RequestError<HfError>> { | ||
| match dev { | ||
| HfDevSelect::Flash0 => match self.cached_hash0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: This SlotHash => Result<[u8; SHA256_SZ, RequestError<HfError>> repeats 4 times over; would a From impl or such perhaps be warranted to deduplicate the code?
cbiffle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing where I can't leave a comment on lines you didn't touch is still unfortunate.
In drv/gimlet-hf-server/src/main.rs, I think you want to invalidate the cached hash at line 511, 674, 686, and possibly in the set_mux IPC to be sure. Check the similar operations in cosmo too.
I think we might want to block attempts to change the mux while an async hash is in progress. There are a couple of comments below where I noted potential problems with permitting mux changes -- in particular, it allows you to panic the hf task mid-hash. But it also seems dodgy from a TOCTOU perspective, even if the race window is exceptionally short. If we decide to do this, some of the code becomes simpler -- we no longer have to save and restore the dev setting (and instead maybe just check that it's still SP).
Cargo.toml
Outdated
| attest-data = { git = "https://github.com/oxidecomputer/dice-util", default-features = false, version = "0.4.0" } | ||
| dice-mfg-msgs = { git = "https://github.com/oxidecomputer/dice-util", default-features = false, version = "0.2.1" } | ||
| gateway-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", default-features = false, features = ["smoltcp"] } | ||
| gateway-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", branch = "read_host_flash", default-features = false, features = ["smoltcp"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make it less likely that we forget this, this should get fixed up before merge.
drv/cosmo-hf/src/hf.rs
Outdated
| } | ||
| } | ||
|
|
||
| // This assumes `begin` and `end` have been bounds checked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like misusing this can violate Rust safety, so I don't think unsafe is necessary.
As a reader it would be helpful to say how these are bounds checked, like, are these sector numbers or byte indices, and are they bounds checked against the current flash chip or a portion of it, etc.
| .unwrap_lite(); | ||
| if let Err(e) = self.hash.update(size as u32, &buf[..size]) { | ||
| ringbuf_entry!(Trace::HashUpdateError(e)); | ||
| return Err(HfError::HashError); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(You don't necessarily need to change this in this PR, but I've been trying to notice cases where we're putting useful information in a ringbuf and not returning it, so that I can point out: we can return things in errors now, and that pattern is generally more useful when we're debugging over the network without full ringbuf access.)
drv/cosmo-hf/src/hf.rs
Outdated
| // The only way we should get an error from this is if | ||
| // we somehow call update before we've initialized or | ||
| // after we've finished the hash. | ||
| self.hash_range_update(addr, addr + step_size).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This calculation could technically overflow if the addr value is chosen to be very large, and the end value is chosen to be within 256 of u32::MAX (if I'm doing the math right). If this happens, we'll panic.
If we want to address this, we really need to maintain some invariants over the initial value of addr and step_size, I'm not sure altering the addition operation is the right approach.
I think this is a relatively low risk though.
drv/cosmo-hf/src/hf.rs
Outdated
| // The only way we should get an error from this is if | ||
| // we somehow call update before we've initialized or | ||
| // after we've finished the hash. | ||
| self.hash_range_update(addr, addr + step_size).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(But, yes, these should probably be unwrap_lite)
drv/gimlet-hf-server/src/main.rs
Outdated
| addr: SECTOR_SIZE_BYTES, | ||
| end: self.capacity, | ||
| }; | ||
| sys_set_timer(Some(sys_get_timer().now + 1), notifications::TIMER_MASK); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: sys_set_timer_relative provides a simpler API for this that also bakes in avoiding a bounds check and panic (which can't happen in reality but the compiler doesn't know that). In case that's useful.
drv/gimlet-hf-server/src/main.rs
Outdated
| let step_size = BLOCK_STEP_SIZE; | ||
|
|
||
| let prev = self.dev_state; | ||
| self.set_dev(dev).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone and looked at other cases where we're unwrapping set_dev, and in each case it's asserted to be safe because we've checked that the flash is muxed to the SP. They could make this assertion because the code was straight-line and didn't potentially handle other IPCs asking to e.g. switch the mux back.
I don't think this property holds in the async case, so I think this is effectively an IPC/network controlled panic, unless I missed the interlock. We could do two different things to prevent this.
- We could check for the Err result here and stop the hashing process because the mux has been switched. I'm a bit nervous about that since, in theory, you could switch the mux back and forth between two calls to
step_hashwithout us noticing. (Yeah, you'd have to do that in under a millisecond, but, still.) - We could block requests to switch the mux while a hash is in progress, by adding code to the
set_muxpath (and anywhere else that flips the mux). This appeals to me, since it ensures that the hash is hashing consistent stuff without interlopers.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial analysis is partially wrong. Spoke with @labbott about this directly and the problem is smaller than I'd anticipated, because the IPC handlers try to invalidate the cache where appropriate, which in turn prevents step_hash from doing anything useful (and panicking).
However, there's at least one path where this protection is incomplete rn: you can call page_program_dev with an invalid argument and arrange for it to call set_dev, then delegate to page_program, which rejects the argument and returns an error -- leaving the mux flipped.
We're poking at the code to see if there's a more central place to put cache invalidation to prevent this class of error.
drv/gimlet-hf-server/src/main.rs
Outdated
| }, | ||
| Err(_) => (), | ||
| }, | ||
| HashState::Hashing { .. } => sys_set_timer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(consider sys_set_timer_relative here, it's shorter and cheaper)
drv/gimlet-hf-server/src/main.rs
Outdated
| // we somehow call update before we've initialized or | ||
| // after we've finished the hash. | ||
| self.hash_range_update(addr, addr + step_size).unwrap(); | ||
| self.set_dev(prev).unwrap(); // infallible if the earlier set_dev worked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually: why would we allow the dev to be switched to host while we're hashing? Wouldn't that potentially raise TOCTOU issues?
So do we need to be setting/restoring the mux setting each time we go through this, or should we maintain the invariant that if we're in a hashing state, we have set the mux to SP, and we'll restore it at the very end if at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Laura's clarified the design intent here, which is that you can switch the mux all you want, but the hash should be invalidated if you do so. I think that means this save/restore code and its unwraps can be removed.)
|
...Laura has also pointed out that I've confused Re-checking. |
|
Should we put the hash data into a common place? Right now, it looks like there's a decent amount of copypasta between Something like struct HashData {
hash: drv_hash_api::Hash,
cached_hash0: SlotHash,
cached_hash1: SlotHash,
state: HashState,
}
enum HashState {
Done,
Hashing {
dev: HfDevSelect,
addr: usize,
end: usize,
},
Uninitialized,
}
enum SlotHash {
// Nothing has requested a hash of this slot yet
Uncalculated,
// Something has modified this slot
Recalculate,
// In progress calculation
HashInProgress,
// A valid hash
Hash([u8; SHA256_SZ]),
}(then move more common code into methods on the new |
I think that's reasonable. This started out as gimlet only then I realized I needed cosmo too. It does look like we can use the same approach in both places. |
|
Alright, so, there is an actual async state machine issue here, but it's not exactly what I described, as I'm paging this code back in. The issue is that you can freely switch the mux while a hash computation is in progress. (The mux is not the dev.) This is probably bad, since it implies potentially giving the host write access, or just corrupting the hash computation. I currently feel like the best compromise might be:
This also suggests that we might want to reconsider aggressively invalidating the computed hash on other mutation operations -- since it's still an accurate description of the flash contents, even if (say) a page erase has happened since. I'm less convinced of this though, and think consistent invalidation is the safest option. |
8ea13de to
ae79b7c
Compare
| impl NotificationHandler for ServerImpl { | ||
| fn current_notification_mask(&self) -> u32 { | ||
| 0 | ||
| notifications::TIMER_MASK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only request this notification if the hash is in progress? I guess it's harmless to always have the bit active...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to match what we do throughout hubris
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay!
Introduce three new commands: `ReadHostFlash`, `StartHostFlashHash` and `GetHostFlashHash`. `ReadHostFlash` is self-explanatory. `StartHostFlashHash` starts an asynchronous calculation of a host flash slot and `GetHostFlashHash` retrieves it when available.
e9e99a3 to
d9d7823
Compare
No description provided.