Conversation

@labbott (Collaborator) commented Nov 20, 2025

No description provided.

@labbott changed the title from "Host flash performance improvements" to "Cosmo Host flash performance improvements" on Nov 20, 2025
let cnt = SECTOR_SIZE_BYTES / (PAGE_SIZE_BYTES as u32);
for i in 0..cnt {
    let addr = addr.0 + i * (PAGE_SIZE_BYTES as u32);
    self.clear_fifos();
Collaborator

I think you could move everything except self.drv.addr.set_addr(addr) out of this loop, dunno how much of a performance impact that would have.

Collaborator Author

What actually triggers the start of the transaction? Is it the addr or the OpCode? I'm not a huge fan of breaking this up, especially since there doesn't seem to be a notable increase in performance.

Member

Yeah, if we're going to do this, it seems worth consulting the datasheet to be sure that the other configurations aren't cleared by the transaction completing?

Collaborator

IIRC writing the OpCode is what triggers the start of the transaction. The other fields are internal to the FPGA, so we're the ones that define their behavior (e.g. whether they clear), but I don't feel strongly about saving a few writes.
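
For illustration, here is a sketch of the hoisted shape under discussion, built from the register accesses in the quoted snippet. Whether the non-address configuration actually survives a completed transaction is exactly the datasheet question raised above, so treat this as hypothetical:

// Hypothetical: assumes the FPGA's non-address fields persist across
// transactions (the open question above). Only the address changes per
// page; writing the opcode is what kicks off each transaction.
let cnt = SECTOR_SIZE_BYTES / (PAGE_SIZE_BYTES as u32);
self.clear_fifos();
// ... other per-page-invariant configuration hoisted here ...
for i in 0..cnt {
    // ... load this page's data into the FIFO here ...
    self.drv.addr.set_addr(addr.0 + i * (PAGE_SIZE_BYTES as u32));
    self.drv.instr.set_opcode(instr::QUAD_INPUT_PAGE_PROGRAM_4B);
    self.wait_fpga_busy();
}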

}
self.drv.instr.set_opcode(instr::QUAD_INPUT_PAGE_PROGRAM_4B);
self.wait_fpga_busy();
/// This function will only return an error if the slice len is not a full
Collaborator

It makes sense that this now takes a slice, but I'm not sure why this PR changes it to only accept a single page. Was the overhead of breaking the slice into page chunks too much? That seems implausible, but if that's the case, the signature should change to take &[u8; PAGE_SIZE] to enforce correctness on the callers.

Collaborator Author

Hmmm, now that you mention it, I don't think this is correct for the persistent case, since that's less than a page. page_program is limited to a page size, and I don't think the current loop correctly implements the address for more than a page: https://github.com/oxidecomputer/hubris/blob/master/drv/cosmo-hf/src/main.rs#L326-L354 (unless I've misread something)

Collaborator

I agree that the lack of address updates seems wrong in the existing code 😅

Since persistent data writes < 1 page, how about fixing that in the existing (arbitrary-length) write?

Collaborator Author

I think I'm missing what "that" is referring to here, but I was going to relax the requirement to be limited to a single page write. I think most of the code elsewhere is already breaking things down into page sizes, so it seems redundant to do that here too.

Collaborator

My suggestion is to allow flash_write to write arbitrary-length buffers (i.e. the previous implementation), but fix the bug that you noticed where addr isn't incremented between pages.

This has the advantage that the function would become infallible, so callers wouldn't need to think about error handling. If callers are already splitting stuff up into page-size transactions, this would add a tiny bit of overhead, but save an unwrap / unwrap_lite.
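
A minimal sketch of that suggestion, assuming a hypothetical single-page page_program helper and a page-aligned starting address (names are illustrative, not the actual driver API):

fn flash_write(&mut self, mut addr: u32, data: &[u8]) {
    for chunk in data.chunks(PAGE_SIZE_BYTES) {
        // Program one page (a trailing chunk may be shorter, e.g. the
        // persistent-data case, which writes less than a page).
        self.page_program(addr, chunk);
        // The fix under discussion: advance the address between pages.
        addr += chunk.len() as u32;
    }
}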

@aapoalas (Contributor) commented Nov 20, 2025

The tlvc PR I've had open for a while might be worth returning to for Cosmo perf work: it probably doesn't have a large effect on performance one way or the other, but it probably does remove panics, so that's always nice.

I've also done some work on the Cosmo host flash APIs in a branch to remove panics here; I think it would require a little more work to make it worth a PR, but it did also remove some panics, so that's nice. (I think I also found one bug, IIRC.) If you'd be willing to review it, I can take a look at freshening up that work.

@labbott (Collaborator, Author) commented Nov 20, 2025

> I've also done some work on the Cosmo host flash APIs in a branch to remove panics here; I think it would require a little more work to make it worth a PR, but it did also remove some panics, so that's nice. (I think I also found one bug, IIRC.) If you'd be willing to review it, I can take a look at freshening up that work.

It's a little hard to say without actually seeing the code. I'm not opposed to removing panics (I've sometimes had a war against unwrap), but it can also be a trade-off that makes the code harder to follow.

@aapoalas (Contributor) commented Nov 20, 2025

> It's a little hard to say without actually seeing the code. I'm not opposed to removing panics (I've sometimes had a war against unwrap), but it can also be a trade-off that makes the code harder to follow.

The commits are here, but the branch is currently full of wrapping arithmetic to remove arithmetic panics that the compiler couldn't figure out, together with comments explaining why there's no chance of an overflow/wrap: this is "correct" but not actually correct. Freshening it up would consist of replacing the wrapping arithmetic with code transformations, plus checked or unchecked arithmetic as necessary, that properly explain the intended semantics of the code to the compiler, so that it can remove the panics on its own, with unsafe assertions where it can't.

Comment on lines 299 to 300
// We need to complete the read of the full PAGE_SIZE
// even if we find a non-erased byte
Member

Is that to ensure that the transaction has completed before trying to do something else?

Collaborator Author

Yes, I was seeing what looked like incorrect behavior (mismatch of written vs expected data) without completing the full PAGE_SIZE read.

Member

maybe worth noting that a little more explicitly in this comment?

@nathanaelhuffman (Contributor) commented Nov 20, 2025

is that possibly because you just need to bleed down the FIFOs? The hardware here will stuff the remaining bytes into the FIFOs. If you have waited for the transaction to finish, you could optionally use the fifo clear bits to nuke them? This is basically a "you asked for PAGE_SIZE bytes, so you get PAGE_SIZE bytes into the response FIFO" I think

Collaborator Author

So the loop had self.clear_fifos() each time, which I thought should be clearing everything. I'm also now questioning whether this was actually incorrect, given I also spent too long dealing with oxidecomputer/humility#585. Perhaps I will check again...

Collaborator Author

I tried this again; I think I was missing another wait_for_fpga for the FPGA to actually finish processing after draining the FIFO. With that in, we don't actually see any difference between exiting early and just completing all the reads, probably because it still takes just as much time to drain. I'll tweak the comment.

@aapoalas (Contributor)

I opened #2296 from the perf work I had done earlier, fixed up to not use wrapping arithmetic all over the place.

With those changes in place, and the tlvc PR in use as well, there are only a handful of panics or unwraps in cosmo-b-dev/dist/auxflash, and at least none in any direct hot loops that I can see. Maybe some of that stuff is useful here?

The Winbond flash chip on Cosmo has a unique flash erase algorithm
that significantly penalizes erasing already erased sectors.
Check for this case before running a sector erase command.

Before
real	3m31.962s
user	0m4.325s
sys	0m3.188s

After
real	2m0.817s
user	0m4.380s
sys	0m3.305s
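
As a sketch of what the check looks like (flash_read and sector_erase are hypothetical helper names over the existing driver paths, not the actual API):

fn erase_sector_if_needed(&mut self, addr: u32) {
    // Erased NOR flash reads back as all 0xFF. On this Winbond part,
    // erasing an already-erased sector is disproportionately slow, so
    // skip the erase when the sector is already blank.
    if !self.sector_is_erased(addr) {
        self.sector_erase(addr);
    }
}

fn sector_is_erased(&mut self, addr: u32) -> bool {
    let mut buf = [0u8; PAGE_SIZE_BYTES];
    let pages = SECTOR_SIZE_BYTES / (PAGE_SIZE_BYTES as u32);
    for i in 0..pages {
        self.flash_read(addr + i * (PAGE_SIZE_BYTES as u32), &mut buf);
        if buf.iter().any(|&b| b != 0xFF) {
            return false;
        }
    }
    true
}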
Flash memory is set to all `1` when erased. We have to erase pages
before writing, so writing pages full of `1` is not a good use of
CPU time.

Before
real    2m0.817s
user    0m4.380s
sys     0m3.305s

After
real	1m44.543s
user	0m3.586s
sys	0m2.710s
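
A sketch of the corresponding write-side check, reusing the hypothetical page_program helper from above; a page that is entirely 0xFF already matches the erased state, so programming it would change nothing:

fn write_page_skipping_erased(&mut self, addr: u32, page: &[u8]) {
    // An erased page is already all 1s, so writing all-0xFF data
    // is a no-op and can be skipped.
    if page.iter().all(|&b| b == 0xFF) {
        return;
    }
    self.page_program(addr, page);
}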
@nathanaelhuffman (Contributor) left a comment

ty for the work here!

self.drv.dummy_cycles.set_count(8);
self.drv.instr.set_opcode(instr::FAST_READ_QUAD_OUTPUT_4B);
let mut erased = true;
// Technicaly we could terminiate this loop early when
Collaborator

nit:

Suggested change:
- // Technicaly we could terminiate this loop early when
+ // Technically we could terminate this loop early when

Comment on lines +353 to +354
/// This function will only return an error if the slice len is greater than
/// a page
Collaborator

Suggested change:
- /// This function will only return an error if the slice len is greater than
- /// a page

Comment on lines +299 to +303
// Technicaly we could terminiate this loop early when
// we find the first non-erased byte but we still have
// to drain the FIFOs and wait for the FPGA which means
// there's no performance gain vs just reading everything
// in a loop here.
Member

thanks for adding this, this is a lot clearer. i noticed a couple typos and grammar nits:

Suggested change:
- // Technicaly we could terminiate this loop early when
- // we find the first non-erased byte but we still have
- // to drain the FIFOs and wait for the FPGA which means
- // there's no performance gain vs just reading everything
- // in a loop here.
+ // Technically, we could terminate this loop early when
+ // we find the first non-erased byte, but we still have
+ // to drain the FIFOs and wait for the FPGA, which means
+ // there's no performance gain vs just reading everything
+ // in a loop here.

Comment on lines +372 to +373
for (i, b) in chunk.iter().enumerate() {
    v[i] = *b;
Member

suuuuper nitpick, please feel free to ignore me: i think that iter().enumerate() over chunk elides bounds-checking accesses to chunk, but not to v; i think if we did

Suggested change:
- for (i, b) in chunk.iter().enumerate() {
-     v[i] = *b;
+ for (b, v_i) in chunk.iter().zip(v.iter_mut()) {
+     *v_i = *b;

we might elide both bounds checks?

Member

i had to go godbolt this because i have autism or something, and yeah, the zippy one is nicer: https://godbolt.org/z/3avMbshn5
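
For reference, a self-contained sketch of the two shapes being compared (standalone functions, not the driver code itself):

fn copy_indexed(chunk: &[u8], v: &mut [u8]) {
    for (i, b) in chunk.iter().enumerate() {
        // The compiler can prove `chunk[i]` is in bounds, but the
        // bounds check on `v[i]` remains (and panics if v is shorter).
        v[i] = *b;
    }
}

fn copy_zipped(chunk: &[u8], v: &mut [u8]) {
    // zip stops at the shorter of the two slices, so both bounds
    // checks can be elided, at the cost of silently truncating
    // instead of panicking on a length mismatch.
    for (b, v_i) in chunk.iter().zip(v.iter_mut()) {
        *v_i = *b;
    }
}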
