Cosmo Host flash performance improvements #2293

labbott · 2025-11-20T15:23:32Z

No description provided.

drv/cosmo-hf/src/hf.rs

mkeeter · 2025-11-20T16:10:58Z

drv/cosmo-hf/src/main.rs

+        let cnt = SECTOR_SIZE_BYTES / (PAGE_SIZE_BYTES as u32);
+        for i in 0..cnt {
+            let addr = addr.0 + i * (PAGE_SIZE_BYTES as u32);
+            self.clear_fifos();


I think you could move everything except self.drv.addr.set_addr(addr) out of this loop, dunno how much of a performance impact that would have.

What actually triggers the start of the transaction? is it the addr or the OpCode? I'm not a huge fan of breaking this up, especially since there doesn't seem to be a notable increase in performance.

Yeah, if we're going to do this, it seems worth consulting the datasheet to be sure that the other configurations aren't cleared by the transaction completing?

IIRC writing the OpCode is what triggers the start of the transaction. The other fields are internal to the FPGA, so we're the ones that define their behavior (e.g. whether they clear), but I don't feel strongly about saving a few writes.

mkeeter · 2025-11-20T16:13:57Z

drv/cosmo-hf/src/main.rs

-            }
-            self.drv.instr.set_opcode(instr::QUAD_INPUT_PAGE_PROGRAM_4B);
-            self.wait_fpga_busy();
+    /// This function will only return an error if the slice len is not a full


It makes sense that this now takes a slice, but I'm not sure why this PR changes it to only accept a single page. Was the overhead of breaking the slice into page chunks too much? That seems implausible, but if that's the case, the signature should change to take &[u8; PAGE_SIZE] to enforce correctness on the callers.

Hmmm, now that you mentioned it I don't think this is correct for the persistent case since that's less than a page. page_program is limited to a page size and I don't think the current loop correctly implements the address for more than a page https://github.com/oxidecomputer/hubris/blob/master/drv/cosmo-hf/src/main.rs#L326-L354 (unless I've misread something)

I agree that the lack of address updates seems wrong in the existing code 😅

Since persistent data writes < 1 page, how about fixing that in the existing (arbitrary-length) write?

I think I miss what "that" is referring to here but I was going to relax the requirement to be limited to a single page write. I think most of the code elsewhere is already breaking things down in to page sizes so it seems redundant to do that here too.

My suggestion is to allow flash_write to write arbitrary-length buffers (i.e. the previous implementation), but fix the bug that you noticed where addr isn't incremented between pages.

This has the advantage that the function would become infallible, so callers wouldn't need to think about error handling. If callers are already splitting stuff up into page-size transactions, this would add a tiny bit of overhead, but save an unwrap / unwrap_lite.

aapoalas · 2025-11-20T17:07:51Z

The tlvc PR I've had open for a while might be worth returning to for Cosmo perf work: it's probably not a large effect on performance one way or the other, but it does probably remove panics so that's always nice.

I also have done some work on the Cosmo host flash APIs in a branch to remove panics here; I think it would require a little bit of more work to make it worth a PR, but it did also remove some panics so that's nice. (I think I also found one bug IIRC.) If you'd be willing to review it, I can take a look at freshening up that work.

labbott · 2025-11-20T17:17:24Z

I also have done some work on the Cosmo host flash APIs in a branch to remove panics here; I think it would require a little bit of more work to make it worth a PR, but it did also remove some panics so that's nice. (I think I also found one bug IIRC.) If you'd be willing to review it, I can take a look at freshening up that work.

It's a little hard to say without actually seeing the code. I'm not opposed to removing panics (I've sometimes had a war against unwrap) but it can also be a trade off for having the code be harder to follow.

aapoalas · 2025-11-20T17:30:43Z

It's a little hard to say without actually seeing the code. I'm not opposed to removing panics (I've sometimes had a war against unwrap) but it can also be a trade off for having the code be harder to follow.

The commits is here but it's currently full of wrapping arithmetic to remove arithmetic panics that the compiler couldn't figure out, together with comments to explain why there's no chance of an overflow/wrap there: this is "correct" but not actually correct. Freshening this up would consist of replacing the wrapping arithmetic with such code transformations, together with checked or unchecked arithmetic as necessary, to properly explain the intended semantics of the code to the compiler, to enable it to remove the panics on its own and unsafely assert them when it can't.

drv/cosmo-hf/src/main.rs

hawkw · 2025-11-20T18:00:16Z

drv/cosmo-hf/src/main.rs

+            // We need to complete the read of the full PAGE_SIZE
+            // even if we find a non-erased byte


Is that to ensure that the transaction has completed before trying to do something else?

Yes, I was seeing what looked like incorrect behavior (mismatch of written vs expected data) without completing the full PAGE_SIZE read.

maybe worth noting that a little more explicitly in this comment?

is that possibly because you just need to bleed down the FIFOs? The hardware here will stuff the remainign bytes into the FIFOs. If you have waited for the transaction to finish, you could optionally use the fifo clear bits to nuke them? This is basically a "you asked for PAGE_SIZE bytes, so you get PAGE_SIZE bytes into the response FIFO" I think

so the loop had self.clear_fifos() each time which I thought should be clearing everything. I'm also now questioning if this was actually incorrect given I also spent too long dealing with oxidecomputer/humility#585 . Perhaps I will check again...

I tried this again, I think I was missing another wait_for_fpga for the FPGA to actually be finished processing after draining the FIFO. With that in we don't actually see any difference in exiting out early vs just completing all the reads, probably because it still takes just as much time to drain. I'll tweak the comment.

drv/cosmo-hf/src/main.rs

aapoalas · 2025-11-20T22:28:15Z

I opened #2296 from the perf work I had done earlier, fixed up to not use wrapping arithmetic all over the place.

With those changes in place and the tlvc PR in use as well, there are only a handful of panics or unwraps in the cosmo-b-dev/dist/auxflash and at least none in any direct hot loops that I can see. Maybe some of that stuff is useful here?

The Winbond flash chip on Cosmo has a unique flash erase algorithm that significantly penalizes erasing already erased sectors. Check for this case before running a sector erase command. Before real 3m31.962s user 0m4.325s sys 0m3.188s After real 2m0.817s user 0m4.380s sys 0m3.305s

Flash memory is set to all `1` when erased. We have to erase pages before writing so writing pages full of `1` is not a good use of CPU time. Before real 2m0.817s user 0m4.380s sys 0m3.305s After real 1m44.543s user 0m3.586s sys 0m2.710s

nathanaelhuffman

ty for the work here!

mkeeter · 2025-11-21T21:09:57Z

drv/cosmo-hf/src/main.rs

+            self.drv.dummy_cycles.set_count(8);
+            self.drv.instr.set_opcode(instr::FAST_READ_QUAD_OUTPUT_4B);
+            let mut erased = true;
+            // Technicaly we could terminiate this loop early when


nit:

Suggested change

// Technicaly we could terminiate this loop early when

// Technically we could terminate this loop early when

mkeeter · 2025-11-21T21:10:13Z

drv/cosmo-hf/src/main.rs

+    /// This function will only return an error if the slice len is greater than
+    /// a page


Suggested change

/// This function will only return an error if the slice len is greater than

/// a page

hawkw · 2025-11-21T21:01:27Z

drv/cosmo-hf/src/main.rs

+            // Technicaly we could terminiate this loop early when
+            // we find the first non-erased byte but we still have
+            // to drain the FIFOs and wait for the FPGA which means
+            // there's no performance gain vs just reading everything
+            // in a loop here.


thanks for adding this, this is a lot clearer. i noticed a couple typos and grammar nits:

Suggested change

// Technicaly we could terminiate this loop early when

// we find the first non-erased byte but we still have

// to drain the FIFOs and wait for the FPGA which means

// there's no performance gain vs just reading everything

// in a loop here.

// Technically, we could terminate this loop early when

// we find the first non-erased byte, but we still have

// to drain the FIFOs and wait for the FPGA, which means

// there's no performance gain vs just reading everything

// in a loop here.

hawkw · 2025-11-21T21:04:09Z

drv/cosmo-hf/src/main.rs

+                for (i, b) in chunk.iter().enumerate() {
+                    v[i] = *b;


suuuuper nitpick, please feel free to ignore me: i think that iter().enumerate() over chunk elides bounds-checking accesses to chunk, but not to v; i think if we did

Suggested change

for (i, b) in chunk.iter().enumerate() {

v[i] = *b;

for (b, v_i) in chunk.iter().zip(v.iter_mut()) {

*v_i = *b;

we might elide both bounds checks?

i had to go godbolt this because i have autism or something, and yeah, the zippy one is nicer: https://godbolt.org/z/3avMbshn5

labbott changed the title ~~Host flash performance improvements~~ Cosmo Host flash performance improvements Nov 20, 2025

labbott force-pushed the hf_perf_round_2 branch from 29a4964 to a3c2adb Compare November 20, 2025 15:40

labbott requested review from hawkw, mkeeter and nathanaelhuffman November 20, 2025 15:44

mkeeter reviewed Nov 20, 2025

View reviewed changes

drv/cosmo-hf/src/hf.rs Outdated Show resolved Hide resolved

mkeeter reviewed Nov 20, 2025

View reviewed changes

hawkw reviewed Nov 20, 2025

View reviewed changes

labbott force-pushed the hf_perf_round_2 branch from a3c2adb to 1606f27 Compare November 20, 2025 18:22

labbott force-pushed the hf_perf_round_2 branch from 1606f27 to 18b1754 Compare November 21, 2025 19:28

Skip programming erased pages

a3be551

Flash memory is set to all `1` when erased. We have to erase pages before writing so writing pages full of `1` is not a good use of CPU time. Before real 2m0.817s user 0m4.380s sys 0m3.305s After real 1m44.543s user 0m3.586s sys 0m2.710s

labbott force-pushed the hf_perf_round_2 branch from 18b1754 to a3be551 Compare November 21, 2025 19:30

nathanaelhuffman approved these changes Nov 21, 2025

View reviewed changes

mkeeter reviewed Nov 21, 2025

View reviewed changes

mkeeter approved these changes Nov 21, 2025

View reviewed changes

hawkw reviewed Nov 21, 2025

View reviewed changes

hawkw approved these changes Nov 21, 2025

View reviewed changes

		// We need to complete the read of the full PAGE_SIZE
		// even if we find a non-erased byte

	// Technicaly we could terminiate this loop early when
	// Technically we could terminate this loop early when

		/// This function will only return an error if the slice len is greater than
		/// a page

Cosmo Host flash performance improvements #2293

Are you sure you want to change the base?

Cosmo Host flash performance improvements #2293

Uh oh!

Conversation

labbott commented Nov 20, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aapoalas commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

labbott commented Nov 20, 2025

Uh oh!

aapoalas commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nathanaelhuffman Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

aapoalas commented Nov 20, 2025

Uh oh!

nathanaelhuffman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

aapoalas commented Nov 20, 2025 •

edited

Loading

aapoalas commented Nov 20, 2025 •

edited

Loading

nathanaelhuffman Nov 20, 2025 •

edited

Loading