-
Notifications
You must be signed in to change notification settings - Fork 210
Cosmo Host flash performance improvements #2293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
29a4964 to
a3c2adb
Compare
| let cnt = SECTOR_SIZE_BYTES / (PAGE_SIZE_BYTES as u32); | ||
| for i in 0..cnt { | ||
| let addr = addr.0 + i * (PAGE_SIZE_BYTES as u32); | ||
| self.clear_fifos(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could move everything except self.drv.addr.set_addr(addr) out of this loop, dunno how much of a performance impact that would have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What actually triggers the start of the transaction? is it the addr or the OpCode? I'm not a huge fan of breaking this up, especially since there doesn't seem to be a notable increase in performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if we're going to do this, it seems worth consulting the datasheet to be sure that the other configurations aren't cleared by the transaction completing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC writing the OpCode is what triggers the start of the transaction. The other fields are internal to the FPGA, so we're the ones that define their behavior (e.g. whether they clear), but I don't feel strongly about saving a few writes.
drv/cosmo-hf/src/main.rs
Outdated
| } | ||
| self.drv.instr.set_opcode(instr::QUAD_INPUT_PAGE_PROGRAM_4B); | ||
| self.wait_fpga_busy(); | ||
| /// This function will only return an error if the slice len is not a full |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense that this now takes a slice, but I'm not sure why this PR changes it to only accept a single page. Was the overhead of breaking the slice into page chunks too much? That seems implausible, but if that's the case, the signature should change to take &[u8; PAGE_SIZE] to enforce correctness on the callers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, now that you mentioned it I don't think this is correct for the persistent case since that's less than a page. page_program is limited to a page size and I don't think the current loop correctly implements the address for more than a page https://github.com/oxidecomputer/hubris/blob/master/drv/cosmo-hf/src/main.rs#L326-L354 (unless I've misread something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the lack of address updates seems wrong in the existing code 😅
Since persistent data writes < 1 page, how about fixing that in the existing (arbitrary-length) write?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I miss what "that" is referring to here but I was going to relax the requirement to be limited to a single page write. I think most of the code elsewhere is already breaking things down in to page sizes so it seems redundant to do that here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to allow flash_write to write arbitrary-length buffers (i.e. the previous implementation), but fix the bug that you noticed where addr isn't incremented between pages.
This has the advantage that the function would become infallible, so callers wouldn't need to think about error handling. If callers are already splitting stuff up into page-size transactions, this would add a tiny bit of overhead, but save an unwrap / unwrap_lite.
|
The tlvc PR I've had open for a while might be worth returning to for Cosmo perf work: it's probably not a large effect on performance one way or the other, but it does probably remove panics so that's always nice. I also have done some work on the Cosmo host flash APIs in a branch to remove panics here; I think it would require a little bit of more work to make it worth a PR, but it did also remove some panics so that's nice. (I think I also found one bug IIRC.) If you'd be willing to review it, I can take a look at freshening up that work. |
It's a little hard to say without actually seeing the code. I'm not opposed to removing panics (I've sometimes had a war against |
The commits is here but it's currently full of wrapping arithmetic to remove arithmetic panics that the compiler couldn't figure out, together with comments to explain why there's no chance of an overflow/wrap there: this is "correct" but not actually correct. Freshening this up would consist of replacing the wrapping arithmetic with such code transformations, together with checked or unchecked arithmetic as necessary, to properly explain the intended semantics of the code to the compiler, to enable it to remove the panics on its own and unsafely assert them when it can't. |
drv/cosmo-hf/src/main.rs
Outdated
| // We need to complete the read of the full PAGE_SIZE | ||
| // even if we find a non-erased byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that to ensure that the transaction has completed before trying to do something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was seeing what looked like incorrect behavior (mismatch of written vs expected data) without completing the full PAGE_SIZE read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth noting that a little more explicitly in this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that possibly because you just need to bleed down the FIFOs? The hardware here will stuff the remainign bytes into the FIFOs. If you have waited for the transaction to finish, you could optionally use the fifo clear bits to nuke them? This is basically a "you asked for PAGE_SIZE bytes, so you get PAGE_SIZE bytes into the response FIFO" I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the loop had self.clear_fifos() each time which I thought should be clearing everything. I'm also now questioning if this was actually incorrect given I also spent too long dealing with oxidecomputer/humility#585 . Perhaps I will check again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this again, I think I was missing another wait_for_fpga for the FPGA to actually be finished processing after draining the FIFO. With that in we don't actually see any difference in exiting out early vs just completing all the reads, probably because it still takes just as much time to drain. I'll tweak the comment.
a3c2adb to
1606f27
Compare
|
I opened #2296 from the perf work I had done earlier, fixed up to not use wrapping arithmetic all over the place. With those changes in place and the tlvc PR in use as well, there are only a handful of panics or unwraps in the cosmo-b-dev/dist/auxflash and at least none in any direct hot loops that I can see. Maybe some of that stuff is useful here? |
The Winbond flash chip on Cosmo has a unique flash erase algorithm that significantly penalizes erasing already erased sectors. Check for this case before running a sector erase command. Before real 3m31.962s user 0m4.325s sys 0m3.188s After real 2m0.817s user 0m4.380s sys 0m3.305s
1606f27 to
18b1754
Compare
Flash memory is set to all `1` when erased. We have to erase pages before writing so writing pages full of `1` is not a good use of CPU time. Before real 2m0.817s user 0m4.380s sys 0m3.305s After real 1m44.543s user 0m3.586s sys 0m2.710s
18b1754 to
a3be551
Compare
nathanaelhuffman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ty for the work here!
| self.drv.dummy_cycles.set_count(8); | ||
| self.drv.instr.set_opcode(instr::FAST_READ_QUAD_OUTPUT_4B); | ||
| let mut erased = true; | ||
| // Technicaly we could terminiate this loop early when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| // Technicaly we could terminiate this loop early when | |
| // Technically we could terminate this loop early when |
| /// This function will only return an error if the slice len is greater than | ||
| /// a page |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// This function will only return an error if the slice len is greater than | |
| /// a page |
| // Technicaly we could terminiate this loop early when | ||
| // we find the first non-erased byte but we still have | ||
| // to drain the FIFOs and wait for the FPGA which means | ||
| // there's no performance gain vs just reading everything | ||
| // in a loop here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this, this is a lot clearer. i noticed a couple typos and grammar nits:
| // Technicaly we could terminiate this loop early when | |
| // we find the first non-erased byte but we still have | |
| // to drain the FIFOs and wait for the FPGA which means | |
| // there's no performance gain vs just reading everything | |
| // in a loop here. | |
| // Technically, we could terminate this loop early when | |
| // we find the first non-erased byte, but we still have | |
| // to drain the FIFOs and wait for the FPGA, which means | |
| // there's no performance gain vs just reading everything | |
| // in a loop here. |
| for (i, b) in chunk.iter().enumerate() { | ||
| v[i] = *b; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suuuuper nitpick, please feel free to ignore me: i think that iter().enumerate() over chunk elides bounds-checking accesses to chunk, but not to v; i think if we did
| for (i, b) in chunk.iter().enumerate() { | |
| v[i] = *b; | |
| for (b, v_i) in chunk.iter().zip(v.iter_mut()) { | |
| *v_i = *b; |
we might elide both bounds checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i had to go godbolt this because i have autism or something, and yeah, the zippy one is nicer: https://godbolt.org/z/3avMbshn5
No description provided.