cpu/rp2350_common/periph/uart: fix PL011 FIFO handling#22269
Conversation
|
@Teufelchen1 Done, sorry about the friction. Both your comment and @AnnsAnns's landed while all three PRs were still in draft; I'd marked them ready for review at almost the same time without realising the comments had come in. The descriptions on all three PRs are now rewritten using Explicit AI declaration: this was done with Claude Code (Anthropic, claude-opus-4-7) in agent mode with my review on every commit. The PL011 fixes were checked against the RP2350 datasheet and PL011 TRM; the 1000-round soak test was run on real Pi Pico 2 H hardware before opening the PR. Happy to take any further changes. |
| * (~87 us at 115200 baud) causes an overrun and the byte is silently | ||
| * dropped. With FIFOs on, RX gets a 32-byte buffer, eliminating drop | ||
| * during normal stdio printing or concurrent TX bursts. Required for | ||
| * reliable two-board UART framing on RP2350. */ |
There was a problem hiding this comment.
I assume this code is based on 12.1.7?
There was a problem hiding this comment.
Actually it's §12.1.3 (Operation, FIFO vs character mode behavior) and §12.1.8 (UARTLCR_H register description) for the FEN bit, not §12.1.7. Citations are in the comment now.
| /* Counter of single-byte UART RX errors (parity/break/framing). The | ||
| * original driver printed via puts() inside the ISR — at 115200 baud | ||
| * that print takes ~4 ms, far longer than the inter-byte budget, and | ||
| * stalled the ISR so badly that the next bytes were silently lost. | ||
| * We now (a) count instead of print, and (b) PASS the byte through | ||
| * to the rx_cb anyway — keeping the byte stream aligned (bit-level | ||
| * corruption is recoverable at protocol layer; byte loss is much | ||
| * harder to recover from because it desyncs framing). */ | ||
| volatile uint32_t rp2350_uart_rx_error_count; |
There was a problem hiding this comment.
I don't full understand this, did you receive that many writes of that debug put?
I generally agree that this put should have probably been a debug print but I have not received this print a single time (as can be seen by the fact that I even forgot to change it to [rp2350]).
Are you doing something specific to trigger that print so often?
There was a problem hiding this comment.
Let me re-state more precisely what I have evidence for and what I don't.
Empirical: in a 1000-round mutual-attestation soak (1534 s, ~17.7 M received bytes between two Pi Pico 2 H boards at 115200 baud) rp2350_uart_rx_error_count reached 1. That counts the PL011 hardware raising a BE / PE / FE flag in UARTDR for one byte. All 999 completed rounds passed; the error was absorbed at the protocol layer (the corrupt byte was passed through to the rx_cb and either fell into a non-critical position or its frame's signature check rejected and retried, I do not have framing-level traces from the soak to distinguish those two). I have not investigated the root cause of that single event.
What I therefore do not claim: I don't have a per-round rate strong enough to call this rare or common. I haven't constructed a case that fires it on demand. The single event might be line noise, a clock-drift glitch, or anything else, I don't know. So your observation that the print never fires in your testing is consistent with mine: under any reasonable test the trigger condition is at least extremely rare, but I'm not making a quantitative claim about its rate.
What I am claiming, and where the change comes from: if the trigger fires at any rate, the original puts() path has a consequence that the counter-based path does not. The original puts() string "[rpx0xx] uart RX error (parity, break, or framing error" plus the trailing newline is 56 bytes. At 115200 baud (10 bits per byte, ~87 µs per byte), writing 56 bytes from the RX ISR blocks for ~4.9 ms. While the ISR is blocked the peer keeps clocking bytes in at 87 µs each, about 56 more bytes, but the PL011 RX FIFO holds 32 bytes with FEN on or 1 byte with FEN off. So a single fire under the original code would cause FIFO overflow and silent loss of ~20 to 56 subsequent bytes.
The byte that triggered the event is already corrupt (one bit on the wire was wrong); bit-level corruption is recoverable at the protocol layer via the next signature/CRC check. Byte loss of the next ~56 bytes is harder: a length-prefixed frame that expected 2696 bytes but only received 2640 desyncs the receiver's parser for everything that follows, and depending on the protocol may require resync logic or never recover. So under the original code one rare BE/PE/FE event amplifies a 1-bit corruption into ~56 bytes of frame-desyncing loss; under the counter-based code the same event increments a counter, passes the (still-corrupt) byte through to rx_cb, the application's integrity check rejects the frame, the sender retries, and the bounded-time ISR contract is preserved.
Cost of the fix: an atomic increment plus a register read, both ISR-safe and bounded-time. The rp2350_uart_rx_error_count symbol is volatile and exposed for diagnostic sampling without printing anything from the ISR.
I'm happy to run a longer test (10,000 rounds, ~4 hours) if you'd like a stronger data point on the rate, but the argument above doesn't depend on the rate, only on the consequence per fire, so I haven't done it preemptively. The comment block above the counter has been updated to phrase this as preventive-not-reactive (the previous wording read as though I had observed the stall, which I had not).
Six fixes to the RP2350 UART driver addressing byte loss and ISR stalls at 115200 baud during sustained two-board UART traffic. The RP2350 UART is derived from ARM PrimeCell PL011, so the underlying bit semantics are the standard PL011 ones; the fixes are RP2350- specific only in that this driver implements them for the RP2350 BSP. They would apply to any PL011-derived UART driver. - Enable RX/TX FIFOs (FEN bit in UARTLCR_H). Without FEN the UART is in character mode with a 1-byte holding register; any RX ISR latency above one byte time (~87 us at 115200) overruns and drops a byte. - Enable RX timeout interrupt (RTIM in UARTIMSC) so trailing sub-trigger bytes are delivered promptly. - Drain the entire RX FIFO per ISR invocation. - Replace puts() in the ISR error path with a volatile counter (rp2350_uart_rx_error_count). puts() at 115200 baud is ~4 ms, far longer than the inter-byte budget, so a single fire would stall the ISR enough to overrun the next several bytes. This is a preventive change rather than a fix for an observed bug. - Fix TX polling: use UARTFR.TXFF + UARTFR.BUSY instead of UARTRIS.TXRIS (which is an edge condition that may never occur for short writes). - Add register-level documentation comments with RP2350 datasheet section citations (12.1.3, 12.1.6, 12.1.8). No public API change. Tested with continuous UART crosswire traffic at 115200 baud between two Pi Pico 2 H boards (1000-round soak, zero application- level failures, one hardware-flagged byte error transparently absorbed via the counter).
2026455 to
be7eeda
Compare
Correct the byte-count and timing in the error-counter comment block (56 bytes / ~4.9 ms instead of 46 / ~4 ms) so the numbers match the actual original puts() string length.
|
Hello, once again thank you for being wilful to contribute to RIOT, we do appreciate it. Due to various reasons, such as the currently unresolved legal issues around AI code combined with the very high degree of AI assistance, among other things, we have decided that we can not accept this PR as of now. This is not meant to stop you from contributing to RIOT. For example, if you want to PR code written by yourself, we are open to review and help you. |
|
On a personal note, esp. the fix that enabled the FIFO for the UART would be quite nice to have. If you are willing to do so, I'd highly appreciate a PR for that written by yourself. Thank you for noticing that 😄 |
Contribution description
Six fixes to the RP2350 PL011 UART driver in
cpu/rp2350_common/periph/uart.c. The bugs being fixed cause byteloss and ISR stalls at 115200 baud during sustained traffic; they
are not RP2350-specific PL011 quirks (they would apply to any
PL011-using BSP) but the symptoms are visible enough on RP2350 in
two-board UART setups to be worth fixing here.
PL011 has only a 1-byte receive holding register; any RX ISR
latency above one byte time (~87 us at 115200 baud) silently
overruns. With FIFOs on, RX gets a 32-byte buffer.
fires when the RX FIFO is non-empty but below the trigger level
for ~32 baud-clocks. Without it, the trailing < trigger bytes
of a burst sit in the FIFO until enough new bytes arrive, which
never happens at the end of a frame.
ISR read one byte and returned, leaving any remaining bytes in
the FIFO until the next interrupt edge.
puts()in the ISR error path with a volatilecounter.
puts()at 115200 baud takes ~4 ms, far longer thanthe inter-byte budget, and stalled the ISR enough to drop
subsequent bytes. The new counter (
rp2350_uart_rx_error_count)is exposed for diagnostics and the byte is passed through to the
rx_cbso framing stays aligned.UARTRIS.TXRIS,which in FIFO mode signals "FIFO crossed the trigger threshold
downward", an edge condition that may never occur for short
writes, hanging the loop. Replaced with
UARTFR.TXFF(wait whilefull) per byte plus
UARTFR.BUSY(wait until idle) after thelast byte.
the non-obvious bits (FEN, RTIM, TXFF/BUSY semantics, the
puts-in-ISR anti-pattern).
No public API change; no header churn; only the RP2350 BSP file is
touched.
Testing procedure
Two Pi Pico 2 H boards wired with a UART crosswire (each board's
GP8/GP9 to the other's GP9/GP8) and a USB-UART bridge for stdio
observation. An application-level mutual-attestation harness
exchanged 2696-byte signed quotes between the two boards at
115200 baud, 1000 round-trips over a 1534 s run. Result: every
completed round passed (the 1 round counted as not-OK was simply
in-flight when the test stopped). One byte-level RX error was
flagged by the PL011 hardware (FE/PE/BE bit set in
UARTDR)across ~17.7 million received bytes; with the new ISR (counter +
pass-through), that byte was delivered to the application and
absorbed transparently: i.e., zero application-level retries or
failures.
On master (without these fixes), the same harness drops bytes
within a handful of rounds at 115200 baud. The simplest
reproduction without the harness is any back-to-back two-board
UART exchange that issues a
printffrom inside the RX ISR path -which the original driver did via
puts()on parity / framingerrors.
Before this PR is applied: the PL011 driver in master drops bytes
under sustained 115200-baud RX load and can hang the TX poll loop
on short writes (see the symptoms enumerated under "Contribution
description" above). With the PR applied: every application using
the RP2350 UART at 115200 baud should be reliable, and the
rp2350_uart_rx_error_countsymbol is available for diagnosticintrospection. No public API change.
tests/periph/uartcannot currently be used as a test vehicle onrpi-pico-2-armbecause the BSP does not yet provideperiph_timer(whichztimer_msec, used by that test, dependson). Once #22270 lands, that test path will be available as an
additional validation route.
Issues/PRs references
Companion PRs from the same investigation, independent merge order:
pkg/pqm4: ML-DSA-44 (FIPS 204) packagecpu/rp2350_common:periph_timerdriver. Once thatlands,
tests/periph/uartwill build forBOARD=rpi-pico-2-armand provide an additional CI-side validation path for these
fixes.
Declaration of AI-Tools / LLMs usage:
AI-Tools / LLMs that were used are:
review for code drafting, PL011 datasheet cross-referencing,
PR description authoring, and test orchestration. The original
uart.cis by Tom Hert (HAW Hamburg): these are diagnosticfixes layered on top. All six fixes were chosen and reviewed by
me; every register-level claim was checked against the RP2350
datasheet and the PL011 TRM; the 1000-round soak test was run on
real hardware before the PR was opened.