Bluetooth: Controller: Fix Extended Scanning for slow CPUs #98267

cvinayak · 2025-10-26T16:10:49Z

Add implementation to gracefully handle failure due to high Radio ISR latencies due to higher CPU usage in LLL execution context just before the handling of Radio IRQs. Higher CPU time usage in LLL execution context can be due to slow CPUs, example, nRF51x SoC Series with 16 MHz ARM Cortex-M0 CPU.

cvinayak · 2025-10-26T16:17:05Z

This pull request introduces a robust handling mechanism for radio ISR latency issues during auxiliary PDU reception setup in the Bluetooth controller. It adds logic to detect and gracefully handle missed hard realtime deadlines, reducing the risk of assertion failures and improving system resilience on slower CPUs.

Improvements to radio ISR latency handling:

Added a countdown mechanism (latency_skip_countdown) to tolerate up to three consecutive radio setup deadline misses before triggering an assertion, allowing for graceful recovery from transient latency spikes.
Implemented logic to gracefully close the radio event and resume scanning if the setup deadline is missed, instead of immediately asserting, which helps maintain system stability.
Enhanced debugging and profiling support by providing verbose assertion messages when profiling is enabled, aiding in diagnosing persistent latency issues.

cvinayak · 2025-10-26T16:17:14Z

Please analyse the pull request, give a summary, any possible regression failures or drawbacks of it?

Summary

This draft PR (98267) changes subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan_aux.c to handle failures when radio_tmr_start_us() does not return the expected start time (start_us != aux_start_us + 1). The motivation is high Radio ISR latency on slow CPUs (e.g., nRF51 Cortex-M0 @ 16 MHz) that causes missed hard realtime deadlines for auxiliary PDU reception.
Instead of immediately asserting on a missed deadline, the patch:
- Introduces a static latency_skip_countdown (initialized from latency_skip_max = 3) to allow a few consecutive misses.
- On a miss while countdown > 0: decrements the countdown, gracefully closes the radio event, allocates a NODE_RX_TYPE_EXT_AUX_RELEASE and schedules it to ULL, sets radio ISR to resume primary scanning, calls radio_disable(), and returns.
- If the countdown reaches zero: triggers an assertion. If CONFIG_BT_CTLR_PROFILE_ISR is enabled it includes profiling data in the assert, otherwise it does LL_ASSERT_ERR(false).
- When radio_tmr_start_us() succeeds, it resets latency_skip_countdown to the max.
Net effect: tolerate a small number of consecutive radio setup misses (gracefully close and continue scanning) instead of immediate hard assert/failure.

Possible regression failures and drawbacks

Lost auxiliary PDUs / incomplete advertiser data
- Gracefully closing the auxiliary event on misses means auxiliary PDUs will be skipped. This can lead to lost or incomplete extended advertising data (longer AD structures carried in Aux PDUs). Higher-level consumers expecting full extended advertising data might see missing fragments or incomplete data streams.
Masking intermittent timing problems
- Previously an immediate assert on start_us mismatch flagged the timing issue deterministically. Allowing up to N skips can hide intermittent timing regressions, making debugging and regression detection harder (failures may happen silently until the skip budget is exhausted).
Static countdown is global to the function (shared across contexts)
- latency_skip_countdown is a static local variable with file/function scope; it is not per-LLL instance. If multiple scan contexts (multiple lll instances) or multiple radios exist, the countdown is shared across all invocations of this function, which may be incorrect:
  - One scanner/radio experiencing a miss will decrement the shared counter and affect other scanners.
  - Better behavior would be a per-lll (per-context) counter so missed deadlines are tracked per-scan-instance.
Potential concurrency / SMP concerns
- The static countdown variable is not protected by atomic operations. If this code can run on an SMP system or be re-entered from different contexts (rare for these SoCs but possible in general builds), there could be data races on latency_skip_countdown. Even if LLL is only on one CPU in typical Nordic platforms, this is an assumption that reduces generality.
Race about lll->lll_aux usage
- The patch sets node_rx2->rx_ftr.param = lll and node_rx2->rx_ftr.lll_aux = lll->lll_aux and comments about the ULL assigning lll->lll_aux later. This relies on assumptions about ordering and on the lifetime of lll and lll->lll_aux:
  - If ULL later changes or frees lll->lll_aux, the pointer stored in node_rx2 may become stale.
  - The patch's comment acknowledges the race but assumes ull_scan_aux_release will handle it correctly; this must be validated across all race scenarios (especially corner cases where ULL assigns lll->lll_aux concurrently).
  - If ull_scan_aux_release dereferences lll_aux under certain races, there is a risk of use-after-free or null/dangling pointer access.
Assertion behavior and test visibility
- When countdown is exhausted the patch still asserts (LL_ASSERT_MSG or LL_ASSERT_ERR). That means on persistent timing failures the behavior remains a hard stop — which preserves safety but may be disruptive in production. If profiling is enabled, it produces more verbose asserts, but otherwise the assert is uninformative.
- For systems where a softer failure handling is preferred (e.g., degrade scanning instead of assert), this may be undesirable.
Allocation on ISR path
- The code allocates node_rx2 = ull_pdu_rx_alloc() in the LLL/ISR flow. Depending on how ull_pdu_rx_alloc is implemented, doing allocations in this execution context could be problematic (blocking, failure under low memory). The code does LL_ASSERT_ERR(node_rx2) — which would assert if allocation fails. This was also present in the patch and is a point to watch.
Test coverage gaps
- This fix targets slow CPU targets (e.g., nRF51). If not tested across:
  - Multiple simultaneous scanners,
  - SMP-enabled builds,
  - Platforms with different radio driver timing semantics,
    then regressions could appear only on those configurations.

Impact and severity

Functional impact: medium — can cause missed extended advertising data fragments or occasional missed Aux PDUs; may be acceptable if the intended trade-off is to avoid hard faults on slow CPUs.
Stability impact: low–medium — the change avoids immediate asserts on occasional misses, but shared/static counter and races could cause unexpected behavior across contexts.
Performance impact: minimal additional runtime overhead, but potential correctness impact as described.

Notes / suggestions (for consideration)

Make the skip counter per-LLL/context (e.g., store in the lll struct) rather than a single static so misses are tracked per-scan instance.
Consider using an atomic/lock or ensure LLL exclusivity guarantees that protect latency_skip_countdown if SMT/SMP is possible.
Add telemetry/logging (CONFIG_BT_CTLR_PROFILE_ISR-like or debug logs) for the skipped events to make intermittent timing issues visible without asserting immediately.
Add tests that exercise extended scanning on slow CPU targets and multi-scanner scenarios to verify no dangling pointer or ULL race conditions.
Re-evaluate doing memory allocation in the LLL/ISR context; if necessary, move to a deferred action or ensure allocation is non-blocking and safe.

Overall assessment

The change is a pragmatic mitigation to prevent hard failures on slow CPUs by gracefully closing auxiliary radio events on occasional timing misses, while still asserting on repeated failures. It reduces immediate crash risk on slow/higher-latency hardware but introduces potential semantic changes (skipped Aux PDUs), a shared static counter that may be wrong for multi-context use, and possible race conditions that should be validated with targeted tests.

Copilot

Pull Request Overview

This PR adds graceful handling for high Radio ISR latencies on slower CPUs (e.g., nRF51x with 16 MHz ARM Cortex-M0) during extended Bluetooth scanning. When the radio setup fails to meet hard real-time deadlines due to CPU limitations, the code now implements a retry mechanism with countdown before asserting, rather than immediately failing.

Key Changes:

Implements a countdown-based retry mechanism (up to 3 attempts) when radio setup timing fails
Adds graceful radio event cleanup paths for both auxiliary and non-auxiliary scanning contexts
Enhances assertion diagnostics with ISR latency information when profiling is enabled

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-10-26T19:01:11Z

subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan_aux.c

+	const uint8_t latency_skip_max = 3U;
+	static uint8_t latency_skip_countdown = latency_skip_max;


The static variable latency_skip_countdown is not thread-safe. In interrupt contexts or concurrent execution scenarios, multiple invocations could race on this shared state, leading to incorrect countdown behavior. Consider using atomic operations or re-architecting to avoid shared mutable state across ISR invocations.

Add implementation to gracefully handle failure due to high Radio ISR latencies due to higher CPU usage in LLL execution context just before the handling of Radio IRQs. Higher CPU time usage in LLL execution context can be due to slow CPUs, example, nRF51x SoC Series with 16 MHz ARM Cortex-M0 CPU. Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>

cvinayak · 2025-10-27T16:41:42Z

This pull request introduces several improvements and bug fixes to the Bluetooth controller's low-latency features, configuration, and radio event handling. The changes focus on making low-latency options more robust, improving error handling for radio event setup, and cleaning up configuration dependencies.

Low-latency features and configuration:

Refactored Kconfig options for low-latency features: BT_CTLR_LOW_LAT_ULL and BT_CTLR_LOW_LAT_ULL_DONE now default to enabled when BT_CTLR_LOW_LAT is set, simplifying configuration and ensuring consistent behavior. Removed unnecessary select statements for better clarity. [1] [2]
Removed advanced feature and low-latency related options from overlay_bbc_microbit-bt_ll_sw_split.conf, cleaning up unused or redundant configuration lines.

Radio event handling and error resilience:

Improved auxiliary PDU radio setup in lll_scan_aux_isr_aux_setup: Added logic to gracefully handle and recover from missed hard real-time deadlines due to ISR latency, with a countdown mechanism and assertion checks to catch consecutive failures. This reduces the risk of system instability due to ISR timing issues.
Updated common_prepare_cb in lll_scan.c to only call lll_prepare_done when not resuming, ensuring correct event lifecycle management.

Low-latency ticker synchronization:

In ull_ticker_status_take, enabled mayfly synchronization between LLL and ULL when low-latency mode is active and priorities match, improving timing accuracy and reducing latency in ticker operations.

sonarqubecloud · 2025-10-27T19:28:16Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Thalley · 2025-11-04T16:37:53Z

subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan_aux.c

+	 *       Detect the latency and gracefully close the radio event. If there is consecutive
+	 *       failure, then lets catch it in an assertion check.
+	 */
+	const uint8_t latency_skip_max = 3U;


Why the magic number 3 here?

Constant value (instead of define far away from locality of reference) to detect and assert consecutively skipped reception.

Should I have a Kconfig for this? value of greater than 0 for nRF51 and 0 for other faster SoCs?

I'm more interested in why the value is 3, rather than any other value. It's OK to have constants defined like this instead of #defines etc., but the actual value needs to be documented on why it is that specific value, unless it's obvious (which it isn't in this case IMO)

Thalley · 2025-11-04T16:39:13Z

subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan_aux.c

+			latency_skip_countdown--;
+
+			/* Gracefully close the radio event */
+			if (lll_aux) {


Suggested change

if (lll_aux) {

if (lll_aux != NULL) {

Thalley · 2025-11-04T16:39:37Z

subsys/bluetooth/controller/ll_sw/nordic/lll/lll_scan_aux.c

+				struct node_rx_pdu *node_rx2;
+
+				node_rx2 = ull_pdu_rx_alloc();
+				LL_ASSERT_ERR(node_rx2);


Suggested change

LL_ASSERT_ERR(node_rx2);

LL_ASSERT_ERR(node_rx2 != NULL);

cvinayak marked this pull request as ready for review October 26, 2025 17:00

zephyrbot added area: Bluetooth Controller area: Bluetooth labels Oct 26, 2025

zephyrbot requested review from Tronil, carlescufi, erbr-ot, ppryga, thoh-ot and wopu-ot October 26, 2025 17:02

zephyrbot assigned cvinayak Oct 26, 2025

cvinayak requested a review from Copilot October 26, 2025 19:00

Copilot AI reviewed Oct 26, 2025

View reviewed changes

cvinayak added the bug The issue is a bug, or the PR is fixing a bug label Oct 27, 2025

cvinayak added this to the v4.4.0 milestone Oct 27, 2025

cvinayak force-pushed the github_extended_scan_slow_cpu_fix branch from 0dbd7c4 to c37debb Compare October 27, 2025 16:02

zephyrbot added area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Samples Samples labels Oct 27, 2025

zephyrbot requested review from HaavardRei, PavelVPV, Thalley, alwa-nordic, hermabe, jhedberg, rugeGerritsen and sjanc October 27, 2025 16:40

cvinayak force-pushed the github_extended_scan_slow_cpu_fix branch from ba17b0a to c3f2124 Compare October 27, 2025 19:22

This was referenced Oct 29, 2025

samples: Bluetooth: observer: Fix Extended Scanning on BBC Micro:bit #98358

Merged

Bluetooth: Controller: Place holder for a list of known assertions #82399

Open

Thalley reviewed Nov 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bluetooth: Controller: Fix Extended Scanning for slow CPUs #98267

Bluetooth: Controller: Fix Extended Scanning for slow CPUs #98267

cvinayak commented Oct 26, 2025

Uh oh!

cvinayak commented Oct 26, 2025

Uh oh!

cvinayak commented Oct 26, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Oct 26, 2025

Uh oh!

cvinayak commented Oct 27, 2025

Uh oh!

sonarqubecloud bot commented Oct 27, 2025

Uh oh!

Thalley Nov 4, 2025

Uh oh!

cvinayak Nov 8, 2025

Uh oh!

Thalley Nov 10, 2025

Uh oh!

Thalley Nov 4, 2025

Uh oh!

Thalley Nov 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		const uint8_t latency_skip_max = 3U;
		static uint8_t latency_skip_countdown = latency_skip_max;

Bluetooth: Controller: Fix Extended Scanning for slow CPUs #98267

Are you sure you want to change the base?

Bluetooth: Controller: Fix Extended Scanning for slow CPUs #98267

Conversation

cvinayak commented Oct 26, 2025

Uh oh!

cvinayak commented Oct 26, 2025

Uh oh!

cvinayak commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Oct 26, 2025

Choose a reason for hiding this comment

Uh oh!

cvinayak commented Oct 27, 2025

Uh oh!

sonarqubecloud bot commented Oct 27, 2025

Quality Gate passed

Uh oh!

Thalley Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

cvinayak Nov 8, 2025

Choose a reason for hiding this comment

Uh oh!

Thalley Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

Thalley Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Thalley Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cvinayak commented Oct 26, 2025 •

edited

Loading