Feature: Alternate Manchester SWO implementation using DMA #1960

ssimek · 2024-10-13T10:17:46Z

Fast DMA-based SWO Manchester decoding

Detailed description

The current interrupt-based SWO decoding suffers (IMO) from several issues:

- it is limited to a ~100 kHz signal (~50 kbit/s raw, i.e. ~25 kbit/s of decoded SWO console output)
  - while this is usually sufficient for text-based debug output, it gets worse when more data needs to be sent out/graphed/etc
  - it also unnecessarily impacts performance of the tested code, because the MCU has to wait for the ITM to be available
- a high-frequency signal hangs the probe due to an overload of interrupt requests
  - this happens easily when starting with new hardware before the init code is debugged

This PR contains a full reimplementation of the SWO Manchester decoder using DMA that makes capture a lot more efficient:

- all edge times of the signal are captured using a timer
- DMA is used to record the timings into a circular buffer
- the buffer is periodically processed in batches, transforming the edge stream into a byte stream that is placed in another circular buffer for sending, resulting in an effective processing time per sample on the order of several clock cycles
- the output buffer is processed in a lower-priority ISR as time permits
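
To illustrate the first stages of this pipeline, here is a minimal sketch (not the code from this PR) of turning captured edge timestamps from a circular buffer into pulse widths. The buffer size, the function name and the 16-bit timer width are assumptions made for the example; a real decoder would go on to classify each width as a half or full bit period.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EDGE_BUFFER_SIZE 8U /* hypothetical size, must be a power of two */

/*
 * Convert timer capture values held in a circular buffer into pulse widths.
 * Unsigned 16-bit subtraction absorbs timer wraparound as long as every
 * pulse is shorter than 65536 timer ticks.
 * Returns the number of widths written to `widths`.
 */
static size_t edges_to_widths(const uint16_t *edges, size_t read_index, size_t write_index,
	uint16_t previous_edge, uint16_t *widths)
{
	assert(read_index < EDGE_BUFFER_SIZE && write_index < EDGE_BUFFER_SIZE);
	size_t count = 0U;
	while (read_index != write_index) {
		const uint16_t edge = edges[read_index];
		widths[count++] = (uint16_t)(edge - previous_edge);
		previous_edge = edge;
		/* Advance with wraparound; the mask works because the size is a power of two */
		read_index = (read_index + 1U) & (EDGE_BUFFER_SIZE - 1U);
	}
	return count;
}
```

Processing a whole batch in one loop like this is what keeps the per-sample cost low: there is no interrupt entry and exit for every edge.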

The result is the ability to process an SWO signal of up to ~3 MHz, with the probe being more resilient against higher-frequency signals (it just fails to decode them properly). Continuous streaming at these speeds obviously makes USB a bottleneck (1.5 Mbit/s = ~180 kB/s, with the USB controller seemingly able to handle only about 70 kB/s without double buffering). But for reasonably intermittent flows, one big benefit is the lightened load on the target.
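
The throughput figures can be checked with a little arithmetic. The sketch below assumes, as the numbers in this description imply, that the raw bit rate is half the quoted signal frequency; the function names are made up for the example.

```c
#include <stdint.h>

/* Raw bit rate, assuming one bit per two cycles of the quoted signal frequency */
static uint32_t swo_raw_bits_per_second(uint32_t signal_hz)
{
	return signal_hz / 2U;
}

/* Raw byte rate that would have to cross USB for a continuous stream */
static uint32_t swo_raw_bytes_per_second(uint32_t signal_hz)
{
	return swo_raw_bits_per_second(signal_hz) / 8U;
}
```

For a 3 MHz signal this gives 1 500 000 bit/s, i.e. 187 500 bytes/s, which is well above the roughly 70 kB/s quoted for the USB controller.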

Your checklist for this pull request

I've read the Code of Conduct
I've read the guidelines for contributing to this repository
It builds for hardware native (see Building the firmware)
- it does, but it doesn't fit, and neither did the original one, at least not with the GCC 13.3 I'm using. It fits with LTO, but then the GDB server doesn't work for some reason. I was developing it with a few targets removed so that it fit.
It builds as BMDA (see Building the BMDA)
I've tested it to the best of my ability
My commit messages provide a useful short description of what the commits do

Closing issues

None that I'm aware of

dragonmux

We haven't got too far into reviewing the DMA code, as the review notes we already have are going to be a fair bit of work, and reviewing the rest of the implementation will be easier once they are resolved. We're therefore submitting this as an early review.

dragonmux · 2024-10-13T10:26:13Z

src/platforms/common/stm32/swo.c

@@ -58,7 +58,6 @@ bool swo_itm_decoding = false;
 uint8_t *swo_buffer;
 uint16_t swo_buffer_read_index = 0U;
 uint16_t swo_buffer_write_index = 0U;
-_Atomic uint16_t swo_buffer_bytes_available = 0U;


This might have seemed redundant, but please consider what happens when the two indices are equal to each other after the buffer fills up - this particular bug has happened when a target is particularly bursty at higher frequencies, and results in a full buffer of data getting dropped when only checking that the indices aren't equal. This is why this variable was introduced.

ssimek · 2024-10-16

I have. It's not about redundancy - other than the (not insignificant) performance penalty, maintaining the extra variable introduces a higher chance of getting things wrong. When swo_buffer_bytes_available crossed the size of the buffer, it wasn't handled either - it just sent the buffered data twice. Just using the indexes, on the other hand, tends to "drop" the full buffer. The stream is corrupted in both cases, but I'd argue sending less data and recovering faster is preferable to sending more bad data.
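
The ambiguity under discussion can be reproduced with a small stand-alone sketch. This is not the project's buffer code; `ring_s`, the function names and the buffer size are hypothetical. The checked variant shows one common alternative to a separate byte counter (keeping one slot free), which is not necessarily what either side of this thread proposes.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8U /* hypothetical size, must be a power of two */

typedef struct ring {
	uint8_t data[RING_SIZE];
	uint16_t read_index;
	uint16_t write_index;
} ring_s;

/* Fill level derived from the two indices alone */
static uint16_t ring_bytes_available(const ring_s *ring)
{
	return (uint16_t)((uint16_t)(ring->write_index - ring->read_index) & (RING_SIZE - 1U));
}

/* Unchecked write: after RING_SIZE writes the indices are equal again and the buffer looks empty */
static void ring_write_unchecked(ring_s *ring, uint8_t value)
{
	ring->data[ring->write_index] = value;
	ring->write_index = (uint16_t)((ring->write_index + 1U) & (RING_SIZE - 1U));
}

/* Checked write: refuse the byte once only one slot is free, so full and empty stay distinguishable */
static bool ring_write_checked(ring_s *ring, uint8_t value)
{
	if (ring_bytes_available(ring) == RING_SIZE - 1U)
		return false;
	ring_write_unchecked(ring, value);
	return true;
}
```

With the unchecked writer, a burst of exactly `RING_SIZE` bytes leaves `ring_bytes_available()` at 0, so the whole buffer is silently dropped. With the checked writer the last byte of the burst is rejected instead, and the overflow is visible to the caller.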

dragonmux · 2024-10-13T10:29:18Z

src/platforms/common/stm32/swo_uart.c

@@ -130,13 +130,11 @@ void swo_uart_deinit(void)
 {
 	/* Disable the UART and halt DMA for it, grabbing the number of bytes left in the buffer as we do */
 	usart_disable(SWO_UART);
-	const uint16_t space_remaining = dma_get_number_of_data(SWO_DMA_BUS, SWO_DMA_CHAN);


The TRM didn't make it clear whether disabling the channel resets the value read back to the initially programmed value - this will need testing, particularly on a GD32F1-based probe, to ensure this doesn't cause an unforced error.

ssimek · 2024-10-16

I'd argue it cannot; it would violate a basic principle of DMA. Disabling the channel doesn't do anything, not even stop pending operations; it just masks future requests. Re-enabling it has to continue where it left off.

I have moved it below just to make sure transfers are not going to continue for the few cycles after the readout, but I can keep it where it is.
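
For context, the value returned by `dma_get_number_of_data()` is what tells the driver how far the DMA has written into the buffer. A minimal sketch of that conversion, assuming the counter counts down from the programmed length and reloads in circular mode, could look like this (the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate the DMA channel's remaining-transfers counter into the index of
 * the next element the DMA will write. A full counter (nothing transferred
 * yet, or just reloaded) maps to index 0.
 */
static uint16_t dma_circular_write_index(uint16_t buffer_size, uint16_t transfers_remaining)
{
	assert(buffer_size > 0U && transfers_remaining <= buffer_size);
	return (uint16_t)((buffer_size - transfers_remaining) % buffer_size);
}
```

If disabling the channel did reset the counter to its programmed value, this would yield index 0 and the tail of the buffer would be lost, which is the risk the review comment asks to have tested.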

dragonmux · 2024-10-15T14:59:53Z

src/platforms/native/Makefile.inc

Please include these changes in the Meson build system too.

ssimek · 2024-10-16

Tried to; hopefully it works.

dragonmux · 2024-10-15T15:02:36Z

src/platforms/common/stm32/swo_manchester_dma.c

+// a proper break in the pulse sequence
+//#define SWO_ADVANCED_RECOVERY	1
+
+#define FORCE_INLINE	inline __attribute__((always_inline))


Do not do this. Let the compiler choose by just saying inline. This actively pessimises the inlining vs code size balance and should only be done when absolutely required. Let the compiler do its job otherwise please.

ssimek · 2024-10-16

Found another way to do this, but I cannot agree with the general sentiment :) I also prefer not to use force inlining, but in this case the compiler chose not to inline even when inlining made the code several times faster and smaller. There is a reason this attribute exists.

It might be unfortunate, but writing this kind of extreme performance code is an endless loop of making the compiler generate the assembly one would write on their own (while still being portable to other CPU architectures, maybe with less optimal performance). At the same time, it is also true the compiler is better at ordering the instructions to make better use of various CPU pipeline stages which makes it generally faster than handwritten assembly, not to mention more maintainable. But the optimization heuristics are simply not able to understand the intent/requirements in all cases.
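
For readers unfamiliar with the distinction being argued here, this is a small sketch of the two forms in GCC/Clang C. The helper functions are hypothetical and only serve to show the syntax; whether forcing is justified depends on inspecting the generated code, as described above.

```c
#include <stdint.h>

/* The macro from the diff: asks GCC/Clang to inline regardless of their cost heuristics */
#define FORCE_INLINE inline __attribute__((always_inline))

/* Plain `inline` is only a hint; the compiler decides per call site */
static inline uint16_t pulse_width_hint(uint16_t edge, uint16_t previous_edge)
{
	return (uint16_t)(edge - previous_edge);
}

/* With always_inline the body is inlined into its callers */
static FORCE_INLINE uint16_t pulse_width_forced(uint16_t edge, uint16_t previous_edge)
{
	return (uint16_t)(edge - previous_edge);
}
```

Both functions compute the same result; the difference is only in who makes the inlining decision.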

dragonmux · 2024-10-15T15:03:54Z

src/platforms/common/stm32/swo_manchester_dma.c

+	uint16_t rx, t, q;
+	uint8_t s;
+	int32_t b;


Please use expressive variable names - what are these, what do they mean?! We are not limited on variable name length, so please use the name to describe the purpose.

It might seem a little harsh, but the idea is that you should be describing to the reader, who might be yourself in 6 months, what the intent of all this is and what these mean/do so it can be maintained and adjusted if there are bugs found. That's just not practical with single-letter variable names which is why this is a clang-tidy lint. Every attempt to read this code and figure out what it does becomes a reverse engineering exercise otherwise.

ssimek · 2024-10-16

Kind of a matter of opinion, and these "variables" are very clearly documented inside the main processing function; this is just a stash to store the state in the meantime. I'd argue that C-style long variable names make code very hard to read because of the abysmal signal-to-noise ratio.

Anyway, changing this, I don't wish to force my will on the project :)

dragonmux · 2024-10-15T15:05:21Z

src/platforms/common/stm32/swo_manchester_dma.c

+	if ((*USB_EP_REG(SWO_ENDPOINT) & USB_EP_TX_STAT) != USB_EP_TX_STAT_VALID)
+	{
+		swo_send_buffer(usbdev, SWO_ENDPOINT);
+	}


The braces are not necessary, please drop them. Please also run this file through clang-format.

ssimek · 2024-10-16

It has been run through clang-format. Also, I'm very hesitant to see code without braces - remember goto fail;?

ssimek · 2024-10-17

clang-format is now really applied :) I incorrectly presumed make clang-format does it, since it changed a ton of files, but it only goes two levels deep.

dragonmux · 2024-10-15T15:15:39Z

Note: We are aware of the breakage on the lint pass from GitHub's deployment of Ubuntu 24.04 LTS; we will get this fixed in the meantime.

Edit: This has now been fixed. When you next rebase this PR on main, the fix will automatically get pulled in and used.

ssimek mentioned this pull request Oct 13, 2024

Feature: Alternate Manchester SWO implementation using DMA #1958

Closed

6 tasks

ssimek force-pushed the feature/fast-dma-traceswo branch from 513b832 to 669659e Compare October 15, 2024 06:31

dragonmux requested changes Oct 15, 2024

ssimek force-pushed the feature/fast-dma-traceswo branch 4 times, most recently from 0a26202 to 7b87080 Compare October 18, 2024 08:11

ssimek added 6 commits October 18, 2024 10:11

common/stm32/swo: remove reduntant buffer available bytes tracking

32fe2cc

common/stm32/swo_manchester: add an alternate Manchester capture implementation using DMA

c63dbf5

common/stm32/swo_manchester: improve implementation, update naming

602db58

common/stm32/swo_uart: partially restore order of deinit operations

99a6103

native: replace swo_manchester with swo_manchester_dma in Meson build system

3a987d6

common/stm32/swo_manchester: fix circular buffer handling in swo_buffer_data

b19e2cb

ssimek force-pushed the feature/fast-dma-traceswo branch from 7b87080 to 45fa106 Compare October 18, 2024 08:11

common/stm32/swo: apply clang-format to all touched files

37d8ba2

ssimek force-pushed the feature/fast-dma-traceswo branch from 45fa106 to 37d8ba2 Compare October 18, 2024 11:07
