Skip to content

Commit

Permalink
Add more on flash issues, clarify free space on bootloader
Browse files Browse the repository at this point in the history
Let me "clarify" the free flash numbers; actually, there's significantly less flash left than we said there was.

Add to optiboot_dx an ugly hack if you end up in the perfect storm of settings that we know will overflow the flash we have by 2 bytes.
Relates to #490
  • Loading branch information
SpenceKonde committed Oct 18, 2023
1 parent e41f7d2 commit 2baae84
Show file tree
Hide file tree
Showing 2 changed files with 106 additions and 13 deletions.
75 changes: 69 additions & 6 deletions megaavr/bootloaders/optiboot_dx/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,12 +24,12 @@ This is being maintained as part of DxCore, so issues be reported to:
## Known issues
There are no known issues at this time (other than the fact that there is no EEPROM support That is because it does not fit. It might fit if we didn't need to buffer pages and could write data as it came in, but because we don't know we're getting a program page command until the fire hose of data has been turned on, we can't get rid of that so easily. It was a design decision to not lock in a 1024 byte bootloader section just to get EEPROM write capability; and the consequences are particularly serious on modern AVRs which cannot tolerate )

Available "slack" in the binaries is extremely limited.
* 128k Dx: 6-7 instruction words (12b)
* 64k Dx: 14 instruction words (14b)
* 64k DD: 14 instruction words (28b)
* 32k Dx: 18 instruction words (36b)
* 32k DD: 17 instruction words (34b)
Available "slack" in the binaries is extremely limited - worst case values are:
* 128k Dx: 2 instruction word (4b)
* 64k Dx: 7 instruction words (14b)
* 64k DD: 10 instruction words (20b)
* 32k Dx: 13 instruction words (26b)
* 32k DD: 13 instruction words (26b)

LED Pins
* DA/DB, all: PA7
Expand Down Expand Up @@ -182,6 +182,9 @@ Since the 64k parts have no program memory locations above the first 64k, RAMPZ

It is plausible that with more aggressive optimization, it might be possible to make room for EEPROM writes on 64k and smaller flash. It would take a miracle to do that on 128k parts.

### Change 10/17/23:
As of 10/17/2023, users building custom versions of optiboot_dx can specify baud rates lower than 62750 now even for 128k parts. This is done by, if (and only if, for readability sake) we are using the ALT serial port (costs 3 words) AND we have 128k of flash (costs 6 words) AND the baud rate is below 62750. It works by replacing the UART initialization code with a little piece of asm, and explicitly prepares a pointer for it. This saves 2 words of flash, and was the easiest way to get these configurations to build. The price is a blob of asm to do a trivial task just to save 4 bytes.

### Wastes of space
There are a few points at which the compiler generates terrible, terrible code.

Expand Down Expand Up @@ -452,3 +455,63 @@ It is worth noting that this is not what typical compiled sketches are full of.
6: 81 11 cpse r24, r1 < test for empty RSTFR
142: df 12 cpse r13, r31
```

So, what can be done?

* Obviously, the places marked "Pathological" can be remodeled, saving 6 words as described above. Same for the WTF's which might be 2 more.
* Actually wringing these out is much harder than looking at the asm listing and seeing what it's doing stupidly and what better code would look like
* The datasheet *does not specify* that you need to set the TX pin output. Certainly you must not set it output if you use ODME per errata on many parts.
* ODME is on CTRLB, so if combined with the existing write, could set ODME for free.
* Thus, even if in normal mode, you don't get any output, since serial adapters almost invariably have a weak internal pullup on their RX, you should be able to make it work by simply setting ODME, since you don't care about the state of the TX pin if you don't have anything connected to it.
* This saves 1 word only, because it just gets rid of a sbi.
* There are a number of issues in the quality of the compiled code in general:
* As usual, it stumbles when you're trying to fill a word one byte at a time, as mentioned in the pathological code - that is obviously wrong.
* It's not clear how it's using r28-r29 - in some cases they appear to be using it as a scratch register.
* Could that register be used better by statically allocating it? Does it hold anything of value when it is getting pushed and popped? Why that register?
* Registers used also reveals a considerable amount of unused slack that could be sacrificed for flash.

### Registers used and free
Bootloader space optimization might take advantage of the large number of unused registers for an advantage. Here's what the usage list looks like for a 128kb AVR Dx (which is where space problems are most acute):

#### Fixed Lower Working Registers
* r0 - the tempreg is used ONLY while actually writing the flash (you load the value to write into r0 and r1, then execute SPM). Doesn't get touched from anywhere else!
* r1 - Zero except when writing the flash as above.

#### Used Lower Working Registers
* r8 - Used very lightly
* r9 - Used very lightly
* r10 - r10 and r11 somewhat lightly used
* r11
* r12 - r12 and r13 used pretty heavily.
* r13
* r14 - r14 and r15 combined used only 4x, sometimes as a single reg, other times not.
* r15

#### Used Upper Working Registers
* r16 - r16 and r17 used 8 times combined.
* r17 -
* r18 - r18 and r19 used 8 times combined.
* r19 -
* r20 - Only used in a single place. To ferry a single value around. No need to use this reg.
* r24 w - compilers favorite register! 67/174 accesses!
* r25 w - Used 19 times, plus the times it was a register pair with r24.
* r26 X - "huh, only used once?" "psst, it's the X pointer, it's used right and left" "Oh, duh..."
* r27 X - (not explicitly, but accessed by movw is used with r26 and when using the X pointer)
* r28 Y
* r29 Y
* r30 Z
* r31 Z

#### Registers Currently Looking for Work
* r0 - (very nearly - it gets trashed when the flash is actually written)
* r2
* r3
* r4
* r5
* r6
* r7
* r21 - Well, will ya lookit that, unused high registers!
* r22
* r23

That's 4 wholly unused register pairs and a single, and the single and one pair is high. Thus, the only resources we are short on is flash and of course pointer registers - though we are a little bit tight on flash on the 128k bootloader, this situation has considerably more hope that it might have, as the compiler is showing these obvious pathologies, making it's code particularly bad, which is also easier to fix, and there's plenty of register maneuvering room.
44 changes: 37 additions & 7 deletions megaavr/bootloaders/optiboot_dx/optiboot_dx.c
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,9 @@
unsigned const int __attribute__((section(".version"))) __attribute__((used))
optiboot_version = 256 * (OPTIBOOT_MAJVER + OPTIBOOT_CUSTOMVER) + OPTIBOOT_MINVER;

#define xstr(s) str(s)
#define str(s) #s


#include <inttypes.h>
#include <avr/io.h>
Expand Down Expand Up @@ -265,6 +268,10 @@ typedef union {
#error Unachievable baud rate (too slow) BAUD_RATE
#endif // baud rate slow check

#if BAUD_SETTING_4 > 255
#define BAUD_SETTING_L xstr(BAUD_SETTING_4 % 256)
#define BAUD_SETTING_H xstr(BAUD_SETTING_4 / 256)
#endif
/*
* Watchdog timeout translations from human readable to config vals
*/
Expand Down Expand Up @@ -462,13 +469,38 @@ int main (void) {
// Why the compiler wastes a word on the ldi 0 is a mystery to me, since it knows it's got a 0 right there in r1 ready to go...
#if (BAUD_SETTING_4 < 256)
MYUART.BAUDL = BAUD_SETTING_4;
//MYUART.DBGCTRL = 1; // run during debug
MYUART.CTRLC = (USART_CHSIZE_gm & USART_CHSIZE_8BIT_gc); // Async, Parity Disabled, 1 StopBit
//MYUART.CTRLA = 0; // Interrupts: all off
MYUART.CTRLB = USART_RXEN_bm | USART_TXEN_bm;
#else
MYUART.BAUD = BAUD_SETTING_4;
/* Aw shit, do we have a problem??*/
#if PROGMEM_SIZE > 65536 && (defined(MYUART_PMUX_VAL) && MYUART_PMUX_VAL != 0)
/* yeah, we have a problem. This bootloader doesn't fit in this size flash - we only have 2
* words left, but setting another SFR would take 3 words. Here we take truly desperate measures
* To guarantee that it will render this using a pointer and std. Setting up the pointer takes 2
* words, but we then access it 4 times, previously with lds. Every time we swap an lds for an ldd
* we save 1 word, hence the net is -2 registers.
*/
uint16_t uartptr = (uint16_t)(&MYUART); // We need a pointer to the UART.
__asm__ __volatile__("" // Initialize the USART
"std %a0+8, %A1" "\n\t" // Store at ptr + 8 - BAUD_L.
"std %a0+9, %B1" "\n\t" // Store at ptr + 9 - BAUD_H.
"ldi r21, 0x03" "\n\t" // load (USART_CHSIZE_8BIT_gc - rest of byte is 0).
"std %a0+7, r21" "\n\t" // Store to ptr + 7 - CTRLC.
"ldi r21, 0xC0" "\n\t" // USART_RXEN_bm | USART_TXEN_bm.
"std %a0+6, r21" "\n\t" // Store to ptr + 6 - CTRLB, this enables the UART.
:"+b" ((uint16_t)uartptr) // This is the all important pointer to uart
:"r"((uint16_t)BAUD_SETTING_4) // The baud setting has to get there somehow
:"r21"); // We clobber r21;
#else // On 64k and less we can do this without that ugly hack,
MYUART.BAUD = BAUD_SETTING_4;
//MYUART.DBGCTRL = 1; // run during debug
MYUART.CTRLC = (USART_CHSIZE_gm & USART_CHSIZE_8BIT_gc); // Async, Parity Disabled, 1 StopBit
//MYUART.CTRLA = 0; // Interrupts: all off - Unnecessary! We ensured that the chip was freshly reset so this is already 0.
MYUART.CTRLB = USART_RXEN_bm | USART_TXEN_bm;
#endif
#endif
//MYUART.DBGCTRL = 1; // run during debug
MYUART.CTRLC = (USART_CHSIZE_gm & USART_CHSIZE_8BIT_gc); // Async, Parity Disabled, 1 StopBit
//MYUART.CTRLA = 0; // Interrupts: all off
MYUART.CTRLB = USART_RXEN_bm | USART_TXEN_bm;

#if (LED_START_FLASHES > 0) || defined(LED_DATA_FLASH) || defined(LED_START_ON)
/* Set LED pin as output */
Expand Down Expand Up @@ -867,8 +899,6 @@ static inline void read_flash(addr16_t address, pagelen_t length)
* This can always be removed or trimmed if more actual program space
* is needed in the future. Currently the data occupies about 160 bytes,
*/
#define xstr(s) str(s)
#define str(s) #s
#define OPTFLASHSECT __attribute__((section(".fini8")))
#define OPT2FLASH(o) OPTFLASHSECT const char f##o[] = #o "=" xstr(o)

Expand Down

0 comments on commit 2baae84

Please sign in to comment.