
Revert linker command file fixes: .bss/.noinit sections, z_mapped_start marker #61595


Merged
fabiobaltieri merged 3 commits into zephyrproject-rtos:main from arm-rollback on Aug 17, 2023

Conversation

@fabiobaltieri (Member)

Revert #60368, as it's breaking a few tests; see #61572.

The three reverted commits:

… binary"

This reverts commit 36997de, as that's breaking a few tests in CI; see zephyrproject-rtos#61572.

Signed-off-by: Fabio Baltieri <fabiobaltieri@google.com>

…ion"

This reverts commit 9e9e60b.

Signed-off-by: Fabio Baltieri <fabiobaltieri@google.com>

…d file changes"

This reverts commit eb731ab.

Signed-off-by: Fabio Baltieri <fabiobaltieri@google.com>
@zephyrbot added the "Release Notes" (to be mentioned in the release notes) and "area: ARM" (ARM 32-bit architecture) labels on Aug 17, 2023
@fabiobaltieri fabiobaltieri assigned ibirnbaum and unassigned jhedberg Aug 17, 2023
@fabiobaltieri changed the title from "Revert https://github.com/zephyrproject-rtos/zephyr/pull/60368" to "Revert #60368" on Aug 17, 2023
@fabiobaltieri added the "Hotfix" label (fix for issues blocking development, i.e. CI issues, tests failing in CI, etc.) on Aug 17, 2023
@microbuilder (Member) left a comment:
Thanks @fabiobaltieri ... I think reverting these is the best choice until we can sort out the issues in CI.

@nashif changed the title from "Revert #60368" to "Revert linker command file fixes: .bss/.noinit sections, z_mapped_start marker" on Aug 17, 2023
@ibirnbaum (Member) commented Aug 17, 2023

@SgrrZhf commented in #61572 that his observation is that __kernel_ram_start points to address 0x0. This doesn't surprise me and I think that it should already have been like this before. The move of the .bss and .noinit sections within the image shouldn't have any influence on this. I'll look into the pre/post diffs once again, but here's a general note on this observation:

The DTS for the ZynqMP (used in qemu_cortex_r5), the fvp_baser_aemv8r.dts mentioned by SgrrZhf, and probably a few more all specify the memory range used as RAM to start at address 0x0. From my point of view, this isn't valid (or is at least unfortunate) in general, and in particular doesn't work on an actual UltraScale chip unless the system is configured in a particular way.

Address 0x0 is, I'd guess in the majority of cases, the location of the exception vectors. As far as I can tell, there are two options for having the vectors at a different location. One is the HIVECS bit, which, at least on the Zynq and ZynqMP, requires a non-default memory configuration in order to enable memory at the alternative high vectors address. The other is using the VBAR register for a custom location, which the Cortex-A has but the aarch32 Cortex-R doesn't (there's something similar in Cortex-M, too). For any non-Cortex-M, non-ARM64-Cortex-R CPU, the HIVECS bit always gets cleared in relocate_vector_table(), btw.
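A minimal sketch of the two relocation mechanisms described above, expressed as AArch32 CP15 accesses; this is illustrative only, not Zephyr's actual relocate_vector_table() implementation:

```c
#include <stdint.h>

#define SCTLR_V_BIT (1U << 13) /* SCTLR.V aka HIVECS: vectors at 0xFFFF0000 when set */

/* Option 1: clear HIVECS so vectors are fetched from 0x00000000 */
static inline void vectors_use_low_address(void)
{
	uint32_t sctlr;

	__asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
	sctlr &= ~SCTLR_V_BIT;
	__asm__ volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr));
}

/* Option 2: point VBAR at a custom vectors location. Only possible on
 * cores that implement VBAR (e.g. Cortex-A); the aarch32 Cortex-R has
 * no such register, which is the limitation discussed above.
 */
static inline void vectors_set_vbar(const void *base)
{
	__asm__ volatile("mcr p15, 0, %0, c12, c0, 0" :: "r"(base));
}
```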

The standard way of doing things is to have the vectors at the very beginning of the Zephyr image in RAM. This allows booting a stripped, non-ELF binary without entry point information (when copied to the right location), as the first instruction is the reset vector jump, which takes you to __start. Regardless of whether we reach __start this way or via its location in the ELF header, during early boot the vectors are copied to their target location in RAM. At least for Cortex-A, the vectors section is padded to 4k due to the MMU page size. If the binary is loaded at address 0x0, the vectors are effectively copied over themselves.
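A rough sketch of the early-boot copy described here; the _vector_start/_vector_end symbol names follow Zephyr's linker conventions, but the function itself is hypothetical and simplified:

```c
#include <string.h>

/* Bounds of the vectors section as placed in the loaded image,
 * provided by the linker command file.
 */
extern char _vector_start[];
extern char _vector_end[];

void copy_vectors_to_target(void)
{
	char *dst = (char *)0x00000000; /* active vectors location (HIVECS clear) */

	/* When the image itself is loaded at 0x0, src == dst and the
	 * vectors are copied over themselves, as noted above.
	 */
	if (dst != _vector_start) {
		memcpy(dst, _vector_start, (size_t)(_vector_end - _vector_start));
	}
}
```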

The vectors section should always have been covered by __kernel_ram_start. With regard to that section, the z_mapped_start marker was the relevant fix: it prevents the MMU on the Cortex-A from handing out the vectors page at the RAM start address as dynamically mapped memory at run-time. The following sections (.text, .data, etc.) have always been covered by z_mapped_start.

I'd say that in general, having the actual vectors location at 0x0 overlap with the base of the image in RAM is unfortunate, as, at least on Xilinx chips, this wouldn't work in real life (QEMU simply doesn't care about the validity of your memory layout, aside from many other things, at least for the UltraScale).

The Zynq-7000 has an offset of 0x00100000 for the RAM; the memory at address 0x0 is the On-Chip RAM (which has to be explicitly re-configured in case it should rather show up at the HIVECS location), and there's a gap between them. On the UltraScale, there are two Tightly Coupled Memory blocks starting at address 0x0; once again, only the minimum required to have working memory for the vectors at 0x0 is enabled by default. If you locate the image starting at 0x0, there's a black hole starting at either 0x10000 or 0x20000, depending on the TCM configuration. Also, there's a mode in which the two TCM blocks only have half their size but are transparently accessible under the same address core-locally. In that case, there are two black holes within the TCM range before RAM starts behind it.

There's the ancient #26577 addressing this issue for the UltraScale (the Zynq-7000 DTS correctly specifies the start of RAM at 0x00100000). At that time, the generic memory-region nodes didn't exist yet, so maybe it's about time to re-declare the ZynqMP's TCMs as off limits for normal RAM use.

On the one hand, I'd prefer some spatial separation between the vectors' zero-page and the start of the image in general; copying the vectors over themselves doesn't look pretty. I'd rather declare the active, live vectors page in memory as a separate memory node in the DTS, so that everything but the initial vector copy stays out of there and that memory location isn't covered by any linker-generated markers referring to the regular contents of the Zephyr image. On the other hand, I think that on most real-life SoCs, the memory at address 0 is special in terms of its technical implementation (for example, on the Zynq-7000 it already works even if the PLLs, and therefore the external RAM, haven't been set up yet), and there's no guarantee of contiguous memory from address 0x0 all the way into regular RAM.

I've just moved the RAM base address in the ZynqMP DTS upwards a little. Let's see if this helps with the problematic test case. If that's the case, I'll look into whether the issue is the address 0 and if __kernel_ram_start has always pointed there. I also can't exclude an issue within the MPU code due to the modified section layout (but probably with an identical zero __kernel_ram_start) yet.
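For illustration, a hypothetical shape of such a DTS change; the node, label, base, and size here are made up, not the actual ZynqMP diff:

```dts
/* Move sram0 off address 0x0 so the image no longer overlaps the
 * vectors/TCM range (base and size are illustrative only).
 */
sram0: memory@100000 {
	compatible = "mmio-sram";
	reg = <0x00100000 0x03f00000>;
};
```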

I'll check once again what the situation was like before regarding __kernel_ram_start and what the MPU does with it.

@ibirnbaum (Member) commented Aug 17, 2023

I've dabbled back and forth a bit so far, and I think that the whole issue with memory protection test(s) failing isn't exclusively down to my changes to the linker command file. With the linker command file from before my modifications, the following test cases (i.e., all the ones from mem_protect which aren't skipped but actually run) fail:

INFO - 1) tests/kernel/mem_protect/futex/kernel.futex on qemu_cortex_r5 failed (Timeout)
INFO - 2) tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace on qemu_cortex_r5 failed (Timeout)
INFO - 3) tests/kernel/mem_protect/syscalls/kernel.memory_protection.syscalls on qemu_cortex_r5 failed (Timeout)
INFO - 4) tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem on qemu_cortex_r5 failed (Timeout)
INFO - 5) tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem.nouser on qemu_cortex_r5 failed (Timeout)
INFO - 6) tests/kernel/mem_protect/protection/kernel.memory_protection.protection on qemu_cortex_r5 failed (Timeout)
INFO - 7) tests/kernel/mem_protect/obj_validation/kernel.memory_protection.obj_validation on qemu_cortex_r5 failed (Timeout)
INFO - 8) tests/kernel/mem_protect/mem_protect/kernel.memory_protection on qemu_cortex_r5 failed (Timeout)
INFO - 9) tests/kernel/mem_protect/stack_random/kernel.memory_protection.stack_random on qemu_cortex_r5 failed (Timeout)
INFO - 10) tests/kernel/mem_protect/stackprot/kernel.memory_protection.stackprot on qemu_cortex_r5 failed (Timeout)

... if, in the ZynqMP's DTS, the sram0 memory region begins anywhere other than 0. So it seems there's some code in the MPU setup that only works with a RAM base fixed at 0x0. This shouldn't happen since, as mentioned above, it's highly unlikely that all real-life targets support that address as the RAM base.

@ibirnbaum (Member) commented Aug 17, 2023

Okay, I admit it: __kernel_ram_start wasn't at 0 before, but at the start of the .bss section. With the old linker command file, a different RAM base address alone makes the difference between a complete, successful test run and an early boot crash. I can see the offset in the map file; relative to the RAM base, __kernel_ram_start stays the same. Checking what causes the crash with a non-zero RAM base now.

EDIT: the mpu_config struct has 4 entries in the testcase I'm currently debugging, two of them being "SRAM_PRIV" and "SRAM". There seems to be no association between them and the device tree: they always have a base address of 0x0, regardless of what the "zephyr,sram" chosen entry in the active DT specifies. "SRAM" ends up with a size of 2 MB, while "SRAM_PRIV" has a 2 GB size value in the MPU's region table. Both values seem somewhat odd considering that sram0 is specified in the DT as 64 MB.

EDIT2: Yup, for the ZynqMP, the zero base address is hardcoded. This /probably/ should be fixed. The MPU region decls for fvp_aemv8r look clean, though. Fixing the hardcoded values in the ZynqMP locally for now so that I can continue looking into how the old and the new linker command file impact the MPU setup.
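To make the hardcoding concrete, a hedged sketch using Zephyr's ARM MPU region macros; the entries and the header path are illustrative, not the actual contents of the ZynqMP's mpu_regions.c:

```c
#include <zephyr/devicetree.h>
#include <zephyr/sys/util.h>
/* Header path varies between trees; the aarch32 MPU definitions provide
 * MPU_REGION_ENTRY(), REGION_RAM_ATTR(), and struct arm_mpu_region/config.
 */
#include <zephyr/arch/arm/mpu/arm_mpu.h>

static const struct arm_mpu_region mpu_regions[] = {
	/* Roughly what the debugger shows today: base 0x0 hardcoded,
	 * sizes unrelated to the 64 MB sram0 node in the DT.
	 */
	MPU_REGION_ENTRY("SRAM_PRIV", 0x00000000, REGION_RAM_ATTR(REGION_2G)),

	/* A DT-coupled alternative could take the base from the
	 * zephyr,sram chosen node instead (size macro still illustrative,
	 * since v7 MPU region sizes must be powers of two).
	 */
	MPU_REGION_ENTRY("SRAM", DT_REG_ADDR(DT_CHOSEN(zephyr_sram)),
			 REGION_RAM_ATTR(REGION_64M)),
};

const struct arm_mpu_config mpu_config = {
	.num_regions = ARRAY_SIZE(mpu_regions),
	.mpu_regions = mpu_regions,
};
```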

@ibirnbaum (Member) left a comment:

ZynqMP seems to be broken regarding the MPU setup vs. the RAM base address anyway; this needs to be fixed first. I'm investigating further into what goes wrong with the updated linker command file using a manually fixed MPU ranges list for the ZynqMP, and will keep updating here. If there's anything wrong with the MPU code, that's another PR before #60368 can be re-applied.

@fabiobaltieri fabiobaltieri merged commit 83e4ed1 into zephyrproject-rtos:main Aug 17, 2023
@fabiobaltieri fabiobaltieri deleted the arm-rollback branch August 17, 2023 19:56
@fabiobaltieri (Member, Author)

@ibirnbaum many thanks for the investigation work.

@ibirnbaum (Member)

@fabiobaltieri @microbuilder I've looked into this issue a little bit further, although I haven't reached a conclusion yet. What I can say is that with the RAM base address moved to 1 MB (in the DTS and hardcoded in the ZynqMP MPU init code), the tests that caused the CI to fail on qemu_cortex_r5 do actually pass on actual ZynqMP hardware - with both the old linker command file and the changes proposed in #60368.

I do believe that qemu_cortex_r5 needs an overhaul regarding the MPU setup in general: not only is the RAM base address currently hardcoded, but .rodata isn't considered at all, so RAM is always mapped full-access. A few items need to be tied to both the sections contained in the binary and the device tree; a sketch of the .rodata part follows below.
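As a minimal sketch of that linkage, assuming Zephyr's linker-defs.h symbol names for the .rodata bounds; the surrounding region setup and the power-of-two size handling a v7-R MPU requires are deliberately glossed over:

```c
#include <stddef.h>
#include <stdint.h>

/* .rodata bounds exported by the linker command file */
extern char __rodata_region_start[];
extern char __rodata_region_end[];

/* Candidate base/size for a read-only MPU region covering .rodata; a
 * real implementation would round these to the MPU's alignment and
 * power-of-two size constraints before programming a region.
 */
static inline uintptr_t rodata_base(void)
{
	return (uintptr_t)__rodata_region_start;
}

static inline size_t rodata_size(void)
{
	return (size_t)(__rodata_region_end - __rodata_region_start);
}
```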

In addition to that, I believe that qemu_cortex_r5 might also be broken on the QEMU side. As mentioned before, the memory protection tests pass on actual hardware while QEMU won't even boot. What I've discovered so far is that with the RAM moved to the proper base address, some code still accesses an address located somewhere in the BTCM. If I don't have an MPU mapping that covers at least the front half of the BTCM, Zephyr won't boot even up to the banner. However, on actual hardware I don't need this mapping, so evidently no one's touching the BTCM there, since my debugger didn't stop on any exception. Unfortunately, I'm still fighting gdb in order to step through the code. Most gdb frontends fail because they try to load an aarch32 binary while the Xilinx QEMU gdb stub identifies itself as aarch64, since the A-cores are covered by gdb as well. I've made a little progress getting to the R-cores via gdb's command line, but it's not really usable yet.
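For reference, roughly what such a command-line session can look like when forcing gdb past the stub's aarch64 report; the binary path and port here are illustrative:

```
$ gdb-multiarch build/zephyr/zephyr.elf
(gdb) set architecture arm        # override what the stub advertises
(gdb) target remote :1234         # QEMU gdb stub, default port
(gdb) info registers
```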

I'll keep digging. It's unfortunate that this target caused the CI to fail, I suspect that the issue is more on the target side than on the linker command file side. Basically the same issue the new linker command file was supposed to fix (bloated binaries due to .bss/.noinit being somewhere in the middle of the image) has recently also been reported for aarch64 in #62578.

@fabiobaltieri (Member, Author)

@ibirnbaum thanks again for the investigation work. From what I understand of your explanation, it sounds like qemu here is not representing the underlying hardware correctly and is blocking a change that fixes something on real hardware. Sounds like qemu should be fixed then. Do you think there would be an argument for disabling the tests on qemu-r5 in the meantime?

@microbuilder (Member)

> @ibirnbaum thanks again for the investigation work. From what I understand of your explanation, it sounds like qemu here is not representing the underlying hardware correctly and is blocking a change that fixes something on real hardware. Sounds like qemu should be fixed then. Do you think there would be an argument for disabling the tests on qemu-r5 in the meantime?

I was going to suggest the same thing: disabling the MPU tests on this platform, ideally adding a note explaining why so they can be re-enabled in the future.
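Hypothetically, such an exclusion plus the note could look like this in the affected testcase.yaml files (test name and comment are illustrative):

```yaml
tests:
  kernel.memory_protection:
    # qemu_cortex_r5 excluded for now: suspected Xilinx QEMU model issue,
    # the tests pass on real ZynqMP hardware -- see #61595 for details
    platform_exclude:
      - qemu_cortex_r5
```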

Also, FYI @wmamills who may have some insight into the QEMU problems since I believe this is coming from the Xilinx fork of QEMU.

@microbuilder (Member)

@ibirnbaum Do you want to give this one more try, excluding qemu_cortex_r5 from the MPU tests since it's demonstrably not a problem with the code here, but the QEMU model? That might enable us to get this in before feature freeze.

@ibirnbaum (Member)

@microbuilder What's the timeframe up to the feature freeze? In order to be sure that it's actually QEMU's R5 emulation that's malfunctioning, I'd like to run all of the memory protection testcases on my actual hardware prior to re-opening the linker command file PR. Since this hardware is only bootable via JTAG, that's a relatively time-consuming process. The same applies to my Cortex-A9 hardware (with the MMU in this case), where I'd also like to run the tests on actual hardware, and where doing so takes longer than just running them on qemu_cortex_a9.

@fabiobaltieri (Member, Author)

> @microbuilder What's the timeframe up to the feature freeze?

RC1 is due on Sept 29th: https://github.com/zephyrproject-rtos/zephyr/wiki/Release-Management

@ibirnbaum (Member)

@fabiobaltieri @microbuilder Sorry, I won't get around to much debugging over the next few days, and I'd still like to dig a little deeper into the issues with qemu_cortex_r5. I wouldn't want to exclude a target platform from those relatively important test cases on the basis that one or two manually executed tests on out-of-tree hardware worked better, without yet knowing why.

The bloated binaries are something that should definitely be fixed, but from my point of view, that might just as well happen with the next release rather than the upcoming one.

A word of warning in advance: I ran into issues with massive stack corruption with the recently overhauled aarch32 arch code, and I /think/ that the code related to the TRACING_OBJECT_TRACKING feature is to blame. I hope that I'll get around to narrowing the source of the problem down sometime towards next week, so that all the required info for at least opening an issue report is there.
