
hifive on MSC+SMP too often fails IPC0001 #64

Closed
lsf37 opened this issue Dec 12, 2021 · 10 comments

Comments

@lsf37
Member

lsf37 commented Dec 12, 2021

The config HIFIVE_verification_SMP_MCS_gcc_64 now seems to fail the test IPC0001 very often, hanging at that test.

This may be a result of the new build environment, which downgrades riscv-gcc from version 10 to version 8 (as opposed to the upgrade from 8 to 10 on all other platforms).

The other configurations on hifive pass, and this configuration passes on other boards.

For a sample, see https://github.com/seL4/seL4/runs/4494651972?check_suite_focus=true#step:4:12004

@lsf37
Member Author

lsf37 commented Dec 12, 2021

Instead of "too often", I should say "almost always". I think I have seen it pass once, but some other board had a problem for that test run.

@lsf37
Member Author

lsf37 commented Dec 20, 2021

Just confirming that this is indeed related to the build environment. The tests pass fine for the same configuration if the image was built with the old docker images.

@lsf37
Member Author

lsf37 commented Jan 9, 2022

So, it looks like something might actually be properly broken in SMP on RISCV: I've upgraded to gcc-11 temporarily, and now we're getting a consistent failure (timeout) on HIFIVE_debug_SMP_gcc_64 (release and verification builds succeed).

We also have SCHED0021 failing on HIFIVE_release_SMP_MCS_gcc_64, not clear if that is related or not.

Here is an example run.

The docker container with gcc 11.1.0 for riscv64 is trustworthysystems/sel4-riscv:latest (sha256:20bf07826ac0a1c81f9a620d21023ff0fe84e1300b03d6f804836da3cfcd1c75). It is pushed to docker hub, so you can pull it down, but I haven't updated the docker file repo yet, because this is still in testing and the -riscv images are not otherwise used any more.

@kent-mcleod
Member

I think this issue is because, on SMP, there is a chance that the kernel thinks the hart ID for hart 1 is actually hart 0 on the hifive, because of this line: https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-riscv/crt0.S#L92. It should use CONFIG_FIRST_HART_ID instead of 0, and I think the comment should refer to a0 instead of a1.
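To illustrate the idea (a paraphrase, not a verbatim copy of crt0.S): OpenSBI passes the current hart ID in a0, and on the hifive the first hart that can run S-mode code is hart 1, so a hard-coded comparison against 0 misclassifies harts:

```asm
/* Sketch only -- not the exact crt0.S code. OpenSBI passes the
 * current hart ID in a0 (and the DTB pointer in a1). */
    li   t0, CONFIG_FIRST_HART_ID   /* was a literal 0, which is   */
                                    /* wrong on the hifive, where  */
                                    /* the first S-mode hart is 1  */
    beq  a0, t0, primary_hart       /* boot hart takes this path   */
```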

@lsf37
Member Author

lsf37 commented Feb 2, 2022

Very nice find!

@kent-mcleod
Member

> I think this issue is because on SMP there is a chance that the kernel thinks the HART ID for HART 1 is actually HART 0 on the hifive because of this line: https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-riscv/crt0.S#L92 which should be CONFIG_FIRST_HART_ID instead of 0, and I think that the comment should refer to a0 instead of a1.

This is a bit wrong. The cause of the test failure is that the elfloader thinks one of the cores has hart ID 0, when the valid range of IDs that can run in S-mode is 1-4 inclusive. At some point the ID gets lost:

Platform Name       : SiFive Freedom U540
Platform Features   : timer,mfdeleg
Platform HART Count : 4
Boot HART ID        : 2
Boot HART ISA       : rv64imafdcsu
BOOT HART Features  : pmp,scounteren,mcounteren
BOOT HART PMP Count : 16
Firmware Base       : 0x80000000
Firmware Size       : 100 KB
Runtime SBI Version : 0.2

MIDELEG : 0x0000000000000222
MEDELEG : 0x000000000000b109
PMP0    : 0x0000000080000000-0x000000008001ffff (A)
PMP1    : 0x0000000000000000-0x0000007fffffffff (A,R,W,X)
ELF-loader started on (HART 1) (NODES 4)
  paddr=[80200000..80625047]
Looking for DTB in CPIO archive...found at 8021dd58.
Loaded DTB from 8021dd58.
   paddr=[84022000..84024fff]
ELF-loading image 'kernel' to 84000000
  paddr=[84000000..84021fff]
  vaddr=[ffffffff84000000..ffffffff84021fff]
  virt_entry=ffffffff84000000
ELF-loading image 'sel4test-driver' to 84025000
  paddr=[84025000..8444bfff]
  vaddr=[10000..436fff]
  virt_entry=1c3be
Main entry hart_id:1
Secondary entry hart_id:0 core_id:1
Secondary entry hart_id:4 core_id:2
Hart ID 1 core ID 0
Hart ID 0 core ID 1
Hart ID 4 core ID 2
Secondary entry hart_id:3 core_id:3
Hart ID 3 core ID 3
Enabling MMU and paging

And when it works:

  Platform Name       : SiFive Freedom U540
  Platform Features   : timer,mfdeleg
  Platform HART Count : 4
  Boot HART ID        : 1
  Boot HART ISA       : rv64imafdcsu
  BOOT HART Features  : pmp,scounteren,mcounteren
  BOOT HART PMP Count : 16
  Firmware Base       : 0x80000000
  Firmware Size       : 100 KB
  Runtime SBI Version : 0.2
  
  MIDELEG : 0x0000000000000222
  MEDELEG : 0x000000000000b109
  PMP0    : 0x0000000080000000-0x000000008001ffff (A)
  PMP1    : 0x0000000000000000-0x0000007fffffffff (A,R,W,X)
  ELF-loader started on (HART 1) (NODES 4)
    paddr=[80200000..805fb047]
  Looking for DTB in CPIO archive...found at 80219e00.
  Loaded DTB from 80219e00.
     paddr=[8401e000..84020fff]
  ELF-loading image 'kernel' to 84000000
    paddr=[84000000..8401dfff]
    vaddr=[ffffffff84000000..ffffffff8401dfff]
    virt_entry=ffffffff84000000
  ELF-loading image 'sel4test-driver' to 84021000
    paddr=[84021000..84427fff]
    vaddr=[10000..416fff]
    virt_entry=1bede
  Main entry hart_id:1
  Hart ID 1 core ID 0
  Secondary entry hart_id:4 core_id:3
  Secondary entry hart_id:2 core_id:1
  Secondary entry hart_id:3 core_id:2
  Hart ID 4 core ID 3
  Hart ID 2 core ID 1
  Hart ID 3 core ID 2
  Enabling MMU and paging

This corruption is due to register s0 getting overwritten, likely during the call to clear_bss with no stack pointer set. So when the hart ID of the boot core is later restored to a0 from s0, it has become 0.

seL4/seL4_tools#135 solves this issue as it sets the stack pointer before clear_bss is called.
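The shape of that fix, sketched (symbol names here are illustrative, not the actual seL4_tools#135 diff): stash the hart ID in a callee-saved register, establish a valid stack before calling any C code, and only then clear .bss, so spills and restores of s0 land in real stack memory:

```asm
/* Sketch only; not the actual diff from seL4/seL4_tools#135. */
    mv   s0, a0          /* stash the boot hart ID from a0 in s0   */
    la   sp, stack_top   /* set up a valid stack pointer FIRST     */
    jal  clear_bss       /* C code may now spill/restore s0 safely */
    mv   a0, s0          /* hart ID restored intact for the kernel */
```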

@axel-h
Member

axel-h commented Feb 2, 2022

There are actually more issues with the multicore boot when switching harts. I've tried to fix them in the last commit in seL4/seL4_tools#132 and will put this on top of seL4/seL4_tools#135, because the fix for the stack setup comes in handy there as well.

@kent-mcleod
Member

I moved the changes to resolve early boot issues into their own PR: seL4/seL4_tools#136

@kent-mcleod
Member

This should now be resolved.

@lsf37
Member Author

lsf37 commented Feb 6, 2022

Yes, tests on the hifive seem to be running smoothly now again. Thanks for figuring that one out!
