Proposal to change virtual memory layout #1196

wkozaczuk · 2022-05-06T22:16:42Z

Instead of repeating what was written on the emailing list, I am pasting it here verbatim:

"In the last couple of weeks, I have been working on various issues related to aarch64 port. I have managed to make good progress and I will be sending new patches soon.

Two issues had to do with making Java run on aarch64 - #1145 and #1157. After exchanging some emails on the OpenJDK emailing list and researching this problem, I finally discovered that the problem only happens when JIT is enabled and is caused by the fact that the JIT compiler generates machine code to access an arbitrary address in memory in a way that assumes all addresses are 48 bits, meaning first 16 bits are 0. And here are the details:

"Once I got hold of the JDK debuginfo files and identified the patching code - MacroAssembler::pd_patch_instruction(), I was able to put a breakpoint in it and see something very revealing:

#0  MacroAssembler::pd_patch_instruction_size (branch=0x20000879465c "\351\377\237\322\351\377\277\362\351\377\337\362\n\243\352\227\037",
    target=0xffffa00042c862e0 "\020zB") at src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:75
#1  0x0000100000bc13cc in MacroAssembler::pd_patch_instruction (file=0x0, line=0, target=0xffffa00042c862e0 "\020zB", branch=<optimized out>)
    at src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp:626
#2  NativeMovConstReg::set_data (this=0x20000879465c, x=-105551995837728) at src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp:262
#3  0x0000100000850bd0 in CompiledIC::set_ic_destination_and_value (value=0xffffa00042c862e0,
    entry_point=0x20000823d290 "(\b@\271\b\001]\322*\005@\371\037\001\n\353,\001", <incomplete sequence \371\200>, this=<optimized out>)
    at src/hotspot/share/code/compiledIC.hpp:193
#4  ICStub::finalize (this=<optimized out>) at src/hotspot/share/code/icBuffer.cpp:91
#5  ICStubInterface::finalize (this=<optimized out>, self=<optimized out>) at src/hotspot/share/code/icBuffer.cpp:43
#6  0x0000100000e30958 in StubQueue::stub_finalize (this=0xffffa00041555300, s=<optimized out>) at src/hotspot/share/code/stubs.hpp:168
#7  StubQueue::remove_first (this=0xffffa00041555300) at src/hotspot/share/code/stubs.cpp:175
...

The corresponding crash value of X9 was this:

0x0000a00042c862e0

vs the target argument of pd_patch_instruction() (see above in the backtrace):

0xffffa00042c862e0

Now given this comment:

// Move a constant pointer into r. In AArch64 mode the virtual
// address space is 48 bits in size, so we only need three
// instructions to create a patchable instruction sequence that can
// reach anywhere.

and this fragment of pd_patch_instruction() - https://github.com/openjdk/jdk17u/blob/6f0f42630eac1febf562062afc523fdf3d2a920a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L152-L161

it seems that the code to load x8 register with an address gets patched with 0x0000a00042c862e0 instead of 0xffffa00042c862e0. It is interesting that this assert - https://github.com/openjdk/jdk17u/blob/6f0f42630eac1febf562062afc523fdf3d2a920a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L77 - does not get hit.

The bottom line is that the valid address 0xffffa00042c862e0 gets truncated to 0x0000a00042c862e0 I guess based on the assumption that in Linux all userspace addresses are 48-bits long (see https://www.kernel.org/doc/html/latest/arm64/memory.html). In OSv unikernel, there is no separation between user space and kernel space, and it happens that addresses returned by malloc fall into this range:

0xffffa00000000000 - 0xffffafffffffffff

So I guess the only solution to fix it on the OSv side would be to tweak its virtual memory mapping for mallocs and make sure it never uses virtual addresses > 48-bits."

Currently, OSv maps this part of virtual memory like so:

 ------  0x ffff 8000 0000 0000  phys_mem      --\ 
 |    |                                          |- Main Area - 16T 
 ------  0x ffff 9000 0000 0000                --X 
 |    |                                          |- Page Area - 16T 
 ------  0x ffff a000 0000 0000                --X 
 |    |                                          |- Mempool Area - 16T 
 ------  0x ffff b000 0000 0000                --X 
 |    |                                          |- Debug Area - 80T 
 ------  0x ffff ffff ffff ffff                --/

I wonder if this was an arbitrary choice made early in OSv design and if there was some good reason for it.

Nadav comments:

Well, the story for x86 address is as follows (as far as I remember):

Although x86 is nominally a 64-bit address space, it didn't fully support the entire 64 bits and doesn't even now.
Rather (see a good explanation in https://en.wikipedia.org/wiki/X86-64, "canonical form address") it only supported 48 bits.
Moreover, all the highest bits must be copies of bit 47. So basically you have 47 bits (128 TB) with all highest bits 0, and another 128 TB with the highest bits 1 - these are the 0xfffff... addresses.

So it was convenient for OSv to divide the address space with one half for mmap and one half for malloc.

If we want OSv to only use the lower 128 TB of address space, that should be fairly easy to fix, we just need to check if we don't have code checking things like "address < 0" to know if something was malloced because that will no longer be correct.

The fact we'll only allow 128 TB of virtual addresses is fine - AFAIR (see https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt) Linux allows even less - it uses bit 47 (and all higher bits are the same) to divide kernel and user space addresses, so kernel only has 128 TB of address space, but half of it is devoted to mapping user-space memory in the kernel, so the user space is actually limited to 64 TB of address space.

All that being said, the address structure is likely to be completely different in aarch64. Maybe they just use the 48 lower bits and don't use the "canonical address" stuff. In this case aarch64 should certainly check the bit 47, not 63, for the malloc bit, and NOT sign-extend addresses (don't add 0xfff.. if bit 47 is 1).

Could this be changed to this:

------  0x 0000 8000 0000 0000  phys_mem      --\ 
 |    |                                          |- Main Area - 16T 
 ------  0x 0000 9000 0000 0000                --X 
 |    |                                          |- Page Area - 16T 
 ------  0x 0000 a000 0000 0000                --X 
 |    |                                          |- Mempool Area - 16T 
 ------  0x 0000 b000 0000 0000                --X 
 |    |                                          |- Debug Area - 80T 
 ------  0x 0000 ffff ffff ffff                --/

Indeed, this would be the division for aarch64, for x86-64 you can't do it like this because these aren't canonical addresses - the addresses with the bit 47 on also need to set all bits 48..63 to on as well, and you end up with what we did. On x86-64 you can drop one bit of address space, and split the memory on bit 46 (so bit 47 is always 0).

I did manage to hack the code for aarch64 and it seems to be working.

I also found a similar case for x86_64. The library rapidjson used by dotnet uses last 16 bits of addresses to compress some info into it - see Tencent/rapidjson#546 (comment).

Wow. I have deja-vu that we have seen something similar in the past in some Boost locking functions or something, which also try to reuse various bits on pointers, but I can't find the explanation now.

Now going forward I think Linux will extend the userspace addresses eventually from 48 bits to 56 bits (see https://en.wikipedia.org/wiki/Intel_5-level_paging) or higher. And dotnet actually made a fix to disable this high 16-bits hack. But given there are Linux apps that may assume that addresses are 48-bit and take advantage of it, it might be wise to change the OSv virtual memory layout to use the lower part only (<= 0x 0000 ffff ffff ffff).

Right, but again, I think it can't be <= 0x 0000 ffff ffff ffff it needs to be <= 0x 0000 7fff ffff ffff because of the sign extension thing.
I think it is certainly fine to lose one bit of address space - as I noted above Linux actually lost two, and Windows for many years lost even more.

If we aren't sure we can make even this a compile-time flag.

I think it's a good idea. It might not be trivial to fix it though, because we might have code which instead of calling a proper function to check if some memory is allocated by malloc or mmap,
does something like if (p < 0). But I hope if we have such code there isn't too much of it...

"

The text was updated successfully, but these errors were encountered:

nyh closed this as completed in 6de9f41 Oct 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal to change virtual memory layout #1196

Proposal to change virtual memory layout #1196

wkozaczuk commented May 6, 2022 •

edited

Loading

Proposal to change virtual memory layout #1196

Proposal to change virtual memory layout #1196

Comments

wkozaczuk commented May 6, 2022 • edited Loading

wkozaczuk commented May 6, 2022 •

edited

Loading