Description
Hello,
First off, great project, thanks for it!
I've been debugging a nasty kernel oops, capturing vmcore files using a kexec'ed dump-capture kernel on the affected device. (I don't bother using makedumpfile to compress the cores.) I found that trying to get stack traces, e.g. `prog.crashed_thread().stack_trace()`, typically only showed the first stack frame, and then an empty frame at a meaningless address. The `crash` utility, meanwhile, was able to get a full stack trace, but I wanted drgn's ability to report local variables and structures.
After a bunch of debugging, I found that drgn's unwinder was doing the right thing in terms of looking in the right place for the next frame's FP. However, when it went to read that memory, it was getting the wrong value. I could check this by doing `prog.read(<virtual memory address of the next FP>)`, which gave a different answer than asking `crash` to read the same address. Digging further, I found that the virtual-to-physical address translation was wrong. But I was surprised to find that doing `prog.read(follow_phys(prog["init_mm"].address_of_(), <address>), 4, True)` gave the correct answer.
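The comparison described above can be sketched as a small helper. This is only an illustration: `compare_reads` and `FakeProgram` are hypothetical names, and the addresses and byte values are made up so the snippet runs without a vmcore. With drgn, one would presumably pass the real `prog` and something like `lambda a: follow_phys(prog["init_mm"].address_of_(), a)`, mirroring the call quoted above.

```python
def compare_reads(prog, translate, vaddr, size=4):
    """Read `size` bytes at `vaddr` two ways and report whether they agree.

    prog      -- object with a drgn-style read(address, size, physical) method
    translate -- callable mapping a virtual address to a physical address
    """
    # Path 1: let the program object resolve the virtual address itself.
    virtual_read = prog.read(vaddr, size, False)
    # Path 2: translate explicitly, then read the physical address.
    paddr = int(translate(vaddr))
    physical_read = prog.read(paddr, size, True)
    return {
        "paddr": paddr,
        "virtual_read": virtual_read,
        "physical_read": physical_read,
        "match": virtual_read == physical_read,
    }


class FakeProgram:
    """Stand-in for a drgn Program so the sketch runs without a core dump."""

    def __init__(self, virtual, physical):
        self._virtual = virtual
        self._physical = physical

    def read(self, address, size, physical=False):
        table = self._physical if physical else self._virtual
        return table[address][:size]


# Hypothetical addresses and contents, chosen only to show a mismatch.
fake = FakeProgram(
    virtual={0xF0001000: b"\x11\x11\x11\x11"},
    physical={0x20001000: b"\x22\x22\x22\x22"},
)
result = compare_reads(fake, lambda a: 0x20001000, 0xF0001000)
print(result["match"])
```

A `match` of `False` for an address that `crash` reads consistently with the physical path would point at the virtual-address path, which is the situation described here.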
Looking deeper, I found that `follow_phys()` and `crash` were both referring to the page table to get their lookup data, whereas `prog.read()` was using the PT_LOAD data from the core dump. `readelf` gave:
Elf file type is CORE (Core file)
Entry point 0x0
There are 4 program headers, starting at offset 52
Program Headers:
  Type  Offset     VirtAddr   PhysAddr   FileSiz    MemSiz     Flg Align
  NOTE  0x001000   0x00000000 0x00000000 0x00d94    0x00d94        0x4
  LOAD  0x002000   0x80000000 0x10000000 0x58000000 0x58000000 RWE 0
  LOAD  0x58002000 0xe8000000 0x78000000 0x7ff0000  0x7ff0000  RWE 0
  LOAD  0x5fff2000 0xf0000000 0x80000000 0x10000000 0x10000000 RWE 0
The virtual memory address in question was inside the last range.
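To make the PT_LOAD-based lookup concrete, here is a minimal sketch of how a virtual address maps to a physical address and file offset if only the three LOAD headers above are consulted. The segment values are copied from the `readelf` output; the function name `translate_via_pt_load` and the example address `0xf0001000` are hypothetical (the actual faulting address is not given in this report).

```python
# (file offset, virtual address, physical address, size) per LOAD header,
# copied from the readelf output above.
SEGMENTS = [
    (0x00002000, 0x80000000, 0x10000000, 0x58000000),
    (0x58002000, 0xE8000000, 0x78000000, 0x07FF0000),
    (0x5FFF2000, 0xF0000000, 0x80000000, 0x10000000),
]


def translate_via_pt_load(vaddr):
    """Return (physical address, file offset) implied by the LOAD headers,
    or None if no segment covers the virtual address."""
    for offset, virt, phys, size in SEGMENTS:
        if virt <= vaddr < virt + size:
            delta = vaddr - virt
            return phys + delta, offset + delta
    return None


# Hypothetical address inside the last segment.
paddr, file_offset = translate_via_pt_load(0xF0001000)
print(hex(paddr), hex(file_offset))  # 0x80001000 0x5fff3000
```

If the page table maps the same virtual address to a different physical address, the two lookups return different bytes, which would match the discrepancy seen between `prog.read()` and `follow_phys()`.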
I found that if I ignored that last memory segment (simply by skipping it in `drgn_program_set_core_dump_fd_internal()`), everything started working. Huzzah!
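The workaround itself was a change to drgn's C code, which is not reproduced here. As a rough illustration of the same idea, the sketch below walks the program headers of a 32-bit ELF core and drops a chosen PT_LOAD entry. It assumes a little-endian ELF32 file (the 52-byte header offset reported by `readelf` is consistent with ELF32; the byte order is an assumption). `load_segments`, `skip`, and `build_demo_header` are hypothetical names, and the demo header is synthetic, built from the values in the `readelf` output.

```python
import struct

PT_LOAD = 1
PT_NOTE = 4
ET_CORE = 4
EHDR32 = struct.Struct("<16sHHIIIIIHHHHHH")  # ELF32 file header, 52 bytes
PHDR32 = struct.Struct("<IIIIIIII")          # ELF32 program header, 32 bytes


def load_segments(data, skip=lambda index, seg: False):
    """Return the PT_LOAD segments of an ELF32 little-endian image,
    omitting any for which skip(index, segment) is true."""
    ehdr = EHDR32.unpack_from(data, 0)
    ident = ehdr[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("expected a little-endian ELF32 image")
    e_phoff, e_phentsize, e_phnum = ehdr[5], ehdr[9], ehdr[10]
    segments = []
    for i in range(e_phnum):
        (p_type, p_offset, p_vaddr, p_paddr,
         p_filesz, p_memsz, _flags, _align) = PHDR32.unpack_from(
            data, e_phoff + i * e_phentsize)
        if p_type != PT_LOAD:
            continue
        seg = {"offset": p_offset, "vaddr": p_vaddr, "paddr": p_paddr,
               "filesz": p_filesz, "memsz": p_memsz}
        if skip(i, seg):
            continue  # the workaround: pretend this segment is absent
        segments.append(seg)
    return segments


def build_demo_header():
    """Synthetic ELF32 header plus the four program headers from above."""
    ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
    phdrs = [
        (PT_NOTE, 0x001000, 0x00000000, 0x00000000, 0x00D94, 0x00D94, 0, 4),
        (PT_LOAD, 0x002000, 0x80000000, 0x10000000, 0x58000000, 0x58000000, 7, 0),
        (PT_LOAD, 0x58002000, 0xE8000000, 0x78000000, 0x7FF0000, 0x7FF0000, 7, 0),
        (PT_LOAD, 0x5FFF2000, 0xF0000000, 0x80000000, 0x10000000, 0x10000000, 7, 0),
    ]
    ehdr = EHDR32.pack(ident, ET_CORE, 0, 1, 0, EHDR32.size, 0, 0,
                       EHDR32.size, PHDR32.size, len(phdrs), 0, 0, 0)
    return ehdr + b"".join(PHDR32.pack(*p) for p in phdrs)


demo = build_demo_header()
all_segments = load_segments(demo)
kept = load_segments(demo, skip=lambda i, seg: seg["vaddr"] == 0xF0000000)
print(len(all_segments), len(kept))
```

What a reader does with addresses in the dropped range afterwards (for example, falling back to page-table translation) depends on the tool and is not something this sketch models.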
For now, because I'm on a tight deadline, I don't have time to investigate why this last segment may be incorrect. But I do note that the last segment is 256MB in size, which is exactly the amount of memory reserved by `crashkernel` for the dump-capture kernel. (I could try changing that reservation to see whether the segment size changes with it, but I haven't, and I know next to nothing about core dumps, so I can't say offhand whether it obviously is or isn't related.)
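The size arithmetic can be checked directly from the header values. The sketch below converts each LOAD segment's size to MiB and also computes the holes between consecutive segments in physical address space. The names are hypothetical, and the numbers come only from the `readelf` output above.

```python
MIB = 1024 * 1024

# (physical address, size) per LOAD header, from the readelf output above.
LOADS = [
    (0x10000000, 0x58000000),
    (0x78000000, 0x07FF0000),
    (0x80000000, 0x10000000),
]

# Size of each segment in MiB.
sizes_mib = [size / MIB for _phys, size in LOADS]

# Unmapped gap between the end of one segment and the start of the next,
# measured in physical address space.
gaps = [
    LOADS[i + 1][0] - (LOADS[i][0] + LOADS[i][1])
    for i in range(len(LOADS) - 1)
]

print(sizes_mib)               # [1408.0, 127.9375, 256.0]
print([hex(g) for g in gaps])  # ['0x10000000', '0x10000']
```

The last segment is indeed 256 MiB, and the hole between the first and second segments in physical space is also 0x10000000 bytes. The numbers alone don't establish whether either figure corresponds to the `crashkernel` reservation.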
Incidentally, while trying various things before this workaround, I also tried using libkdumpfile to read the core, but ran into the issue that the referenced line (line 5 in 970b9a0) uses `WITH_KDUMPFILE`, while I think that the correct line would be `WITH_LIBKDUMPFILE`. However, changing that and using libkdumpfile didn't resolve my problem.
Thanks again for a great project.