Failure to read memory in ARM vmcore captured by dump-capture kernel #461

Open
@alecrivers


Hello,

First off, great project, thanks for it!

I've been debugging a nasty kernel oops, capturing vmcore files using a kexec'ed dump-capture kernel on the affected device. (I don't bother using makedumpfile to compress the cores.) I found that trying to get stack traces, e.g. prog.crashed_thread().stack_trace(), typically only showed the first stack frame, and then an empty frame at a meaningless address. The crash utility meanwhile was able to get a full stack trace, but I wanted drgn's ability to report local variables and structures.

Doing a bunch of debugging, I found that drgn's unwinder was doing the right thing in terms of looking in the right place for the next frame's FP. However, when it went to read that memory, it was getting the wrong value. I could check this by doing prog.read(<virtual memory address of the next FP>), which gave a different answer than asking crash to read the same address. Digging further, I found that the physical memory address translation was wrong. But I was surprised to find that doing prog.read(follow_phys(prog["init_mm"].address_of_(), <address>), 4, True) gave the correct answer.
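A minimal sketch of that comparison, in case it helps anyone reproduce it. `compare_reads` and its `translate` parameter are names made up for this illustration; the drgn calls it relies on are `prog.read(address, size, physical)` and `follow_phys()` from `drgn.helpers.linux.mm`, and it assumes a program already loaded with the vmcore and kernel debug info.

```python
def compare_reads(prog, vaddr, size=4, translate=None):
    """Read `size` bytes at `vaddr` two ways and report whether they agree.

    `prog` is expected to behave like a drgn.Program. `translate` is a
    hypothetical hook (virtual address -> physical address); by default it
    walks the kernel page tables with drgn's follow_phys() helper.
    """
    if translate is None:
        # Imported lazily so the function can be exercised without drgn.
        from drgn.helpers.linux.mm import follow_phys

        mm = prog["init_mm"].address_of_()
        translate = lambda addr: follow_phys(mm, addr)

    # Plain virtual read: in this report, this is the read that came back wrong.
    direct = prog.read(vaddr, size)

    # Translate via the page tables, then do a physical read.
    paddr = int(translate(vaddr))
    via_page_table = prog.read(paddr, size, True)

    return direct, via_page_table, direct == via_page_table
```

In a drgn session this would be called as `compare_reads(prog, addr)`; a `False` in the last element means the two paths disagree, which is what I was seeing for addresses in the affected range.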

Looking deeper, I found that follow_phys() and crash were both referring to the page table to get their lookup data, whereas prog.read() was using the PT_LOAD data from the core dump. readelf gave:

Elf file type is CORE (Core file)
Entry point 0x0
There are 4 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  NOTE           0x001000 0x00000000 0x00000000 0x00d94 0x00d94     0x4
  LOAD           0x002000 0x80000000 0x10000000 0x58000000 0x58000000 RWE 0
  LOAD           0x58002000 0xe8000000 0x78000000 0x7ff0000 0x7ff0000 RWE 0
  LOAD           0x5fff2000 0xf0000000 0x80000000 0x10000000 0x10000000 RWE 0

The virtual memory address in question was inside the last range.
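To make the segment-based translation concrete, here is a small sketch that redoes the lookup by hand from the program headers above. It is not drgn's code, only the arithmetic the table implies; `Segment`, `SEGMENTS`, and `lookup` are names invented for the example, and `0xf0001000` is a made-up address inside the last range.

```python
from typing import NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    offset: int  # file offset of the segment in the vmcore
    vaddr: int   # VirtAddr from the program header
    paddr: int   # PhysAddr from the program header
    filesz: int  # FileSiz from the program header


# The three PT_LOAD entries copied from the readelf output above.
SEGMENTS = [
    Segment(0x00002000, 0x80000000, 0x10000000, 0x58000000),
    Segment(0x58002000, 0xE8000000, 0x78000000, 0x07FF0000),
    Segment(0x5FFF2000, 0xF0000000, 0x80000000, 0x10000000),
]


def lookup(vaddr: int) -> Optional[Tuple[int, int]]:
    """Return (physical address, file offset) for vaddr, or None if unmapped."""
    for seg in SEGMENTS:
        if seg.vaddr <= vaddr < seg.vaddr + seg.filesz:
            delta = vaddr - seg.vaddr
            return seg.paddr + delta, seg.offset + delta
    return None


if __name__ == "__main__":
    result = lookup(0xF0001000)  # hypothetical address in the last segment
    print([hex(x) for x in result])
    # Every segment uses the same virtual-minus-physical offset.
    print({hex(seg.vaddr - seg.paddr) for seg in SEGMENTS})
```

One thing this makes visible: all three segments have the same virtual-to-physical offset (0x70000000), so the headers describe one linear mapping throughout. Whether that holds for the last range on this device is exactly what I have not verified.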

I found that if I ignored that last memory section (simply by skipping it in drgn_program_set_core_dump_fd_internal()), everything started working. Huzzah!

For now, because I'm on a tight deadline, I don't have time to investigate why this last segment may be incorrect. But I do note that it is 256MB in size, which is exactly the amount of memory reserved by crashkernel for the dump-capture kernel. (I could try changing that size to see whether the segment changes too, but I haven't, and I know next to nothing about core dumps, so I can't say offhand whether it is related.)

Incidentally, while trying various things before this workaround, I also tried using libkdumpfile to read the core, but ran into the issue that the guard

#ifdef WITH_KDUMPFILE

tests WITH_KDUMPFILE, while I think the correct macro would be WITH_LIBKDUMPFILE. However, changing that and using libkdumpfile didn't resolve my problem.

Thanks again for a great project.
