Failure to read memory in ARM vmcore captured by dump-capture kernel #461

Hello,

First off, great project, thanks for it!

I've been debugging a nasty kernel oops, capturing vmcore files using a kexec'ed dump-capture kernel on the affected device. (I don't bother using makedumpfile to compress the cores.) I found that trying to get stack traces, e.g. `prog.crashed_thread().stack_trace()`, typically only showed the first stack frame, and then an empty frame at a meaningless address. The `crash` utility meanwhile was able to get a full stack trace, but I wanted drgn's ability to report local variables and structures.

After a fair amount of debugging, I found that drgn's unwinder was doing the right thing in terms of looking in the right place for the next frame's FP. However, when it went to read that memory, it was getting the wrong value. I could check this by doing `prog.read(<virtual memory address of the next FP>, 4)`, which gave a *different* answer than asking `crash` to read the same address. Digging further, I found that the virtual-to-physical address translation was wrong. But I was surprised to find that doing `prog.read(follow_phys(prog["init_mm"].address_of_(), <address>), 4, True)` gave the *correct* answer.
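For anyone wanting to repeat the comparison, here is a minimal sketch of the two reads side by side. `compare_reads` is a hypothetical helper name of mine, not part of drgn; it assumes `prog` is a `drgn.Program` for the vmcore and that `follow_phys` is the helper from `drgn.helpers.linux.mm`, passed in as an argument so the function itself has no import-time dependency.

```python
def compare_reads(prog, follow_phys, vaddr, size=4):
    """Read `size` bytes at `vaddr` two ways and return both results.

    Hypothetical helper for illustration only.
    """
    # Path 1: a plain virtual read, resolved by whatever memory
    # segments the program has registered for that address.
    via_virtual = prog.read(vaddr, size)

    # Path 2: walk the kernel page tables (init_mm) to get the
    # physical address, then do a physical read.
    paddr = follow_phys(prog["init_mm"].address_of_(), vaddr)
    via_page_table = prog.read(paddr, size, True)

    return via_virtual, via_page_table
```

In a drgn session this would be called as `compare_reads(prog, follow_phys, <address>)` after `from drgn.helpers.linux.mm import follow_phys`; on my core the two results differed for addresses in the affected range.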

Looking deeper, I found that `follow_phys()` and `crash` were both referring to the page table to get their lookup data, whereas `prog.read()` was using the PT_LOAD data from the core dump. `readelf` gave:

```
Elf file type is CORE (Core file)
Entry point 0x0
There are 4 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  NOTE           0x001000 0x00000000 0x00000000 0x00d94 0x00d94     0x4
  LOAD           0x002000 0x80000000 0x10000000 0x58000000 0x58000000 RWE 0
  LOAD           0x58002000 0xe8000000 0x78000000 0x7ff0000 0x7ff0000 RWE 0
  LOAD           0x5fff2000 0xf0000000 0x80000000 0x10000000 0x10000000 RWE 0
```

The virtual memory address in question was inside the last range.
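To make the PT_LOAD-based lookup concrete, here is a minimal sketch of how a virtual address resolves against the three `LOAD` headers above. This is my own model of a segment lookup, not drgn's actual code, and `EXAMPLE_VADDR` is a made-up address inside the last range rather than the real one I was chasing.

```python
from typing import NamedTuple, Optional, Tuple


class LoadSegment(NamedTuple):
    offset: int  # file offset of the segment data
    vaddr: int   # virtual address claimed by the header
    paddr: int   # physical address claimed by the header
    filesz: int  # number of bytes present in the file


# PT_LOAD entries copied from the readelf output.
SEGMENTS = [
    LoadSegment(0x00002000, 0x80000000, 0x10000000, 0x58000000),
    LoadSegment(0x58002000, 0xE8000000, 0x78000000, 0x07FF0000),
    LoadSegment(0x5FFF2000, 0xF0000000, 0x80000000, 0x10000000),
]


def find_segment(vaddr: int) -> Optional[LoadSegment]:
    """Return the PT_LOAD segment whose virtual range contains vaddr."""
    for seg in SEGMENTS:
        if seg.vaddr <= vaddr < seg.vaddr + seg.filesz:
            return seg
    return None


def translate(vaddr: int) -> Optional[Tuple[int, int]]:
    """Return (physical address, file offset) according to the headers."""
    seg = find_segment(vaddr)
    if seg is None:
        return None
    delta = vaddr - seg.vaddr
    return seg.paddr + delta, seg.offset + delta


EXAMPLE_VADDR = 0xF0001000  # hypothetical address in the last range
```

With these headers, `translate(EXAMPLE_VADDR)` yields physical address `0x80001000` at file offset `0x5fff3000`. That is the answer the headers give; for the addresses I tested in this range, the page tables gave a different physical address.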

I found that if I ignored that last `LOAD` segment (simply by skipping it in `drgn_program_set_core_dump_fd_internal()`), everything started working. Huzzah!

For now, because I'm on a tight deadline, I don't have time to investigate why this last segment may be incorrect. But I do note that the last segment is 256MB in size, which is exactly the amount of memory reserved by `crashkernel` for the dump-capture kernel. (I could try changing that size to see whether the segment changes too, but I haven't, and I know next to nothing about core dumps, so I can't say offhand whether it is or isn't related.)
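The arithmetic behind that observation can be checked directly from the program headers. The sketch below is just a sanity check on the numbers in the readelf output; `describe` is a throwaway helper name, and nothing here establishes the cause.

```python
MIB = 1024 * 1024

# (VirtAddr, PhysAddr, MemSiz) for each LOAD header in the readelf output.
SEGMENTS = [
    (0x80000000, 0x10000000, 0x58000000),
    (0xE8000000, 0x78000000, 0x07FF0000),
    (0xF0000000, 0x80000000, 0x10000000),
]


def describe(segments):
    """Return (size in MiB, virtual-minus-physical offset) per segment."""
    return [(size / MIB, vaddr - paddr) for vaddr, paddr, size in segments]


for size_mib, delta in describe(SEGMENTS):
    print(f"{size_mib:10.4f} MiB  virt - phys = {delta:#x}")
```

This confirms that the last segment is exactly 256 MiB. It also shows that all three headers use the same virtual-minus-physical offset, `0x70000000`, so the headers describe the last range as part of the same linear mapping as the first two. I don't know whether that is the root of the problem, but it is the claim that the page tables contradicted for the addresses I tested.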

Incidentally, while trying various things before this workaround, I also tried using libkdumpfile to read the core, but ran into the issue that https://github.com/osandov/drgn/blob/970b9a085790a9b23325bbd06f120fdc2d7d664a/libdrgn/python/main.c#L5 is `WITH_KDUMPFILE` while I think that the correct line would be `WITH_LIBKDUMPFILE`. However, changing that and using libkdumpfile didn't resolve my problem.

Thanks again for a great project.
