Skip to content

createdump improvements for better VS4Mac Watson bucketing #71863

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 13, 2022

Conversation

mikem8361
Copy link
Contributor

@mikem8361 mikem8361 commented Jul 8, 2022

Customer Impact

These changes will help improving and diagnose unactionable Watson VS4Mac bucketing and reduce the size of MacOS core dumps.

Details

Add target process terminated/alive message.

Smaller MacOS dumps. Don't add share_mode == SM_EMPTY regions.

Add crashreport success status message for VS4Mac.

Launch createdump from SIGTERM handler directly to reduce the time it takes to get the crash report/dump for VS4Mac.

Testing

Verified by the VS4Mac team.

Risk

Low risk because mostly createdump changes. Minor ones in the runtime.

@mikem8361 mikem8361 requested review from janvorli and hoyosjs July 8, 2022 23:24
@mikem8361 mikem8361 self-assigned this Jul 8, 2022
@ghost
Copy link

ghost commented Jul 8, 2022

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

Issue Details

Customer Impact

These changes will help improving and diagnose unactionable Watson VS4Mac bucketing.

Details

Add target process terminated/alive message.

Smaller MacOS dumps. Don't add share_mode == SM_EMPTY regions.

Add crashreport success status message for VS4Mac.

Launch createdump from SIGTERM handler directly to reduce the time it takes to get the crash report/dump for VS4Mac.

Testing

Verified by the VS4Mac team.

Risk

Low risk because mostly createdump changes. Minor ones in the runtime.

Author: mikem8361
Assignees: mikem8361
Labels:

area-Diagnostics-coreclr

Milestone: -

@mikem8361
Copy link
Contributor Author

@kdubau and @sgmunn can you verify these changes? There will be some runtime packages available as soon as the CI jobs finish.

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@teo-tsirpanis teo-tsirpanis added this to the 6.0.x milestone Jul 9, 2022
@sgmunn
Copy link

sgmunn commented Jul 11, 2022

I did three tests, the first with a triage dump that we use for delays, the second a force quit and a third with a pinvoke crash. The second produced the crash report but failed to write out the core dump (I'm sure this used to work on this machine in the past, but this isn't attributed to this build as I am also seeing issues with a different build).

I can see the log messages that were added. I can see that the core dump is ~50% size (for the very simple start vsmac and then crash from the start page scenario) - 1.4GB vs 3GB on disk, uncompressed.

This looks like it is working as expected, although I cannot specifically validate the sigterm case.

ui delay (arm64)

[createdump] Gathering state for process 15718 
[createdump] Writing crash report to file /var/folders/5z/s0nz7p916nl6lkd00qkkbrym0000gn/T/uidelay-15718-ec1981.dmp.crashreport.json
[createdump] Crash report successfully written
[createdump] Writing triage minidump to file /var/folders/5z/s0nz7p916nl6lkd00qkkbrym0000gn/T/uidelay-15718-ec1981.dmp
[createdump] Written 97616456 bytes (5958 pages) to core file
[createdump] Target process is alive
[createdump] Dump successfully written

Sigterm (arm64)

[createdump] Gathering state for process 15718 
[createdump] Crashing thread 0028b168 signal 0000000f
[createdump] Writing crash report to file /var/folders/5z/s0nz7p916nl6lkd00qkkbrym0000gn/T/coredump.15718.crashreport.json
[createdump] Crash report successfully written
[createdump] Writing minidump with heap to file /var/folders/5z/s0nz7p916nl6lkd00qkkbrym0000gn/T/coredump.15718
[createdump] ReadProcessMemory(0000000283cac000, 00004000) FAILED
[createdump] Writing dump FAILED
[createdump] Target process is alive

pinvoke crash

[createdump] Gathering state for process 16093 
[createdump] Crashing thread 0028e3f2 signal 0000000b
[createdump] Writing crash report to file /var/folders/5z/s0nz7p916nl6lkd00qkkbrym0000gn/T/coredump.16093.crashreport.json
[createdump] Crash report successfully written
[createdump] Writing minidump with heap to file /var/folders/5z/s0nz7p916nl6lkd00qkkbrym0000gn/T/coredump.16093
[createdump] Written 1575076472 bytes (96135 pages) to core file
[createdump] Target process is alive
[createdump] Dump successfully written

@janvorli
Copy link
Member

[createdump] Writing dump FAILED

@mikem8361 I wonder what it really means when we fail to read a block of memory from the target. Is it possible that we might be able to read other parts of the memory without problems and such failure is just a race between our enumeration of blocks vs threads running in the target at that time or something like that?

@mikem8361
Copy link
Contributor Author

The process has been frozen with task_suspend() so all the threads shouldn't be running. All the blocks that we are writing have been validated (by reading a byte per page) before being added to the memory region list in the "gather" phase. But the "Target process is alive" message after the ReadProcessMemory failure doesn't make sense. Our working theory for these failures was that the system terminated the process between the verification and dump writing even while suspended. Either the process is alive checking isn't working correctly or threads running somehow.

@mikem8361
Copy link
Contributor Author

/cc: @leculver, @tommcdon

Copy link
Contributor

@leculver leculver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here look good to me so far. Don't forget the timestamp issue we discussed, are you planning to tackle that here?

As a reminder, we talked about added a final total time duration to the last message of createdump no matter if the dump was successfully generated or not. For example, but please feel free to change the format:

[createdump] Dump successfully written in 00:00:60.0000196
[createdump] Writing dump FAILED in 00:00:03.0004650

This gives VS4M data around SigTerm/ForceQuit failing due to timeouts versus other reasons that the process "disappeared" (e.g. the OS killed it sooner than 3 seconds). As an additional benefit it will give us some extra telemetry around how long createdump is taking.

@mikem8361
Copy link
Contributor Author

I have main branch changes in PR #71569 that print the elapsed in the number of ms. I was hesitant to add those changes to this 6.0 servicing PR because of the amount of code. There are probably better/simpler ways to do this in C++ but I was leveraging the code the main branch already had for the logging timestamps.

@leculver
Copy link
Contributor

I have main branch changes in PR #71569 that print the elapsed in the number of ms. I was hesitant to add those changes to this 6.0 servicing PR because of the amount of code. There are probably better/simpler ways to do this in C++ but I was leveraging the code the main branch already had for the logging timestamps.

Sounds good, it can go in either branch as long as it's not forgotten. :)

@mikem8361
Copy link
Contributor Author

Lee, you are ok with the total elapsed time logging to go into only main/7.0? And as the number of milliseconds?

@leculver
Copy link
Contributor

Lee, you are ok with the total elapsed time logging to go into only main/7.0? And as the number of milliseconds?

Yep that's fine. It's not urgent, just something to implement wherever it makes sense to do so.

@mikem8361
Copy link
Contributor Author

As far as the SIGTERM ReadProcessMemory failure, I'm currently at a loss on what to do. This PR contains the SIGTERM change that launches createdump way more immediately in the sigterm handler. I could:

  1. Add the "result" error code from the vm_read_overwrite API in ReadProcessMemory to the failure message. I've already done this in main. This might tell us a little more about why it is failing.
  2. Log the each thread's suspend count when we resume the target process. This might tell us if any threads are running.
  3. Don't fail the write dump when ReadProcessMemory fails and continue to the next region. I really think it will just continue to fail all the rest of the reads.
  4. Any other ideas?

Should I go ahead I make/push any or all of this changes as an experiment?

@janvorli
Copy link
Member

It would be great to figure out what's going on, so the logging you've suggested would be helpful. If @sgmunn can repro the problem reasonably frequently, it would be better than leaving this for later and hitting the issue at 3rd party systems.

@mikem8361
Copy link
Contributor Author

@sgmunn can you try the SIGTERM test again with these new changes/build? Thanks.

@mikem8361
Copy link
Contributor Author

@sgmunn would you mind enabling verbose logging (if possible) with COMPlus_CreateDumpVerboseDiagnostics=1 ?

Mike McLaughlin added 2 commits July 12, 2022 01:19
Add target process terminated/alive message.

Smaller MacOS dumps. Don't add share_mode == SM_EMPTY regions.

Add crashreport success status message for VS4Mac.

Launch createdump from SIGTERM handler directly to reduce the time it takes to get the crash report/dump for VS4Mac.
@sgmunn
Copy link

sgmunn commented Jul 12, 2022

Slightly different repro steps, couldn't seem to get past a code sign issue, but I used the bits from the first build and then just createdump from this last build and enabled fully logging

https://gist.github.com/sgmunn/12e438af3e24b41392198b38cf1e2b97

@mikem8361
Copy link
Contributor Author

Your gist proves that the process was terminated before the memory could be written. The "alive" message says it is terminated and all the thread fail to get any status because the thread port is invalid.

I still don't have any clue how to fix or make this better.

@JulieLeeMSFT
Copy link
Member

@mikem8361 and @tommcdon, today is the due date for 6.0 servicing fix tactics approval and merge for August release. If ready in time, please send an email to get tactics approval.
CC @jeffschwMSFT.

@JulieLeeMSFT JulieLeeMSFT requested a review from tommcdon July 12, 2022 18:00
Copy link
Member

@tommcdon tommcdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mikem8361 mikem8361 added the Servicing-consider Issue for next servicing release review label Jul 12, 2022
Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. We will take for consideration in 6.0.x

Copy link
Member

@hoyosjs hoyosjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JulieLeeMSFT
Copy link
Member

Cc @carlossanlop

@JulieLeeMSFT JulieLeeMSFT added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Jul 13, 2022
@JulieLeeMSFT JulieLeeMSFT modified the milestones: 6.0.x, 6.0.9 Jul 13, 2022
@carlossanlop carlossanlop merged commit d7dd98f into dotnet:release/6.0 Jul 13, 2022
@danmoseley danmoseley modified the milestones: 6.0.9, 6.0.8 Jul 21, 2022
@mikem8361 mikem8361 deleted the dumpfixes60 branch July 26, 2022 01:53
@ghost ghost locked as resolved and limited conversation to collaborators Aug 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.