@d-netto d-netto commented Jan 27, 2025

PR Description

Backports JuliaLang#57045.

I needed to make some minor adjustments in jl_gc_wait_for_the_world, because 1.10 spins while waiting for threads to reach the safepoint, whereas 1.12 sleeps on a condition.

Marking this as a draft until I make the remaining adjustments (e.g. pretty-printing the backtraces as JSON) needed to make sure this shows up on DD.

Checklist

Requirements for merging:

…7045)

This is still a work in progress, but it should help determine what a
straggler thread was doing during the stop-the-world phase and why it
failed to reach a safepoint in a timely manner.

We've encountered long TTSP issues in production, and this tool should
provide a valuable means to accurately diagnose them.
@d-netto d-netto requested review from NHDaly and kpamnany January 27, 2025 19:47
@github-actions github-actions bot added port-to-v1.10 port-to-v1.12 This change should apply to Julia v1.12 builds labels Jan 27, 2025
@d-netto d-netto marked this pull request as draft January 27, 2025 19:47
@d-netto d-netto removed the port-to-v1.12 This change should apply to Julia v1.12 builds label Jan 27, 2025

@NHDaly NHDaly left a comment

As we discussed, I think you should merge this as-is and then make any other changes in a separate commit, so that we can keep the backport separate from the RAI-specific changes we'd want to keep.

@d-netto d-netto marked this pull request as ready for review February 2, 2025 21:41
d-netto commented Feb 2, 2025

Ran the MWE with julia -t4 --timeout-for-safepoint-straggler=1:

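function main()
    # Spawn a task that blocks in a foreign call for 5 s, so its thread
    # cannot reach a safepoint until the call returns.
    t = Threads.@spawn begin
        ccall(:uv_sleep, Cvoid, (Cuint,), 5000)
    end
    # Give the task 1 s to start, then force a GC.
    ccall(:uv_sleep, Cvoid, (Cuint,), 1000)
    GC.gc()
    wait(t)
end
main()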

And got:

===== Thread 4 failed to reach safepoint after 1 seconds, printing backtrace below =====
thread (1) __semwait_signal at /usr/lib/system/libsystem_kernel.dylib (unknown line)
===== Thread 4 failed to reach safepoint after 1 seconds, printing backtrace below =====
thread (1) __semwait_signal at /usr/lib/system/libsystem_kernel.dylib (unknown line)
===== Thread 4 failed to reach safepoint after 1 seconds, printing backtrace below =====
thread (1) __semwait_signal at /usr/lib/system/libsystem_kernel.dylib (unknown line)
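
For contrast, here is a rough sketch (not part of this PR) that times GC.gc() while another task is either blocked in the same uv_sleep foreign call or parked in Julia's own sleep. The helper names time_gc_during, blocking, and yielding are hypothetical, and the sketch assumes, as in the MWE above, that a thread inside the ccall cannot reach a safepoint until the call returns. It also assumes Julia was started with more than one thread (e.g. -t4); timings will vary by machine.

```julia
# Hypothetical helper (not from the PR): time a forced GC while `work`
# runs in a spawned task.
function time_gc_during(work)
    t = Threads.@spawn work()
    # Give the spawned task a moment to start on another thread.
    ccall(:uv_sleep, Cvoid, (Cuint,), 100)
    elapsed = @elapsed GC.gc()
    wait(t)
    return elapsed
end

# Blocks inside a foreign call, as in the MWE: the thread is assumed to be
# unable to reach a safepoint until uv_sleep returns.
blocking() = ccall(:uv_sleep, Cvoid, (Cuint,), 2000)

# Parks the task in the Julia scheduler instead, so the thread is not
# stuck in foreign code while the GC waits for the world to stop.
yielding() = sleep(2)

println("GC.gc() with blocking ccall: ", time_gc_during(blocking), " s")
println("GC.gc() with yielding sleep: ", time_gc_during(yielding), " s")
```

Under those assumptions, the first timing should be dominated by the remaining uv_sleep time, which is the kind of long TTSP the straggler backtrace is meant to explain.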

@d-netto d-netto merged commit 1fff8cc into v1.10.2+RAI Feb 2, 2025
5 checks passed
@d-netto d-netto deleted the dcn-stw-straggler-backtrace branch February 2, 2025 21:47
nickrobinson251 pushed a commit that referenced this pull request Feb 26, 2025
…7045) (#208)
