Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add profiling of already running tasks via SIGINFO/SIGUSR1 #43179

Merged
merged 1 commit into from
Feb 15, 2022

Conversation

IanButterworth
Copy link
Member

@IanButterworth IanButterworth commented Nov 21, 2021

Updates:

  • Now triggers off SIGINFO on BSD platforms via ctrl-t and SIGUSR1 on other platforms. Unfortunately ctrl-z was actually already used by some to force terminate julia, which is a more natural use of SIGTSTP and there's no other cross-platform keymap available. So on MacOS & FreeBSD you can ctrl-t. On Linux it's triggered by SIGUSR1 e.g. kill -USR1 $julia_pid
  • The report prints after the stacktrace that SIGINFO/SIGUSR1 already prints.
  • Expands the Profile docs. See those for up to date examples

Say you're in interactive mode, and you've started running some mixed async/threaded code, and it's taking too long, or something odd is happening, currently one option for debugging is to ctrl-c to interrupt the code and get a stacktrace, but that's a singular point in time, and can obscure things happening in different threads/tasks, and it terminates the code.

Profiling is useful, but again would require you to stop the code, and re-run having launched the code with @profile. Often you want to see what's happening this time around.

This PR is a demo/RFC to introduce ctrl-z to trigger an async 1-second Profile, and print the report without stopping the code. The profile report shows the new per-thread breakdown.
Because it doesn't stop the code it can be run multiple times at the user's discretion, perhaps to profile the different stages of a long running process?

Some thoughts:

  • ctrl-z raises SIGTSTP, which seemed like the only common posix signal/keymap remaining available across platforms. Is this a reasonable use of that? or is there a better fit?
  • One catch with the implementation here is that it's dependent on the running code having yield points to print the report. I wasn't sure how to directly call the julia code from signals-unix.c.
  • This required buffering the Profile.print() report, to make it print in one go rather than at multiple yield points
  • The length & print characteristics of the profiling could have adjustable defaults in Profile
  • The formatting of the profile report is tweaked to make the grouped version a bit more visually compact
  • This adds Profile.print to the REPL sysimage precompiler to make the first call fast
  • Duration can be changed via Profile.set_peek_duration(2.5)
  • This PR is more of a working demo. I'm sure it could be better executed.

Here, the code is run, then a few seconds later I press ctrl-z. The report is printed within a few seconds in this case

julia> while true
       rand(1000,1000) * rand(1000,1000)
       yield()
       end
^Z SIGTSTP received. Collecting a 1 second profile report...
Profile collected. A report will print at the next yield point.
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
Thread 1 Total snapshots: 609. Utilization: 100%
   ╎176 @Base/client.jl:506; _start()
   ╎ 176 @Base/client.jl:318; exec_options(opts::Base.JLOptions)
   ╎  176 @Base/client.jl:388; run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_fil...176 @Base/essentials.jl:729; invokelatest
   ╎    176 @Base/essentials.jl:731; #invokelatest#2176 @Base/client.jl:403; (::Base.var"#935#937"{Bool, Bool, Bool})(REPL::Module)
   ╎    ╎ 176 @REPL/src/REPL.jl:354; run_repl(repl::REPL.AbstractREPL, consumer::Any)
   ╎    ╎  176 @REPL/src/REPL.jl:367; run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_...
   ╎    ╎   176 @REPL/src/REPL.jl:230; start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
   ╎    ╎    176 @REPL/src/REPL.jl:245; repl_backend_loop(backend::REPL.REPLBackend)
   ╎    ╎     176 @REPL/src/REPL.jl:151; eval_user_input(ast::Any, backend::REPL.REPLBackend)
   ╎    ╎    ╎ 176 @Base/boot.jl:368; eval
   ╎    ╎    ╎  176 REPL[1]:2; top-level scope
   ╎    ╎    ╎   176 @Random/src/Random.jl:278; rand
   ╎    ╎    ╎    176 @Random/src/Random.jl:290; rand
   ╎    ╎    ╎     176 @Random/src/Random.jl:289; rand
   ╎    ╎    ╎    ╎ 74  @Base/boot.jl:469; Array
   ╎    ╎    ╎    ╎  74  @Base/boot.jl:462; Array
 74╎    ╎    ╎    ╎   74  @Base/boot.jl:454; Array
   ╎    ╎    ╎    ╎ 102 @Random/src/Random.jl:268; rand!
   ╎    ╎    ╎    ╎  102 @Random/src/XoshiroSimd.jl:292; rand!
   ╎    ╎    ╎    ╎   102 @Random/src/XoshiroSimd.jl:141; xoshiro_bulk
   ╎    ╎    ╎    ╎    3   @Random/src/XoshiroSimd.jl:248; xoshiro_bulk_simd(rng::Random.TaskLocalRNG, dst::Ptr{UInt8}, ...
  1╎    ╎    ╎    ╎     1   @Random/src/XoshiroSimd.jl:70; _plus
  2╎    ╎    ╎    ╎     2   @Random/src/XoshiroSimd.jl:57; _rotl23
   ╎    ╎    ╎    ╎    2   @Random/src/XoshiroSimd.jl:249; xoshiro_bulk_simd(rng::Random.TaskLocalRNG, dst::Ptr{UInt8}, ...
  2╎    ╎    ╎    ╎     2   @Random/src/XoshiroSimd.jl:64; _shl17
   ╎    ╎    ╎    ╎    2   @Random/src/XoshiroSimd.jl:252; xoshiro_bulk_simd(rng::Random.TaskLocalRNG, dst::Ptr{UInt8}, ...
  2╎    ╎    ╎    ╎     2   @Random/src/XoshiroSimd.jl:76; _xor
   ╎    ╎    ╎    ╎    2   @Random/src/XoshiroSimd.jl:253; xoshiro_bulk_simd(rng::Random.TaskLocalRNG, dst::Ptr{UInt8}, ...
  2╎    ╎    ╎    ╎     2   @Random/src/XoshiroSimd.jl:76; _xor
   ╎    ╎    ╎    ╎    11  @Random/src/XoshiroSimd.jl:255; xoshiro_bulk_simd(rng::Random.TaskLocalRNG, dst::Ptr{UInt8}, ...
 11╎    ╎    ╎    ╎     11  @Random/src/XoshiroSimd.jl:54; _rotl45
   ╎    ╎    ╎    ╎    82  @Random/src/XoshiroSimd.jl:256; xoshiro_bulk_simd(rng::Random.TaskLocalRNG, dst::Ptr{UInt8}, ...
   ╎    ╎    ╎    ╎     17  @Base/pointer.jl:118; unsafe_store!
 17╎    ╎    ╎    ╎    ╎ 17  @Base/pointer.jl:118; unsafe_store!
 65╎    ╎    ╎    ╎     65  @Random/src/XoshiroSimd.jl:107; _bits2float

Thread 2 Total snapshots: 609. Utilization: 0%
   ╎609 @Base/threadingconstructs.jl:178; (::Distributed.var"#133#135")()
   ╎ 609 @Distributed/src/remotecall.jl:269; macro expansion
   ╎  609 @Base/condition.jl:78; lock
   ╎   609 @Base/lock.jl:183; lock(f::Distributed.var"#134#136", l::ReentrantLock)
   ╎    609 @Distributed/src/remotecall.jl:271; #134609 @Base/condition.jl:124; wait(c::Base.GenericCondition{ReentrantLock})
   ╎    ╎ 609 @Base/task.jl:882; wait()
608╎    ╎  609 @Base/task.jl:873; poptask(W::Base.InvasiveLinkedListSynchronized{Task})

Thread 3 Total snapshots: 609. Utilization: 0%
   ╎609 @Base/task.jl:587; task_done_hook(t::Task)
   ╎ 609 @Base/task.jl:882; wait()
608609 @Base/task.jl:873; poptask(W::Base.InvasiveLinkedListSynchronized{Task})

@jonas-schulze
Copy link
Contributor

jonas-schulze commented Nov 22, 2021

What do you think of using different keys for this, e.g. ctrl-t? Recently, I was thinking to propose ctrl-z for moving the currently running code into the background and returning the corresponding Task, as if it had been started using @async. I don't know if the latter would be easily possible, but I think ctrl-z would be a better fit for that, given it's similar to the job system in bash.

Edit: if there is nothing running at the moment (i.e. the REPL is just displaying its prompt), then ctrl-z would behave normally, i.e. sending Julia itself into the background. I use ctrl-z somewhat frequently for this.

@antoine-levitt
Copy link
Contributor

Emacs reacts to SIGUSR2 to drop to a debugger to interrupt a stuck instance, maybe julia could also use this signal to trigger the profile (eg to interrupt a noninteractive session)

@IanButterworth
Copy link
Member Author

@jonas-schulze what you're suggesting does suit the SIGTSTP name a bit better.. It's just unfortunate that it seems like ctrl-z is the only commonly assigned keymapped signal that doesn't appear to be used by Julia. Unless I'm mistaken.

Regarding ctrl-t, on MacOS & FreeBSD that's assigned to SIGINFO. When julia gets a SIGINFO, or SIGUSR1, it dumps out an unformatted stack trace without terminating. It's reasonably similar to this.
As for SIGUSR2 that's what the Profile timer uses to record a profile snapshots automatically. I don't believe that's commonly keymapped.

I'd be keen to see what other people think trying this out. It seems to work pretty well, and I just added the report to the sysimage REPL precompiler so that the report is prepared quickly first time.

@IanButterworth
Copy link
Member Author

Also, with the last commit, if the code has no yield point then the profile is still done, but the report won't show until the process finishes/is interrupted

@jonas-schulze
Copy link
Contributor

jonas-schulze commented Nov 23, 2021

Another option would be to add the profiler output after the standard output of SIGINFO (or replace that? I don't know how useful it is by default). The keymapping would be handy, but I don't think this should impact the decision:

julia ... # ctrl-z
bg
kill -SIGINFO %1 # send SIGINFO to job 1
kill -SIGUSR1 %1
kill -SIGUSR2 %1

Copy link
Member

@jpsamaroo jpsamaroo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

src/signals-unix.c Outdated Show resolved Hide resolved
src/signals-unix.c Outdated Show resolved Hide resolved
base/task.jl Outdated Show resolved Hide resolved
src/signal-handling.c Outdated Show resolved Hide resolved
@IanButterworth IanButterworth changed the title RFC: Add a way to profile already running code Profile already running tasks via SIGINFO/SIGUSR1 Dec 4, 2021
@IanButterworth IanButterworth force-pushed the ib/profile_snap branch 3 times, most recently from dada8ab to 7901e4c Compare December 5, 2021 06:21
@IanButterworth
Copy link
Member Author

IanButterworth commented Dec 5, 2021

I think the linux64 failure is real, but I don't understand what's triggering it, and why only that platform

Error in testset Profile:
Test Failed at /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Profile/test/runtests.jl:165
  Expression: getline(values(fdictc)) == getline(values(fdict0)) + 2
   Evaluated: nothing == 24
Error in testset Profile:
Test Failed at /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Profile/test/runtests.jl:187
  Expression: parse(Int, s) > 100
   Evaluated: 48 > 100

It doesn't happen locally on linux64, neither with nor without rr.

It's repeatable

Error in testset Profile:
Test Failed at /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Profile/test/runtests.jl:187
  Expression: parse(Int, s) > 100
   Evaluated: 54 > 100

Copy link
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if there is a current @profile running and I trigger the 1 second profile?

base/Base.jl Outdated Show resolved Hide resolved
base/Base.jl Outdated Show resolved Hide resolved
@IanButterworth
Copy link
Member Author

What happens if there is a current @Profile running and I trigger the 1 second profile?

On BSDs the stack trace will be printed only, like it currently is for ctrl-t/SIGINFO.
Not sure on linux. Perhaps nothing.

Basically it won't trigger a profile if it's already profiling.

stdlib/Profile/src/Profile.jl Outdated Show resolved Hide resolved
@IanButterworth IanButterworth force-pushed the ib/profile_snap branch 2 times, most recently from b279cd8 to 3326631 Compare December 13, 2021 21:20
@IanButterworth
Copy link
Member Author

I think I figured out the linux64 problem.. running = 1 was being set after the SIGUSR1 timer was being started, so the first SIGUSR1 or two were being interpreted as user signals erroneously and setting the peek autostopper

@IanButterworth IanButterworth force-pushed the ib/profile_snap branch 2 times, most recently from 42b89da to 30b1109 Compare December 13, 2021 22:58
@IanButterworth
Copy link
Member Author

Bump @vtjnash 🙏

@IanButterworth IanButterworth added this to the 1.8 milestone Jan 4, 2022
@IanButterworth
Copy link
Member Author

I'm adding this as a 1.8 milestone as I think it would be good to get in, if approved of

@IanButterworth
Copy link
Member Author

IanButterworth commented Feb 12, 2022

Triage was supportive of this, but made the point that the print trigger mechanism should live in Profile rather than Base, so that is now the case.

@IanButterworth
Copy link
Member Author

@JeffBezanson @vtjnash, given the discussion in triage.

It would be great if this could get into 1.8 🙏🏻. Please let me know if there are blocking issues.

@IanButterworth
Copy link
Member Author

Great, thanks @vchuravy

@IanButterworth IanButterworth changed the title Profile already running tasks via SIGINFO/SIGUSR1 Add profiling of already running tasks via SIGINFO/SIGUSR1 Feb 15, 2022
@IanButterworth IanButterworth merged commit a9aad97 into JuliaLang:master Feb 15, 2022
@IanButterworth IanButterworth deleted the ib/profile_snap branch February 15, 2022 03:11
IanButterworth added a commit to IanButterworth/julia that referenced this pull request Feb 15, 2022
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request Feb 17, 2022
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request Feb 17, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants