Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add per-Task cpu time metric #56320

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

nickrobinson251
Copy link
Contributor

@nickrobinson251 nickrobinson251 commented Oct 24, 2024

Close #47351 (builds on top of #48416)

Adds to Task a metric for scheduled CPU time (i.e. amount of time we were actually running user code) and wall time (i.e. the amount of time between it being possible for us to run this task and the task entering a terminal state i.e. done or failed).

Are there other places we need to instrument? Potentially record_cpu_time! could go in wait() but putting it in yield() and wait(condition) seemed a little more accurate to me. I think maybe we need to add something still for async IO, but i wasn't exactly sure where to do that (which is maybe an argument in favour of putting the instrumentation in wait())? Since Channel, ReentrantLock, Event, Timer, Semaphore are all implemented in terms of wait(condition) i think instrumenting that one should be enough?

I'd love to add these metrics in a way that is extendable, so that in future PRs we could potentially add more metrics (TBD which ones would be desirable, but e.g. compilation time, GC time, allocations, potentially a wait-time breakdown (time waiting on locks, channels, in the run queue, etc.))

Timings are off be default but can be turned on globally via starting julia with --task-timing or calling Base.task_timing(true).

Perhaps in future there could be ways to enable this on a per-thread and per-task basis. And potentially in future these same timings could be used by @time (e.g. writing this same timing data to a ScopedValue like in #55103 but only for tasks lexically scoped to inside the @time block).

All names are TBD.

cc @NHDaly @kpamnany @d-netto

@nickrobinson251 nickrobinson251 marked this pull request as ready for review October 24, 2024 21:59
@giordano giordano added multithreading Base.Threads and related functionality needs news A NEWS entry is required for this change feature Indicates new feature / enhancement requests labels Oct 24, 2024
src/init.c Outdated Show resolved Hide resolved
@vilterp vilterp mentioned this pull request Oct 25, 2024
4 tasks
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. I believe you may need to add the reporting functions to public and docs, and the PR to NEWs

Copy link
Contributor

@kpamnany kpamnany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I'm not too satisfied with the metrics being directly in jl_task_t; hopefully @vtjnash has a suggestion for what to do there.

We now have to separate out spin time, wait time, compilation time, and GC time (in another PR I guess) and the task CPU time will be actual user CPU time.

This metric is only updated when the task yields or completes.
See also [`task_wall_time_ns`](@ref).

Will be `UInt64(0)` if task timings are not enabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this method should throw if timings are not enabled for this task instead of returning 0, which is a valid return value. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or missing or nothing are also possible options for that case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO you should not use unsigned integers for representing positive values - you should use Int64 for this instead.

It's better in julia than it is in C, since julia's defaults are less dangerous, but using unsigned ints for positive numeric values is a consistent source of bugs in C programs, and i think should be still frowned upon in julia. (https://jacobegner.blogspot.com/2019/11/unsigned-integers-are-dangerous.html)

IMO Unsigned ints should be used only for things like hashes, identifiers, addresses, and possibly for indexes but even for indexes i'd use signed ints.

But anyway, another benefit is that this also gives -1 as a possible return value for the unset case.

or until the current time if the task has not yet completed.
See also [`task_cpu_time_ns`](@ref).

Will be `UInt64(0)` if task timings are not enabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for consistency, although in this case 0 is clearly not a valid return value.

doc/man/julia.1 Outdated Show resolved Hide resolved
src/init.c Outdated Show resolved Hide resolved
src/jlapi.c Outdated Show resolved Hide resolved
src/jlapi.c Outdated
jl_atomic_fetch_add(&jl_task_timing_enabled, -1);
}

// TODO: remove this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on the TODO comment, but do please remove all TODOs.

src/jltypes.c Outdated
Comment on lines 3764 to 3768
"is_timing_enabled",
"first_enqueued_at",
"last_dequeued_at",
"cpu_time_ns",
"wall_time_ns"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these go in a separate struct called metrics or something @vtjnash? We won't be able to initialize that struct type before this, so it would have to be jl_any_type, but that would mean we'd have to alloc this in jl_new_task. I don't know how/if we can do an embedded struct here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a separate struct inlined here, but I don't see much value to that. More valuable would probably be marking each of these as atomic though (in jltypes.c) so that observing them isn't UB. Most reads can probably be monotonic, but some writes are probably release and some reads might be acquires

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay on the separation concern. And yeah, I guess it is necessary for these to be atomic. Will discuss with @nickrobinson251.

src/julia.h Outdated Show resolved Hide resolved
src/julia.h Outdated Show resolved Hide resolved
test/cmdlineargs.jl Outdated Show resolved Hide resolved
base/timing.jl Outdated
Comment on lines 137 to 145
function task_timing(b::Bool)
if b
ccall(:jl_task_timing_enable, Cvoid, ())
else
# TODO: prevent decrementing the counter below zero
ccall(:jl_task_timing_disable, Cvoid, ())
end
return nothing
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've put effort into making this flag reentrant, maybe you want to provide a with_task_timing_enabled(f) variant too? Not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, for this PR, i don't think so, because we don't have a use for it right now, and because (as of this PR) the only way to write such a function would be enabling the task-timings process-wide, whereas ideally i think such a function would only enable task-timings for task created by f

Comment on lines +1367 to +1373
@test cpu_time_i > 0
@test wall_time_i > 0
# task cpu-time cannot be greater than its wall-time
@test wall_time_i >= cpu_time_i
# task wall-time must be less than our manually measured wall-time
# between calling `@spawn` and returning from `wait`.
@test wall_time_delta_i > wall_time_i
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you're allowed to put @tests in spawned tasks, right? Or maybe you can, but the reporting is weird? I can't remember. Do you need to introduce a new @testset within this task?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently @test failures in a spawned task will cause the task to fail (and therefore the outer testset will error), but their pass/fail status won't be reported up to the enclosing/outer @testset so the reporting is a bit yanky but i think this is still fine way to test.

e.g. for a testset like this:

julia> @testset begin
           wait(Threads.@spawn @test 1==2)
           @test 3==3
       end

ideally it'd report 1 failure and 1 pass
but what it currently does is report a single error

Test Failed at REPL[2]:2
  Expression: 1 == 2
   Evaluated: 1 == 2

test set: Error During Test at REPL[2]:1
  Got exception outside of a @test
  TaskFailedException
   ...
      nested task error: There was an error during testing
Test Summary: | Error  Total  Time
test set      |     1      1  0.0s
ERROR: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.

Base.task_timing(false)
end
end

end # main testset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could add a bit more testing around edge cases, like if you create a Task but don't run it, what its numbers look like, or if you create a task and try to run it, but you provide a bad value for the function so it never actually runs, what happens to the numbers.

And what happens if a task throws an exception, do its numbers still get cleaned up correctly (they don't keep incrementing), etc.

Comment on lines +1327 to +1330
@test Base.task_cpu_time_ns(t) > 0
@test Base.task_wall_time_ns(t) > 0
@test Base.task_wall_time_ns(t) >= Base.task_cpu_time_ns(t)
@test wall_time_delta > Base.task_wall_time_ns(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a tiny test that these aren't still ticking up? That they're fixed to a value once the task ends.

base/task.jl Outdated
if t.is_timing_enabled
@assert t.last_dequeued_at != 0
t.cpu_time_ns += stopped_at - t.last_dequeued_at
t.last_dequeued_at = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you ever check/rely anywhere that this is reset to 0?

It seems unnecessary to me (and we want to keep task-switches as cheap as we can, even though this will probably be very cheap since we'll already have the task resident in-cache, but still writes to main memory are slowish). And maybe someone would even want to look at the last-ran-time even for a sleeping task if they were trying to debug things? Not sure though.

base/task.jl Outdated Show resolved Hide resolved
base/task.jl Outdated Show resolved Hide resolved
base/task.jl Outdated Show resolved Hide resolved
base/task.jl Show resolved Hide resolved
base/task.jl Outdated Show resolved Hide resolved
base/task.jl Outdated Show resolved Hide resolved
base/task.jl Outdated Show resolved Hide resolved
Comment on lines +2294 to +2295
// timestamp this task first entered the run queue (TODO: int32 of ms instead?)
uint64_t first_enqueued_at;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's valuable to also have created_at?

It might be interesting in some cases to see the difference between the created-at and first-scheduled-at timestamps? Maybe someone would be surprised to see that gap being large? Not sure though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, maybe, but i didn't particularly want to add things i wasn't adding a public API for... what do you think? worth adding? if so, worth adding a public API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 that it should either have a public API or not exist at all.

🤔 My instinct was to say "just add it," but i know that we want to keep the struct smallish if possible for data-locality so I think you're right to leave it off unless we have a use for it.

Copy link
Member

@NHDaly NHDaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coooooooool! 😁 Looks great from my perspective!

I'll let everyone else finish their reviews and your responses and such, but this LGTM from my perspective. 😎

src/task.c Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I was going to suggest that it might actually be nice to have an ascii-art state diagram for the state transitions you've added here...

... But I guess for now there'd only be two states: user_time and wait_time, with all the like 15 different arrows you've added.
... so maybe that's actually not super useful right now and it makes more sense to just grep the code base for -> .

Which actually, maybe you can just make them more searchable, by adding a keyword? Like:

# [Task timings]: user_time -task-finished-> wait_time

at the front of each arrow or something?

@gbaraldi
Copy link
Member

So one thing I was thinking of to avoid the branch is that maybe we should use one of the timing registers and unconditionally store the values. I.e use cycleclock which maps to RDTSC in x86 for example. That requires implementing the logic for that or using the llvm intrinsic https://llvm.org/docs/LangRef.html#llvm-readcyclecounter-intrinsic. @vtjnash what do you think?

@nickrobinson251
Copy link
Contributor Author

Thanks for all the great reviews so far! on wednesday i'll try to get this PR finished up for a final review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Indicates new feature / enhancement requests multithreading Base.Threads and related functionality needs news A NEWS entry is required for this change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

observability: add CPU time counter to tasks
8 participants