Simplify and slightly optimise how tasks are stored and selected #1447
Conversation
```rust
/// The task thread lock must be held and there must be more than 1 frame context.
pub(crate) fn advance_first(&self, n_fc: usize) {
    // This read-modify-write doesn't need to be an atomic CAS:
    // modifications are protected by the lock that we're holding.
```
If we already have a lock for this, why is `SeqCst` needed?
Especially because the `.load()` and `.fetch_add(1)` aren't atomic together.
Don't need to fix this right now, but I'm wondering.
It's a good question. At the moment it's a direct move of the existing code. I did have a commit, which I didn't include, that made it `Relaxed`; I left it out because it starts to get quite complex to tease out what is and isn't an access that might require a more rigorous ordering.

I'll have another look at including that commit in the series. The other thing I wanted to compare was putting `cur`, `reset_task_cur` and `first` in a lock all of their own, so that there was just one single acquire/release pair of atomic operations; it would also simplify a lot of the code.
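A minimal sketch of that grouping idea, assuming invented struct and field names (only `cur` and `first` come from the discussion above):

```rust
use std::sync::Mutex;

// Hypothetical sketch: keep the scheduling cursors behind one Mutex so
// that a single lock acquire/release pair orders every access, instead
// of separate SeqCst atomics.
struct Cursors {
    first: usize,
    cur: usize,
}

struct TaskThreadData {
    cursors: Mutex<Cursors>,
}

impl TaskThreadData {
    // The mutex already provides the happens-before edge, so this
    // read-modify-write needs no atomic CAS at all.
    fn advance_first(&self, n_fc: usize) {
        let mut c = self.cursors.lock().unwrap();
        c.first = (c.first + 1) % n_fc;
    }
}
```

Whether this is actually faster than the atomics would need benchmarking; the point is only that the lock alone is enough for correctness.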
Thanks, when I get a chance I will apply that feedback.
Here are some performance figures from …

As you can see it's a bit noisy and random, but there is a measurable difference across inputs and thread counts. I found it very odd that the single-thread case also went faster; I haven't actually looked into how the single-thread case works, but if it's the same in terms of finding and running tasks then it does make sense that these improvements would help there too.
This is simpler to follow and also seems to be a bit faster.
No behavioural change.
We keep largely the same algorithm and variable meanings, but we make things a lot simpler in how they're coded up, and we document a lot more.
When iterating tasks, we used to need to do certain things at the end of the loop. We no longer need to do those, so we can get rid of a block and reduce indentation and improve clarity. Broken out from the other changes to make it clearer that the bulk of the change is whitespace rather than substantive.
Even though `.try_read().unwrap()` in `.index()` and `.len()` is guaranteed to succeed, it's not free. Try to do it less. This requires that we have slightly more involved handling for when we have to insert a new task as a result of finding a task, but it means that we don't have to take the read lock every iteration of the loop.
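The commit's idea can be sketched like this (the function and its contents are illustrative, not the actual rav1d code):

```rust
use std::sync::RwLock;

// Sketch: instead of re-acquiring the read lock on every loop iteration
// (as `.index()`/`.len()` each did via `.try_read().unwrap()`), take one
// read guard up front and reuse it for the whole traversal.
fn count_ready(tasks: &RwLock<Vec<bool>>) -> usize {
    let guard = tasks.try_read().unwrap(); // single lock acquisition
    guard.iter().filter(|&&ready| ready).count()
}
```

The guard is dropped at the end of the function, so anything that needs the write lock (such as inserting a newly found task) has to happen after the loop, which is the "slightly more involved handling" the commit message mentions.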
```rust
/// - `Less` if `self` is of higher priority than `other`
/// - `Greater` if `self` is of lower priority than `other`
///
/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
```
```diff
-/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
+/// This is fairly straightforwardly translated from `insert_tasks` in `task_thread.c`,
```
```rust
///
/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
/// and as such inherits the limitations of that function. Specifically,
/// it requires that there are no Init, InitCdf or film grain tasks,
```
Can you link these, too? The film grain tasks are just `FgPrep` and `FgApply`, right?
```rust
    return false;
}
let start = tasks.len() as u32;
tasks.extend(pending_tasks.drain(..));
```
`tasks.append(pending_tasks)` should be more efficient.
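For illustration, the two forms side by side on plain `u32` tasks (a simplification of the real task type):

```rust
// Both move every element of `pending` onto the end of `tasks` and leave
// `pending` empty. `append` can do it as one bulk move of the elements,
// rather than draining them one by one through an iterator.
fn merge_extend(tasks: &mut Vec<u32>, pending: &mut Vec<u32>) {
    tasks.extend(pending.drain(..));
}

fn merge_append(tasks: &mut Vec<u32>, pending: &mut Vec<u32>) {
    tasks.append(pending);
}
```

Since the behaviour is identical, the swap is purely an (expected) efficiency improvement.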
```rust
{
    let mut pending_tasks = self.pending_tasks.lock();
    if pending_tasks.len() == 0 {
        // This is a 'spurious' merge: that is, `merge` is set but the vector is empty.
```
Is the optimization from skipping the actual merge (`.extend` and `.drain`, or more efficiently, `.append`), or from returning `false`?
It's not an optimisation, it's to avoid a crash below when we access `tasks[new_start]`, which will panic if we added no tasks.
Ahh.
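A reduced model of the guard being discussed (function and types are simplified stand-ins for the real code):

```rust
// With no pending tasks, `tasks[start]` below would index one past the
// end and panic, so a 'spurious' merge must bail out early.
fn merge_pending(tasks: &mut Vec<u32>, pending: &mut Vec<u32>) -> bool {
    if pending.is_empty() {
        return false; // `merge` was set but there is nothing to add
    }
    let start = tasks.len();
    tasks.append(pending);
    let _first_new = tasks[start]; // in bounds: at least one task was added
    true
}
```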
```rust
}
// Unlike in the C code, we only do a single `insert_tasks`. There are two reasons:
//
// - in the C code, the `insert_tasks` is what creates the sorted linked list.
```
nit:

```diff
-// - in the C code, the `insert_tasks` is what creates the sorted linked list.
+// - In the C code, the `insert_tasks` is what creates the sorted linked list.
```
```rust
// and nothing is changed by trying to repeatedly do the same update.
let frame_idx = tasks[new_start].frame_idx;
self.insert_tasks(c, frame_idx, 0);
tasks.sort();
```
I'm not sure, but is a stable sort needed here? I would think it isn't, as `fn cmp` doesn't even ever return `Equal`.
Huh, yeah, we could use an unstable sort. When I get a chance I will swap that in and benchmark it.
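The reasoning can be checked on a toy example: if the comparator never returns `Ordering::Equal`, no two elements tie, so there is no equal run for a stable sort to preserve and `sort()` and `sort_unstable()` must agree.

```rust
// Sort the same data both ways; with a total order and no ties, the two
// results are identical, so the unstable sort is a safe drop-in swap.
fn sorted_both_ways(v: &[u32]) -> (Vec<u32>, Vec<u32>) {
    let mut stable = v.to_vec();
    let mut unstable = v.to_vec();
    stable.sort();
    unstable.sort_unstable();
    (stable, unstable)
}
```

`sort_unstable` is typically faster and allocates no auxiliary memory, which is why it is worth benchmarking here.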
```rust
/// and as such inherits the limitations of that function. Specifically,
/// it requires that there are no Init, InitCdf or film grain tasks,
/// and that there are no duplicate tasks.
fn cmp(&self, other: &Self) -> cmp::Ordering {
```
You can do this in a separate PR (if I'm right), but I think this could be greatly simplified to this:

```rust
fn cmp(&self, other: &Self) -> cmp::Ordering {
    (self.type_0, self.sby, self.tile_idx).cmp(&(other.type_0, other.sby, other.tile_idx))
}
```

Where `TaskType` is also changed to delete the manual values (they don't seem to be used) and `TileEntropy` is moved to be first.

I haven't thought this through super closely to make sure I'm not messing anything up, but it would greatly simplify this and should make it easier to understand.
Can also be written as:

```rust
impl Rav1dTask {
    fn cmp_fields(&self) -> impl Ord {
        (self.type_0, self.sby, self.tile_idx)
    }
}

impl Ord for Rav1dTask {
    fn cmp(&self, other: &Self) -> cmp::Ordering {
        self.cmp_fields().cmp(&other.cmp_fields())
    }
}
```
I guess we do do `t.type_0 > TaskType::InitCdf`, which this order would change. But we can probably get around that and still have the much simpler `fn cmp`.
Wait, if we're already assuming that there aren't any `Init`s or `InitCdf`s, then `TileEntropy` is first anyway. So no ordering changes needed.
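A reduced model of the suggestion, with invented variant names (only `TileEntropy`, `type_0`, `sby` and `tile_idx` come from the discussion): deriving `Ord` on the enum gives declaration order, so with `TileEntropy` first the whole comparison collapses to one tuple `cmp`.

```rust
use std::cmp::Ordering;

// Hypothetical sketch: with the manual discriminant values deleted and
// TileEntropy declared first, derived Ord orders entropy tasks ahead of
// everything else, and fields break ties lexicographically.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
enum TaskType {
    TileEntropy,
    TileReconstruction,
}

#[derive(PartialEq, Eq)]
struct Task {
    type_0: TaskType,
    sby: u32,
    tile_idx: u32,
}

impl Ord for Task {
    fn cmp(&self, other: &Self) -> Ordering {
        (self.type_0, self.sby, self.tile_idx).cmp(&(other.type_0, other.sby, other.tile_idx))
    }
}

impl PartialOrd for Task {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```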
```rust
let idx = match tasks.binary_search(&task) {
    Ok(_) => panic!("Tried to add an identical task!"),
    Err(idx) => idx,
};
```
The code didn't use to do this, right? Maybe just make it a `debug_assertions`-only check?
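A sketch of what that `debug_assertions`-only variant could look like, on simplified `u32` tasks (not the actual proposed patch):

```rust
// The duplicate check fires only in debug builds via debug_assert!;
// release builds just insert at the position binary_search reports.
fn insert_sorted(tasks: &mut Vec<u32>, task: u32) {
    let idx = match tasks.binary_search(&task) {
        Ok(idx) => {
            debug_assert!(false, "tried to add an identical task");
            idx
        }
        Err(idx) => idx,
    };
    tasks.insert(idx, task);
}
```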
I'm back at work this week so it may be a little while before I am able to look at this more. I also need to think a bit more about some of my analysis of the atomic ordering around […]
Huh, yeah, I think the code actually cannot have […]. If we make […], I'll update this PR to remove the […].
Looks like your x86 runner needs an updated `perf`.

@thedataking, can you update the kernel headers `perf` needs? See the error here: https://github.com/memorysafety/rav1d/actions/runs/16540023760/job/46779890408.
As far as I can tell, we have the right packages ([…]). Update: I removed […]
This improves performance on an 80-core arm machine running with 150 threads by 6.7%, going from 34% on `main` to 28% here (% is overhead compared to `dav1d`). So I definitely think this is something we should ultimately merge and further explore.

```shell
./benchmark.py --threads 150 --commit main..daxtens/tasks
```

all commits in `main..daxtens/tasks`, recursing into ones > 1.0%:
| commit # | commit hash | threads | diff % | rav1d s | dav1d s | error |
|---|---|---|---|---|---|---|
| 0 | d4807cae | 150 | 34.0% | 0.915 s | 0.683 s | |
| 1 | 69be19bf | 150 | 32.5% | 0.907 s | 0.684 s | |
| 2 | d64ddd08 | 150 | 30.9% | 0.897 s | 0.685 s | |
| 3 | 5af4d9e7 | 150 | 31.0% | 0.897 s | 0.685 s | |
| 4 | 54456cbd | 150 | 30.5% | 0.895 s | 0.686 s | |
| 5 | fdc68f31 | 150 | 29.4% | 0.885 s | 0.684 s | |
| 6 | 61026da6 | 150 | 27.9% | 0.875 s | 0.684 s | |
sorted adjacent diffs of diffs:

| diff of diff % | commit # | commit hash | threads | diff % | rav1d s | dav1d s |
|---|---|---|---|---|---|---|
| -1.6% | 2 | d64ddd08 | 150 | 30.9% | 0.897 s | 0.685 s |
| -1.5% | 6 | 61026da6 | 150 | 27.9% | 0.875 s | 0.684 s |
| -1.5% | 1 | 69be19bf | 150 | 32.5% | 0.907 s | 0.684 s |
| -1.1% | 5 | fdc68f31 | 150 | 29.4% | 0.885 s | 0.684 s |
| -0.5% | 4 | 54456cbd | 150 | 30.5% | 0.895 s | 0.686 s |
| +0.0% | 3 | 5af4d9e7 | 150 | 31.0% | 0.897 s | 0.685 s |
Ooo that’s cool!!
I’m sorry that between work and my personal life I haven’t had any time to look back at this. I am hoping to get back to it at some point but can’t make any promises on timeframe. I am also very happy for anyone else to take it on; I’m not precious about credit or prize money.
I had hoped to be able to optimise multithreaded behaviour. Unfortunately, I found it almost impossible to make heads or tails of what was going on. This PR is an attempt to unpick a couple of the complicated parts, replacing them with simpler and more documented implementations. As a happy side-effect, it's also a bit faster.
There are a set of 'frame contexts' for frames that we are decoding, and each frame context holds a dynamic set of tasks required to decode the frame. Some of these tasks can be executed in parallel. The tasks have dependencies, so not every task being tracked for a frame context is necessarily executable at any given point in time: tasks, and indeed entire frame contexts, can be 'blocked'.
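The structure described above can be sketched in miniature (all names here are invented for illustration, not the real rav1d types):

```rust
// Hypothetical reduced model: several frame contexts, each holding a
// dynamic set of tasks; a task can be blocked on a dependency, and a
// whole frame context is blocked when none of its tasks can run.
struct Task {
    blocked: bool,
}

struct FrameContext {
    tasks: Vec<Task>,
}

impl FrameContext {
    // First task that is currently executable, if any.
    fn next_executable(&self) -> Option<&Task> {
        self.tasks.iter().find(|t| !t.blocked)
    }

    fn is_blocked(&self) -> bool {
        self.next_executable().is_none()
    }
}
```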
Previously, these tasks were stored in an array in order of insertion. To execute them in priority order, each task stored the index of the next highest priority task, and the structure holding the tasks kept the index of the highest 'live' task. Remove this linked-list implementation, and instead implement `PartialOrd` and `Ord` for `Rav1dTask`, and use Rust's `Vec` search and sort methods to keep the tasks in order. Fascinatingly, despite seeming like more work (especially in terms of copying tasks around), this seems to go faster. (Possibly because we called `index` a lot? `index` takes a read lock on the vector each time.)

As checking to see if a task is blocked is not free, we also keep track of completely blocked frame contexts and the most recent frame context to have had progress. Because a frame can depend on information from a previous frame, if a frame context becomes unblocked, that and all subsequent frame contexts should be checked for executable tasks too. This process was tracked by the `first`, `cur`, and `reset_cur_index` variables and mostly managed by the `reset_task_cur` and `reset_task_cur_async` functions. These functions were complicated and their documentation was not particularly easy to follow. Replace them with simpler ones. The original ones also had a number of attempted optimisations, not all of which seem to be worth carrying over.

Finally, we make various passing optimisations:

- … `try_read()` … so much the better.

I am in the process of doing a final set of performance measurements on my latest tweaks, and I will update this when I have final results. Previous testing at 10, 6 and 2 threads on my 6-core AMD Ryzen box with Windows 11 indicates improvements somewhere less than 1% for the 10-bit Chimera file. For the 8-bit file, the change was more in the noise, but at least not slower. A fair amount of variability is also present. I haven't tested single-threaded performance; I assume it will be unchanged.
I would mostly advocate for these changes in terms of their simplicity: it's nice that they're faster but even if they made no performance change at all I think the simplicity is worth the churn.
Other notes: