Conversation

@daxtens daxtens commented Jul 22, 2025

I had hoped to be able to optimise multithreaded behaviour. Unfortunately, I found it almost impossible to make heads or tails of what was going on. This PR is an attempt to unpick a couple of the complicated parts, replacing them with simpler and more documented implementations. As a happy side-effect, it's also a bit faster.

There are a set of 'frame contexts' for frames that we are decoding, and each frame context holds a dynamic set of tasks required to decode the frame. Some of these tasks can be executed in parallel. The tasks have dependencies, so not every task being tracked for a frame context is necessarily executable at any given point in time: tasks, and indeed entire frame contexts, can be 'blocked'.

Previously, these tasks were stored in an array in order of insertion. To execute them in priority order, each task stored the index of the next-highest-priority task, and the structure holding the tasks kept the index of the highest 'live' task. Remove this linked-list implementation; instead, implement `PartialOrd` and `Ord` for `Rav1dTask`, and use `Vec`'s search and sort methods to keep the tasks in order. Fascinatingly, despite seeming like more work (especially in terms of copying tasks around), this seems to go faster. (Possibly because we called `index` a lot? `index` takes a read lock on the vector each time.)
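As an illustrative sketch (not the real rav1d code: `Task`, `TaskType` and the variant names here are simplified stand-ins for `Rav1dTask` and its fields, borrowed from the review discussion), the sorted-`Vec` approach looks roughly like this:

```rust
use std::cmp::Ordering;

// Derived Ord on the enum follows declaration order, so TileEntropy
// (highest priority here) sorts first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum TaskType {
    TileEntropy,
    TileReconstruction,
    DeblockCols,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Task {
    type_0: TaskType,
    sby: u32,
    tile_idx: u32,
}

impl Ord for Task {
    fn cmp(&self, other: &Self) -> Ordering {
        // Lexicographic tuple comparison: `Less` means higher priority.
        (self.type_0, self.sby, self.tile_idx).cmp(&(other.type_0, other.sby, other.tile_idx))
    }
}

impl PartialOrd for Task {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// Keep the Vec sorted on insertion; duplicates are rejected, matching the
// PR's "no duplicate tasks" precondition.
fn insert_sorted(tasks: &mut Vec<Task>, task: Task) {
    match tasks.binary_search(&task) {
        Ok(_) => panic!("Tried to add an identical task!"),
        Err(idx) => tasks.insert(idx, task),
    }
}

fn main() {
    let mut tasks = Vec::new();
    insert_sorted(&mut tasks, Task { type_0: TaskType::DeblockCols, sby: 0, tile_idx: 0 });
    insert_sorted(&mut tasks, Task { type_0: TaskType::TileEntropy, sby: 1, tile_idx: 0 });
    insert_sorted(&mut tasks, Task { type_0: TaskType::TileEntropy, sby: 0, tile_idx: 2 });
    // The highest-priority (smallest) task is always at the front.
    assert_eq!(tasks[0], Task { type_0: TaskType::TileEntropy, sby: 0, tile_idx: 2 });
}
```

With this layout, "find the next executable task" is a linear scan from the front, and insertion is a binary search plus a `Vec::insert`.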

As checking to see if a task is blocked is not free, we also keep track of completely blocked frame contexts and the most recent frame context to have had progress. Because a frame can depend on information from a previous frame, if a frame context becomes unblocked, that and all subsequent frame contexts should be checked for executable tasks too. This process was tracked by the first, cur, and reset_cur_index variables and mostly managed by the reset_task_cur and reset_task_cur_async functions. These functions were complicated and their documentation was not particularly easy to follow. Replace them with simpler ones. The original ones also had a number of attempted optimisations, not all of which seem to be worth carrying over.

Finally, we make various passing optimisations:

  • it's not necessary to repeatedly mark the same frame context as having new tasks when we're merging pending tasks
  • if we can avoid repeatedly taking a read lock, even when we're using `try_read()`, so much the better.

I am in the process of doing a final set of performance measurements on my latest tweaks, and I will update this when I have final results. Previous testing at 10, 6 and 2 threads on my 6 core AMD Ryzen box with Windows 11 indicates improvements somewhere less than 1% for the 10 bit Chimera file. For the 8 bit file, the change was more in the noise, but at least not slower. A fair amount of variability is also present. I haven't tested single-threaded performance; I assume it will be unchanged.

I would mostly advocate for these changes in terms of their simplicity: it's nice that they're faster but even if they made no performance change at all I think the simplicity is worth the churn.

Other notes:

  • I'm still not 100% clear on how I should be formatting comments so my apologies if they need work.
  • I've structured this to be read commit-by-commit.
  • There is definitely more that could be done here: for example, there are definitely possible optimisations to the way that delayed film grain is handled. But I need to take a break for a while and I want to at least try to land something!

@kkysen kkysen self-requested a review July 22, 2025 14:42
@kkysen kkysen left a comment

Still reviewing ce64d3c and b4e4bb9, but here's some initial feedback. It looks very nice so far.

```rust
/// The task thread lock must be held and there must be more than 1 frame context.
pub(crate) fn advance_first(&self, n_fc: usize) {
    // This read-modify-write doesn't need to be an atomic CAS:
    // modifications are protected by the lock that we're holding.
```
Collaborator:

If we already have a lock for this, why is `SeqCst` needed?

Collaborator:

Especially because the `.load` and `.fetch_add(1)` aren't atomic together.

Collaborator:

Don't need to fix this right now, but I'm wondering.

Contributor Author (daxtens):

It's a good question. At the moment it's a direct move of the existing code. I did have a commit which I didn't include which made it Relaxed, but I didn't include it because it starts to get quite complex to tease out what is and isn't an access which might require a more rigorous ordering.

I'll have another look at including that commit in the series; the other thing I wanted to compare was putting `cur`, `reset_task_cur` and `first` in a lock all of their own so that there was just one single acquire/release pair of atomic operations, and it would simplify a lot of the code.
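A rough sketch of that single-lock alternative, with hypothetical names (`SchedState` is invented here; `advance_first` mirrors the snippet under review). Once all three counters live behind one mutex, the read-modify-write needs no atomics at all, so no cross-field ordering reasoning is required:

```rust
use std::sync::Mutex;

// Hypothetical grouping of the three scheduling counters under one lock.
#[allow(dead_code)]
struct SchedState {
    first: usize,
    cur: usize,
    reset_task_cur: usize,
}

struct TaskThreads {
    n_fc: usize, // number of frame contexts
    sched: Mutex<SchedState>,
}

impl TaskThreads {
    // The read-modify-write needs no CAS or atomic ordering:
    // it happens entirely under the lock.
    fn advance_first(&self) {
        let mut s = self.sched.lock().unwrap();
        s.first = (s.first + 1) % self.n_fc;
    }
}

fn main() {
    let tt = TaskThreads {
        n_fc: 3,
        sched: Mutex::new(SchedState { first: 2, cur: 0, reset_task_cur: 2 }),
    };
    tt.advance_first();
    // Advancing past the last frame context wraps around to 0.
    assert_eq!(tt.sched.lock().unwrap().first, 0);
}
```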

daxtens commented Jul 23, 2025

Thanks, when I get a chance I will apply that feedback.

daxtens commented Jul 24, 2025

Here's some performance figures from hyperfine, Windows 11, AMD Ryzen something or other with 6 cores, 12 hyperthreads. I ran with 4 warmup rounds. I ran the tests multiple times, some had outliers, others were just a bit 'weird', so I've just supplied 1 figure that I feel is at least somewhat representative of the common case. As such, please do take these with a definite grain of salt.

| File | Threads | main | daxtens/tasks | Improvement |
|--------|------------|---------------------|---------------------|-------------|
| 8 bit | 1 thread | 149.6 s ± 0.169 s | 148.5 s ± 0.219 s | 0.7% |
| 8 bit | 2 threads | 85.711 s ± 0.572 s | 85.249 s ± 0.556 s | 0.5% |
| 8 bit | 6 threads | 32.325 s ± 0.101 s | 32.195 s ± 0.140 s | 0.4% |
| 8 bit | 10 threads | 28.225 s ± 0.080 s | 28.123 s ± 0.067 s | 0.4% |
| 10 bit | 1 thread | 196.0 s ± 0.330 s | 194.8 s ± 0.224 s | 0.6% |
| 10 bit | 2 threads | 114.747 s ± 0.584 s | 114.160 s ± 0.516 s | 0.5% |
| 10 bit | 6 threads | 47.555 s ± 0.069 s | 47.123 s ± 0.104 s | 0.9% |
| 10 bit | 10 threads | 43.752 s ± 0.097 s | 43.648 s ± 0.163 s | 0.2% |

As you can see it's a bit noisy and random, but there is a measurable difference across inputs and thread counts.

I found it very odd that the single-thread case also went faster, but I haven't actually looked into how the single-thread case works; if it's the same in terms of finding and running tasks then it does make sense that these improvements would help there too.

daxtens added 6 commits July 26, 2025 16:31
This is both simpler to follow and also seems to be a bit faster.
We keep largely the same algorithm and variable meanings, but we make things a lot simpler
in how they're coded up, and we document a lot more.
When iterating tasks, we used to need to do certain things at the end of the loop.
We no longer need to do those, so we can get rid of a block and reduce indentation
and improve clarity.

Broken out from the other changes to make it clearer that the bulk of the change
is whitespace rather than substantive.
Even though `.try_read().unwrap()` in `.index()` and `.len()` is guaranteed to succeed,
it's not free. Try to do it less.

This requires that we have slightly more involved handling for when we have to insert
a new task as a result of finding a task, but it means that we don't have to take
the read lock every iteration of the loop.
daxtens commented Jul 26, 2025

Looks like your x86 runner needs an updated perf.

```rust
/// - `Less` if `self` is of higher priority than `other`
/// - `Greater` if `self` is of lower priority than `other`
///
/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
```
Collaborator:

Suggested change:

```diff
-/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
+/// This is fairly straightforwardly translated from `insert_tasks` in `task_thread.c`,
```

```rust
///
/// This is farily straightforwardly translated from `insert_tasks` in `task_thread.c`,
/// and as such inherits the limitations of that function. Specifically,
/// it requires that there are no Init, InitCdf or film grain tasks,
```
Collaborator:

Can you link these, too? The film grain tasks are just `FgPrep` and `FgApply`, right?

```rust
    return false;
}
let start = tasks.len() as u32;
tasks.extend(pending_tasks.drain(..));
```
Collaborator:

`tasks.append(pending_tasks)` should be more efficient.
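For comparison, a minimal illustration of the suggested `append`, which moves all elements in one shot and leaves the source vector empty, avoiding the per-element work of `extend(drain(..))`:

```rust
fn main() {
    let mut tasks: Vec<u32> = vec![1, 2];
    let mut pending: Vec<u32> = vec![3, 4];

    // `append` takes `&mut Vec` and bulk-moves its contents,
    // equivalent to `tasks.extend(pending.drain(..))` but cheaper.
    tasks.append(&mut pending);

    assert_eq!(tasks, [1, 2, 3, 4]);
    assert!(pending.is_empty());
}
```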

```rust
{
    let mut pending_tasks = self.pending_tasks.lock();
    if pending_tasks.len() == 0 {
        // This is a 'spurious' merge: that is, `merge` is set but the vector is empty.
```
Collaborator:

Is the optimization from skipping the actual merge (`.extend` and `.drain`, or more efficiently, `.append`), or from returning `false`?

Contributor Author (daxtens):

It's not an optimisation; it's to avoid a crash below when we access `tasks[new_start]`, which will panic if we added no tasks.

Collaborator:

Ahh.

```rust
}
// Unlike in the C code, we only do a single `insert_tasks`. There are two reasons:
//
// - in the C code, the `insert_tasks` is what creates the sorted linked list.
```
Collaborator:

nit:

Suggested change:

```diff
-// - in the C code, the `insert_tasks` is what creates the sorted linked list.
+// - In the C code, the `insert_tasks` is what creates the sorted linked list.
```

```rust
// and nothing is changed by trying to repeatedly do the same update.
let frame_idx = tasks[new_start].frame_idx;
self.insert_tasks(c, frame_idx, 0);
tasks.sort();
```
Collaborator:

I'm not sure, but is a stable sort needed here? I would think it isn't, as `fn cmp` doesn't even ever return `Equal`.

Contributor Author (daxtens):

Huh, yeah, we could use an unstable sort. When I get a chance I will swap that in and benchmark it.
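For illustration: when the ordering is total and never returns `Equal` for distinct elements, `sort_unstable` necessarily produces the same result as the stable `sort`, while avoiding the stable sort's temporary allocation:

```rust
fn main() {
    // Stand-in for tasks: tuples compare lexicographically, and no two
    // distinct tuples compare Equal, so stability cannot matter.
    let mut v = vec![(2, 1), (1, 2), (1, 1)];
    v.sort_unstable();
    assert_eq!(v, [(1, 1), (1, 2), (2, 1)]);
}
```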

```rust
/// and as such inherits the limitations of that function. Specifically,
/// it requires that there are no Init, InitCdf or film grain tasks,
/// and that there are no duplicate tasks.
fn cmp(&self, other: &Self) -> cmp::Ordering {
```
Collaborator:

You can do this in a separate PR (if I'm right), but I think this could be greatly simplified to this:

```rust
fn cmp(&self, other: &Self) -> cmp::Ordering {
    (self.type_0, self.sby, self.tile_idx).cmp(&(other.type_0, other.sby, other.tile_idx))
}
```

Where `TaskType` is also changed to delete the manual values (they don't seem to be used) and `TileEntropy` is moved to be first.

I haven't thought this through super closely to make sure I'm not messing anything up, but it would greatly simplify this and should make it easier to understand.

Can also be written as

```rust
impl Rav1dTask {
    fn cmp_fields(&self) -> impl Ord {
        (self.type_0, self.sby, self.tile_idx)
    }
}

impl Ord for Rav1dTask {
    fn cmp(&self, other: &Self) -> cmp::Ordering {
        self.cmp_fields().cmp(&other.cmp_fields())
    }
}
```

Collaborator:

I guess we do do `t.type_0 > TaskType::InitCdf`, which this order would change. But we can probably get around that and still have the much simpler `fn cmp`.

Collaborator:

Wait, if we're already assuming that there aren't any `Init` or `InitCdf`s, then `TileEntropy` is the first anyways. So no ordering changes needed.
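A small illustration of the point about derived enum ordering (variant names are assumed from this thread; the real `TaskType` has more variants): without manual discriminants, `#[derive(Ord)]` follows declaration order, so any desired priority is expressed purely by where a variant is declared.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum TaskType {
    // Derived Ord compares by declaration order: Init < InitCdf < TileEntropy < ...
    Init,
    InitCdf,
    TileEntropy,
    TileReconstruction,
}

fn main() {
    // Among the task types actually inserted (no Init/InitCdf),
    // TileEntropy already sorts first.
    assert!(TaskType::TileEntropy < TaskType::TileReconstruction);
    // The `t.type_0 > TaskType::InitCdf` comparison still works too.
    assert!(TaskType::TileEntropy > TaskType::InitCdf);
}
```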

Comment on lines +59 to +62
```rust
let idx = match tasks.binary_search(&task) {
    Ok(_) => panic!("Tried to add an identical task!"),
    Err(idx) => idx,
};
```
Collaborator:

The code didn't use to do this, right? Maybe just make it a `debug_assertions`-only check?
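One way that debug-only check might look, simplified here to `u32` tasks (the binary search is still needed to find the insertion point either way; only the duplicate panic becomes debug-only):

```rust
fn insert_sorted(tasks: &mut Vec<u32>, task: u32) {
    let idx = match tasks.binary_search(&task) {
        Ok(idx) => {
            // Duplicates violate the scheduler's invariant;
            // panic only in debug builds, as suggested.
            debug_assert!(false, "Tried to add an identical task!");
            idx
        }
        Err(idx) => idx,
    };
    tasks.insert(idx, task);
}

fn main() {
    let mut tasks = vec![10u32, 30];
    insert_sorted(&mut tasks, 20);
    assert_eq!(tasks, [10, 20, 30]);
}
```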

daxtens commented Jul 27, 2025

I'm back at work this week so it may be a little while before I am able to look at this more.

I also need to think a bit more about some of my analysis of the atomic ordering around `first` and `reset_task_cur`; I think the code may actually be safer than I thought with `SeqCst`; I think some of the stuff around `wrapping_sub` might actually not be needed unless we make some things `Relaxed`. I'll let you know what I manage to work out.

daxtens commented Jul 27, 2025

Huh, yeah, I think the code actually cannot have `first` and `reset_task_cur` fall out of sync. In other words, it's always the case that `reset_task_cur >= first`, except for a brief moment when `mark_new_task_for_index` and `advance_first` are running concurrently. Because `reset_cur_index` is always run under the same lock as `advance_first`, it's not possible for it to observe any case where `reset_task_cur` is less than `first`.

If we make `first` `Relaxed`, it's suddenly very easy for that to happen on a weakly ordered architecture.

I'll update this PR to remove the `wrapping_sub` from the existing commits; if I do get a `Relaxed` solution I'm happy with, I'll put it in a separate commit that adds the `wrapping_sub` back and adds the requisite explanation. I currently think that, given the way the old code worked, we would be OK with the slight loss of synchronisation that results from using `Relaxed`, but given how tricky this code is I'd want to think more carefully. Using two variables with `SeqCst` like this is not a pattern I'm used to, so it's tricky for me to reason about.

@kkysen kkysen left a comment

> Looks like your x86 runner needs an updated perf.

@thedataking, can you update the kernel headers perf needs?

See the error here: https://github.com/memorysafety/rav1d/actions/runs/16540023760/job/46779890408.

thedataking commented Aug 13, 2025

> Looks like your x86 runner needs an updated perf.
>
> See the error here: https://github.com/memorysafety/rav1d/actions/runs/16540023760/job/46779890408.

As far as I can tell, we have the right packages (linux-tools-generic) installed, but we are hitting the bug described here: https://bugs.launchpad.net/ubuntu/+source/linux-hwe-6.14/+bug/2117147

Update: I removed linux-tools-generic and built perf directly from an upstream kernel. might work now 🤞

@kkysen kkysen left a comment

This improves performance on an 80-core arm machine running with 150 threads by 6.7%, going from 34% on main to 28% here (% is overhead compared to dav1d). So I definitely think this is something we should ultimately merge and further explore.

```
./benchmark.py --threads 150 --commit main..daxtens/tasks
```

all commits in main..daxtens/tasks, recursing into ones > 1.0%:

| commit # | commit hash | threads | diff % | rav1d s | dav1d s |
|----------|-------------|---------|--------|---------|---------|
| 0 | d4807cae | 150 | 34.0% | 0.915 s | 0.683 s |
| 1 | 69be19bf | 150 | 32.5% | 0.907 s | 0.684 s |
| 2 | d64ddd08 | 150 | 30.9% | 0.897 s | 0.685 s |
| 3 | 5af4d9e7 | 150 | 31.0% | 0.897 s | 0.685 s |
| 4 | 54456cbd | 150 | 30.5% | 0.895 s | 0.686 s |
| 5 | fdc68f31 | 150 | 29.4% | 0.885 s | 0.684 s |
| 6 | 61026da6 | 150 | 27.9% | 0.875 s | 0.684 s |

sorted adjacent diffs of diffs:

| diff of diff % | commit # | commit hash | threads | diff % | rav1d s | dav1d s |
|----------------|----------|-------------|---------|--------|---------|---------|
| -1.6% | 2 | d64ddd08 | 150 | 30.9% | 0.897 s | 0.685 s |
| -1.5% | 6 | 61026da6 | 150 | 27.9% | 0.875 s | 0.684 s |
| -1.5% | 1 | 69be19bf | 150 | 32.5% | 0.907 s | 0.684 s |
| -1.1% | 5 | fdc68f31 | 150 | 29.4% | 0.885 s | 0.684 s |
| -0.5% | 4 | 54456cbd | 150 | 30.5% | 0.895 s | 0.686 s |
| +0.0% | 3 | 5af4d9e7 | 150 | 31.0% | 0.897 s | 0.685 s |

daxtens commented Sep 30, 2025 via email
